ACM ICMR 2022 Special Session

Transformer-based Multimedia Understanding: Model Design, Learning, Distillation

A sound understanding of human videos and images is a prerequisite for subsequent tasks, such as interacting intelligently with humans. Typical visual understanding tasks include human pose estimation, pedestrian tracking, action recognition, and motion prediction. Recent studies have concentrated not only on developing high-accuracy understanding models but also on maximizing their efficiency. The real world, however, is a complex scenario. The quality of videos and images is often degraded by factors such as noise, blur, haze, and low light. Moreover, even when the environment is favorable, human factors can hinder understanding, such as makeup, occlusion, ambiguous expressions, and abnormal expressions (special crowd). All these situations pose great challenges for high-quality inference.

To address these problems, there are promising technologies such as the Transformer, the self-attention mechanism, model self-distillation, and the event camera. The vision transformer has a larger receptive field than conventional convolutional models, which helps it capture global context in difficult inputs. Self-attention can focus on the most relevant regions depending on the input. Model distillation promises a model that is efficient yet still accurate enough to be usable. The emerging event camera brings a new perspective to expression and action recognition.

This special session aims to bring together researchers and professionals from academia and industry around the world to showcase, discuss, and review the whole spectrum of technological opportunities, challenges, solutions, and emerging applications in efficient yet robust intelligent multimedia understanding in complex scenarios using state-of-the-art deep learning techniques.

Topics of particular interest include, but are not limited to:

 - Transformers for image pre-processing;
 - Intelligent multimedia understanding in makeup or occlusion scenarios;
 - Slim model design through model distillation;
 - Model distillation and transformer design for special crowd expressions;
 - Few-shot or meta action learning for the special crowd;
 - Self-attention mechanism design for facial expression recognition;
 - Model distillation design for facial expressions lacking labels;
 - Video transformers for human expression and action recognition;
 - Transformer design for event cameras.

Maximum Length of a Paper

Each full paper should be limited to 6-8 pages (6-page limit plus references).

Important Dates

Paper Submission: January 30, 2022 (extended from January 20, 2022)
Notification of Acceptance: April 3, 2022
Camera-Ready Papers Due: April 17, 2022

Submission Instructions

See the ICMR 2022 Paper submission section.


Organizers

 - Lin Li, Prof., Wuhan University of Technology, China
 - Jianquan Liu, Dr., NEC Corporation, Japan
 - Meng Fang, Dr., Eindhoven University of Technology, Netherlands
 - Yang Wang, Prof., Hefei University of Technology, China