From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

Published: 2024-07-09 Issue: 8 Volume: 20 Page: 1-24
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Shaikh Muhammad Bilal¹, Chai Douglas², Islam Syed Muhammad Shamsul³, Akhtar Naveed⁴

Affiliation:

1. School of Engineering, Edith Cowan University, Joondalup, Australia and Molycop, Balcatta, Australia

2. School of Engineering, Edith Cowan University, Joondalup, Australia

3. School of Science, Edith Cowan University, Joondalup, Australia

4. The University of Melbourne, Melbourne, Australia

Abstract

Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it with multimodal data leads to superior performance compared to relying on a single data modality. During the adoption of deep learning for visual modelling in the past decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of ‘fusing’ the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of the MHAR approaches. We analyze the classic and emerging techniques in this regard, while also highlighting the popular trends in the adaptation of CNN and Transformer building blocks for the overall problem. In particular, we emphasize recent design choices that have led to more efficient MHAR models. Unlike existing reviews, which discuss Human Action Recognition from a broad perspective, this survey is specifically aimed at pushing the boundaries of MHAR research by identifying promising architectural and fusion design choices to train practicable models. We also provide an overview of the multimodal datasets from the viewpoint of their scale and evaluation. Finally, building on the reviewed literature, we discuss the challenges and future avenues for MHAR.

Funder

Edith Cowan University (ECU) and the Higher Education Commission (HEC) of Pakistan

Office of National Intelligence National Intelligence Postdoctoral

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3664815
