From CNNs to Transformers in Multimodal Human Action Recognition: A Survey

Author:

Shaikh Muhammad Bilal (1), Chai Douglas (2), Islam Syed Muhammad Shamsul (3), Akhtar Naveed (4)

Affiliation:

1. School of Engineering, Edith Cowan University, Joondalup, Australia and Molycop, Balcatta, Australia

2. School of Engineering, Edith Cowan University, Joondalup, Australia

3. School of Science, Edith Cowan University, Joondalup, Australia

4. The University of Melbourne, Melbourne, Australia

Abstract

Due to its widespread applications, human action recognition is one of the most widely studied research problems in Computer Vision. Recent studies have shown that addressing it using multimodal data leads to superior performance compared to relying on a single data modality. During the adoption of deep learning for visual modelling in the past decade, action recognition approaches have mainly relied on Convolutional Neural Networks (CNNs). However, the recent rise of Transformers in visual modelling is now also causing a paradigm shift for the action recognition task. This survey captures this transition while focusing on Multimodal Human Action Recognition (MHAR). Unique to the induction of multimodal computational models is the process of ‘fusing’ the features of the individual data modalities. Hence, we specifically focus on the fusion design aspects of MHAR approaches. We analyze the classic and emerging techniques in this regard, while also highlighting the popular trends in the adaptation of CNN and Transformer building blocks for the overall problem. In particular, we emphasize recent design choices that have led to more efficient MHAR models. Unlike existing reviews, which discuss Human Action Recognition from a broad perspective, this survey is specifically aimed at pushing the boundaries of MHAR research by identifying promising architectural and fusion design choices to train practicable models. We also provide an overview of the multimodal datasets from the viewpoint of their scale and evaluation. Finally, building on the reviewed literature, we discuss the challenges and future avenues for MHAR.

Funder

Edith Cowan University (ECU) and the Higher Education Commission (HEC) of Pakistan

Office of National Intelligence National Intelligence Postdoctoral

Publisher

Association for Computing Machinery (ACM)

References: 161 articles.
