MMTSA: Multimodal Temporal Segment Attention Network for Efficient Human Activity Recognition

Author:

Gao Ziqi (1), Wang Yuntao (2), Chen Jianguo (3), Xing Junliang (4), Patel Shwetak (5), Liu Xin (5), Shi Yuanchun (6)

Affiliation:

1. Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Global Innovation Exchange (GIX) Institute, Tsinghua University, Beijing, China

2. Key Laboratory of Pervasive Computing, Ministry of Education, Department of Computer Science and Technology, Tsinghua University, Beijing, China and Department of Computer Technology and Application, Qinghai University, Xining, Qinghai, China

3. University of Virginia, Charlottesville, VA, USA

4. Department of Computer Science and Technology, Tsinghua University, Beijing, China

5. Paul G. Allen School for Computer Science and Engineering, University of Washington, Seattle, WA, USA

6. Department of Computer Science and Technology, Tsinghua University, Beijing, China and Qinghai University, Xining, Qinghai, China

Abstract

Multimodal sensors provide complementary information for developing accurate machine-learning methods for human activity recognition (HAR), but they introduce significantly higher computational load, which reduces efficiency. This paper proposes an efficient multimodal neural architecture for HAR using an RGB camera and inertial measurement units (IMUs), called the Multimodal Temporal Segment Attention Network (MMTSA). MMTSA first transforms IMU sensor data into a temporal- and structure-preserving gray-scale image using the Gramian Angular Field (GAF), representing the inherent properties of human activities. MMTSA then applies a multimodal sparse sampling method to reduce data redundancy. Lastly, MMTSA adopts an inter-segment attention module for efficient multimodal fusion. Using three well-established public datasets, we evaluated MMTSA's effectiveness and efficiency in HAR. Results show that our method achieves superior performance (an 11.13% improvement in cross-subject F1-score on the MMAct dataset) over previous state-of-the-art (SOTA) methods. The ablation study and analysis demonstrate MMTSA's effectiveness in fusing multimodal data for accurate HAR. The efficiency evaluation on an edge device showed that MMTSA achieved significantly better accuracy, lower computational load, and lower inference latency than SOTA methods.
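The GAF encoding mentioned in the abstract maps a 1-D sensor series onto a 2-D image by rescaling each sample into [-1, 1], interpreting it as the cosine of a polar angle, and taking pairwise angular sums. The sketch below illustrates the standard Gramian Angular Summation Field construction; it is a minimal illustration of the general technique, not the authors' implementation (the function name and the synthetic signal are assumptions for demonstration).

```python
import numpy as np

def gramian_angular_field(x):
    """Encode a 1-D time series as a Gramian Angular Summation Field image."""
    x = np.asarray(x, dtype=float)
    # Min-max scale the series into [-1, 1] so arccos is well defined.
    x_scaled = 2.0 * (x - x.min()) / (x.max() - x.min()) - 1.0
    # Clip to guard against floating-point overshoot outside [-1, 1].
    x_scaled = np.clip(x_scaled, -1.0, 1.0)
    phi = np.arccos(x_scaled)  # polar-coordinate angle per sample
    # GASF: G[i, j] = cos(phi_i + phi_j), a symmetric matrix that
    # preserves temporal order along both axes.
    return np.cos(phi[:, None] + phi[None, :])

# Example: encode one synthetic IMU channel into an 8x8 gray-scale image.
signal = np.sin(np.linspace(0.0, 2.0 * np.pi, 8))
image = gramian_angular_field(signal)
print(image.shape)  # (8, 8)
```

Each IMU channel encoded this way yields one image plane, which can then be consumed by an image-based backbone alongside the RGB stream.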

Funder

Tsinghua University Initiative Scientific Research Program

Institute for Artificial Intelligence, Tsinghua University

Natural Science Foundation of China

Young Elite Scientists Sponsorship Program by CAST

Beijing Key Lab of Networked Multimedia

Beijing National Research Center for Information Science and Technology

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture, Human-Computer Interaction

Cited by 1 article.

1. Integrating Gaze and Mouse Via Joint Cross-Attention Fusion Net for Students' Activity Recognition in E-learning;Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies;2023-09-27
