A Novel Two-Stream Transformer-Based Framework for Multi-Modality Human Action Recognition

Published: 2023-02-05 Volume: 13 Issue: 4 Page: 2058
ISSN: 2076-3417
Journal: Applied Sciences
Language: en

Author:

Shi Jing¹, Zhang Yuanyuan¹, Wang Weihang¹, Xing Bin², Hu Dasha¹, Chen Liangyin¹,³

Affiliation:

1. School of Computer Science, Sichuan University, Chengdu 610065, China

2. Chongqing Innovation Center of Industrial Big-Data Co., Ltd., Chongqing 400707, China

3. Institute for Industrial Internet Research, Sichuan University, Chengdu 610065, China

Abstract

Following the success of the Vision Transformer (ViT) in image classification, many pure Transformer architectures for human action recognition have been proposed. However, very few works have attempted to use Transformers for bimodal action recognition, i.e., action recognition that uses both the skeleton and RGB modalities. As many previous works have shown, the RGB and skeleton modalities are complementary in human action recognition tasks. How to use both modalities in a Transformer-based framework remains a challenge. In this paper, we propose RGBSformer, a novel two-stream, pure Transformer-based framework for human action recognition using both RGB and skeleton modalities. Using only RGB videos, we can acquire skeleton data and generate the corresponding skeleton heatmaps. We then input the skeleton heatmaps and RGB frames to the Transformer at different temporal and spatial resolutions. Because the skeleton heatmaps are primary features compared to the original RGB frames, we use fewer attention layers in the skeleton stream. We also propose two ways to fuse the information from the two streams. Experiments demonstrate that the proposed framework achieves state-of-the-art results on four benchmarks: three widely used datasets (Kinetics400, NTU RGB+D 60, and NTU RGB+D 120) and the fine-grained dataset FineGym99.
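
The abstract does not say how the skeleton heatmaps are generated from the estimated joints. One common approach, shown here only as a minimal sketch and not as the authors' method, is to place a 2D Gaussian at each joint location and scale it by the joint's detection confidence. The function name `joint_heatmaps`, the `sigma` value, and the sample keypoints are all hypothetical.

```python
import math


def joint_heatmaps(keypoints, height, width, sigma=1.0):
    """Return one height x width heatmap (nested lists) per joint.

    keypoints: list of (x, y, confidence) tuples in pixel coordinates.
    sigma: Gaussian spread in pixels (an assumed value, not from the paper).
    """
    heatmaps = []
    for x0, y0, conf in keypoints:
        # Each pixel gets the confidence-weighted Gaussian value for this joint.
        heatmap = [
            [
                conf * math.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2 * sigma ** 2))
                for x in range(width)
            ]
            for y in range(height)
        ]
        heatmaps.append(heatmap)
    return heatmaps


# Two hypothetical joints on a 5 x 5 grid.
maps = joint_heatmaps([(2, 2, 1.0), (0, 4, 0.5)], height=5, width=5)
```

In this sketch the peak of each heatmap equals the joint's confidence, so low-confidence detections contribute weaker responses to the skeleton stream's input.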

Funder

National Natural Science Foundation of China

Sichuan Science and Technology Program

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/13/4/2058/pdf

References (50 articles)

1. Shu. Expansion-squeeze-excitation fusion network for elderly activity recognition. IEEE Trans. Circuits Syst. Video Technol., 2022.

2. Park, S.K., Chung, J.H., Pae, D.S., and Lim, M.T. (2022). Binary Dense SIFT Flow Based Position-Information Added Two-Stream CNN for Pedestrian Action Recognition. Appl. Sci., 12.

3. Yue. Action recognition based on RGB and skeleton data sets: A survey. Neurocomputing, 2022.

4. Liu. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell., 2017.

5. Imran, J., and Kumar, P. (2016, January 21–24). Human action recognition using RGB-D sensor and deep convolutional neural networks. Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India.

Cited by 8 articles.

1. A Dense-Sparse Complementary Network for Human Action Recognition based on RGB and Skeleton Modalities. Expert Systems with Applications, 2024-06.

2. Temporal-Channel Attention and Convolution Fusion for Skeleton-Based Human Action Recognition. IEEE Access, 2024.

3. Multimodal action recognition: a comprehensive survey on temporal modeling. Multimedia Tools and Applications, 2023-12-22.

4. Non-Uniform Motion Aggregation with Graph Convolutional Networks for Skeleton-Based Human Action Recognition. Electronics, 2023-10-30.

5. Mitigating Context Bias in Action Recognition via Skeleton-Dominated Two-Stream Network. Proceedings of the 2023 Workshop on Advanced Multimedia Computing for Smart Manufacturing and Engineering, 2023-10-29.