3D-STARNET: Spatial–Temporal Attention Residual Network for Robust Action Recognition
Published: 2024-08-15
Volume: 14
Issue: 16
Page: 7154
ISSN: 2076-3417
Container-title: Applied Sciences
Short-container-title: Applied Sciences
Language: en
Authors:
Yang Jun 1,2, Sun Shulong 2, Chen Jiayue 1, Xie Haizhen 1, Wang Yan 1, Yang Zenglong 1
Affiliations:
1. Big Data and Internet of Things Research Center, China University of Mining and Technology, Beijing 100083, China
2. Key Laboratory of Intelligent Mining and Robotics, Ministry of Emergency Management, Beijing 100083, China
Abstract
Existing skeleton-based action recognition methods face the challenges of insufficient spatiotemporal feature mining and low efficiency of information transmission. To address these problems, this paper proposes the Spatial–Temporal Attention Residual Network for 3D human action recognition (3D-STARNET). The model improves action recognition performance through three main innovations: (1) conversion from skeleton points to heatmaps. Applying a Gaussian transform to convert skeleton-point data into heatmaps reduces the model's strong dependence on the raw skeleton points and improves the stability and robustness of the data representation. (2) A spatiotemporal attention mechanism (STA). A novel spatiotemporal attention mechanism is proposed that focuses on extracting key frames and key regions within frames, significantly enhancing the model's ability to identify behavioral patterns. (3) A multi-stage residual structure (MS-Residual). The multi-stage residual structure improves the efficiency of information flow through the network, mitigates the vanishing-gradient problem in deep networks, and helps improve the model's recognition efficiency. Experimental results on the NTU-RGBD120 dataset show that 3D-STARNET significantly improves action recognition accuracy, with the overall network reaching a Top-1 accuracy of 96.74%. The method not only addresses the robustness shortcomings of existing approaches but also improves the capture of spatiotemporal features, providing an efficient and widely applicable solution for skeleton-based action recognition.
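The abstract does not include an implementation, but the skeleton-to-heatmap step in innovation (1) is concrete enough to sketch. The snippet below is a minimal illustration of rendering 2D joint coordinates as per-joint Gaussian heatmaps; the function name, grid size, and sigma value are illustrative assumptions, not the authors' code.

```python
import numpy as np

def joints_to_heatmaps(joints, height, width, sigma=2.0):
    """Render 2D joint coordinates as per-joint Gaussian heatmaps.

    joints: array of shape (J, 2) holding (x, y) pixel coordinates.
    Returns an array of shape (J, height, width) with one Gaussian
    blob centred on each joint. Hypothetical helper for illustration,
    not the 3D-STARNET implementation.
    """
    ys, xs = np.mgrid[0:height, 0:width]  # pixel coordinate grids
    heatmaps = np.zeros((len(joints), height, width), dtype=np.float32)
    for j, (cx, cy) in enumerate(joints):
        # squared distance of every pixel from the joint centre
        d2 = (xs - cx) ** 2 + (ys - cy) ** 2
        heatmaps[j] = np.exp(-d2 / (2.0 * sigma ** 2))
    return heatmaps

# Example: three joints rendered on a 64x64 grid
maps = joints_to_heatmaps(
    np.array([[10.0, 12.0], [32.0, 30.0], [50.0, 55.0]]), 64, 64)
print(maps.shape)  # (3, 64, 64)
```

In the full model, such per-frame heatmaps would presumably be stacked over time into a clip-level volume before the attention and residual stages; details such as 3D Gaussians, per-limb maps, or normalization follow the paper rather than this sketch.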
Funders:
National Special Project of Science and Technology Basic Resources Survey; National Natural Science Foundation of China Innovation Group Project