Local-aware spatio-temporal attention network with multi-stage feature fusion for human action recognition

Published:2021-07-11 Issue:23 Volume:33 Page:16439-16450
ISSN:0941-0643
Container-title:Neural Computing and Applications
language:en
Short-container-title:Neural Comput & Applic

Author:

Hou Yaqing,Yu Hua,Zhou Dongsheng,Wang Pengfei,Ge Hongwei,Zhang Jianxin,Zhang Qiang

Abstract

In the study of human action recognition, two-stream networks have made excellent progress recently. However, there remain challenges in distinguishing similar human actions in videos. This paper proposes a novel local-aware spatio-temporal attention network with multi-stage feature fusion based on compact bilinear pooling for human action recognition. To elaborate, taking two-stream networks as our essential backbones, the spatial network first employs multiple spatial transformer networks in parallel to locate the discriminative regions related to human actions. Then, we perform feature fusion between the local and global features to enhance the human action representation. Furthermore, the output of the spatial network and the temporal information are fused at a particular layer to learn the pixel-wise correspondences. After that, we bring together the three outputs to generate the global descriptors of human actions. To verify the efficacy of the proposed approach, comparison experiments are conducted with the traditional hand-engineered IDT algorithms, classical machine learning methods (e.g., SVM) and state-of-the-art deep learning methods (e.g., spatio-temporal multiplier networks). According to the results, our approach achieves the best performance among existing works, with accuracies of 95.3% and 72.9% on UCF101 and HMDB51, respectively. The experimental results thus demonstrate the superiority and significance of the proposed architecture in solving the task of human action recognition.

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence,Software

Link

https://link.springer.com/content/pdf/10.1007/s00521-021-06239-5.pdf

Reference 42 articles.

1. Chéron G, Laptev I, Schmid C (2015) P-cnn: Pose-based cnn features for action recognition. In: Proceedings of the IEEE international conference on computer vision, pp. 3218–3226

2. Dai H, Shahzad M, Liu AX, Zhong Y (2016) Finding persistent items in data streams. Proceedings of the VLDB Endowment 10(4):289–300

3. Dalal N, Triggs B, Schmid C (2006) Human detection using oriented histograms of flow and appearance. Springer, Berlin

4. Deng J, Dong W, Socher R, Li L, Li K, Feifei L (2009) Imagenet: a large-scale hierarchical image database pp. 248–255

5. Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Darrell T, Saenko K (2015) Long-term recurrent convolutional networks for visual recognition and description pp. 2625–2634

Cited by 9 articles.

1. Unsupervised video-based action recognition using two-stream generative adversarial network;Neural Computing and Applications;2023-12-26

2. Spatio-Temporal Information Fusion and Filtration for Human Action Recognition;Symmetry;2023-12-08

3. Dual-Stream Spatiotemporal Networks with Feature Sharing for Monitoring Animals in the Home Cage;Sensors;2023-11-30

4. Toward Realistic 3D Human Motion Prediction With a Spatio-Temporal Cross-Transformer Approach;IEEE Transactions on Circuits and Systems for Video Technology;2023-10

5. A novel two-level interactive action recognition model based on inertial data fusion;Information Sciences;2023-07