3D-ShuffleViT: An Efficient Video Action Recognition Network with Deep Integration of Self-Attention and Convolution
Published: 2023-09-08
Journal (Container-title): Mathematics
Volume: 11
Issue: 18
Page: 3848
ISSN: 2227-7390
Language: en
Author:
Wang Yinghui (1), Zhu Anlei (1), Ma Haomiao (2), Ai Lingyu (3), Song Wei (1), Zhang Shaojie (1)
Affiliation:
1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi 214122, China
2. School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
3. School of Internet of Things Engineering, Jiangnan University, Wuxi 214122, China
Abstract
Compared with traditional methods, action recognition models based on 3D convolutional deep neural networks capture spatio-temporal features more accurately, resulting in higher accuracy. However, the large number of parameters and the computational requirements of 3D models make them difficult to deploy on mobile devices with limited computing power. To achieve an efficient video action recognition model, we analyzed and compared the design principles of classic lightweight networks and propose the 3D-ShuffleViT network. By deeply integrating the self-attention mechanism with convolution, we introduce an efficient ACISA module that further enhances the performance of the proposed model. This results in exceptional performance in both context-sensitive and context-independent action recognition, while reducing deployment costs. It is worth noting that our 3D-ShuffleViT network, at a computational cost of only 6% of that of SlowFast-ResNet101, achieved 98% of the latter’s Top-1 accuracy on the EgoGesture dataset; furthermore, on the same CPU (Intel i5-8300H), it ran 2.5 times faster. In addition, when deployed on edge devices, the proposed network achieved the best balance between accuracy and speed among lightweight networks of the same order.
Funder
National Natural Science Foundation of China; “Double Creation” Plan of Jiangsu Province; “Taihu Talent-Innovative Leading Talent” Plan of Wuxi City
Subject
General Mathematics, Engineering (miscellaneous), Computer Science (miscellaneous)
References (38 articles)
1. Feichtenhofer, C. (2020, June 14–19). X3D: Expanding architectures for efficient video recognition. Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA.
2. Tran, D., Bourdev, L.D., Fergus, R., Torresani, L., and Paluri, M. (2015, December 7–13). Learning spatiotemporal features with 3D convolutional networks. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.
3. Carreira, J., and Zisserman, A. (2017, July 21–26). Quo vadis, action recognition? A new model and the Kinetics dataset. Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA.
4. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. (2018, June 18–22). A closer look at spatiotemporal convolutions for action recognition. Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA.
5. Feichtenhofer, C., Fan, H., Malik, J., and He, K. (2019, October 27–November 2). SlowFast networks for video recognition. Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea.
Cited by: 2 articles.