Affiliation:
1. School of Mechanical Engineering and Rail Transit, Changzhou University, Changzhou 213164, China
2. School of Innovation and Entrepreneurship, Changzhou University, Changzhou 213164, China
Abstract
Action recognition has found extensive applications in fields such as video classification and security monitoring. However, existing action recognition methods, such as those based on 3D convolutional neural networks, often struggle to capture comprehensive global information, while transformer-based approaches suffer from excessively high computational complexity. We introduce a Multi-Scale Video Longformer network (MSVL), built upon the 3D Longformer architecture with a "local attention + global features" attention mechanism, which reduces computational complexity while preserving global modeling capability. Specifically, MSVL gradually reduces the video feature resolution and increases the feature dimensions across four stages. In the lower layers of the network (stage 1, stage 2), we leverage local window attention to alleviate local redundancy and computational demands, while global tokens retain global features. In the higher layers of the network (stage 3, stage 4), the local window attention evolves into a dense computation mechanism, enhancing overall performance. Finally, extensive experiments on UCF101 (97.6%), HMDB51 (72.9%), and the assembly action dataset (100.0%) demonstrate the effectiveness and efficiency of MSVL.
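The "local attention + global features" mechanism described above can be illustrated with a minimal sketch: each local token attends only within its window plus to a small set of global tokens, while global tokens attend densely to everything. This is a simplified single-head, 1D version of the 3D Longformer pattern (the paper's actual model operates on 3D video feature volumes with learned projections); the function name and identity Q/K/V projections here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_global_attention(local_tokens, global_tokens, window=4):
    """Sketch of 'local window + global' attention (1D, single head).

    local_tokens:  (N, C) sequence tokens, attending within their own
                   window of size `window` plus all global tokens.
    global_tokens: (G, C) tokens that attend densely to every token,
                   so global context survives despite windowed attention.
    Identity Q/K/V projections keep the sketch minimal.
    """
    N, C = local_tokens.shape
    G = global_tokens.shape[0]
    x = np.concatenate([global_tokens, local_tokens], axis=0)  # (G+N, C)
    scale = 1.0 / np.sqrt(C)
    out = np.empty_like(x)

    # Global tokens: dense attention over all G+N tokens.
    attn_g = softmax((global_tokens @ x.T) * scale, axis=-1)
    out[:G] = attn_g @ x

    # Local tokens: attend to own window + global tokens only,
    # giving O(N * (window + G)) cost instead of O(N^2).
    for i in range(N):
        lo = (i // window) * window
        hi = min(lo + window, N)
        keys = np.concatenate([global_tokens, local_tokens[lo:hi]], axis=0)
        attn = softmax((local_tokens[i] @ keys.T) * scale, axis=-1)
        out[G + i] = attn @ keys
    return out
```

Setting `window >= N` recovers dense attention, mirroring how the abstract's higher stages (stage 3, stage 4) let windowed attention become a dense computation once the feature resolution is small.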
Funder
Jiangsu Carbon Peak Carbon Neutrality Science and Technology Innovation Project of China
Cited by 1 article.