A Multi-Scale Video Longformer Network for Action Recognition

Published:2024-01-26 Issue:3 Volume:14 Page:1061
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Chen Congping¹, Zhang Chunsheng¹, Dong Xin²

Affiliation:

1. School of Mechanical Engineering and Rail Transit, Changzhou University, Changzhou 213164, China

2. School of Innovation and Entrepreneurship, Changzhou University, Changzhou 213164, China

Abstract

Action recognition has found extensive applications in fields such as video classification and security monitoring. However, existing action recognition methods, such as those based on 3D convolutional neural networks, often struggle to capture comprehensive global information. Meanwhile, transformer-based approaches face challenges associated with excessively high computational complexity. We introduce a Multi-Scale Video Longformer network (MSVL), built upon the 3D Longformer architecture featuring a “local attention + global features” attention mechanism, enabling us to reduce computational complexity while preserving global modeling capabilities. Specifically, MSVL gradually reduces the video feature resolution and increases the feature dimensions across four stages. In the lower layers of the network (stage 1, stage 2), we leverage local window attention to alleviate local redundancy and computational demands. Concurrently, global tokens are employed to retain global features. In the higher layers of the network (stage 3, stage 4), this local window attention evolves into a dense computation mechanism, enhancing overall performance. Finally, extensive experiments are conducted on UCF101 (97.6%), HMDB51 (72.9%), and the assembly action dataset (100.0%), demonstrating the effectiveness and efficiency of the MSVL.
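The "local attention + global features" idea in the abstract can be illustrated with a small attention-mask sketch. This is not the authors' implementation: the function names, the sliding (overlapping) window layout, the window size, and the number of global tokens are all assumptions chosen for illustration, since the abstract does not specify them. The sketch only shows why restricting each video token to a local 3D neighborhood, plus a few global tokens, needs far fewer query-key pairs than dense attention.

```python
from itertools import product


def local_global_mask(grid, window, num_global):
    """Build a boolean attention mask (hypothetical helper, illustrative only).

    grid:       (T, H, W) size of the video token grid
    window:     (wt, wh, ww) odd window sizes centered on each query token
    num_global: number of global tokens placed at the start of the sequence
    Returns mask[q][k] == True where query q may attend to key k.
    """
    T, H, W = grid
    coords = list(product(range(T), range(H), range(W)))
    n = num_global + len(coords)
    half = tuple(w // 2 for w in window)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < num_global or k < num_global:
                # Assumption: global tokens attend to every token and are
                # attended to by every token.
                mask[q][k] = True
            else:
                cq = coords[q - num_global]
                ck = coords[k - num_global]
                # Local tokens attend only within the 3D window around them.
                mask[q][k] = all(abs(a - b) <= h for a, b, h in zip(cq, ck, half))
    return mask


def attended_pairs(mask):
    """Count query-key pairs that are actually computed under the mask."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    mask = local_global_mask(grid=(4, 6, 6), window=(3, 3, 3), num_global=1)
    n = len(mask)
    print("local + global pairs:", attended_pairs(mask))  # 2849
    print("dense pairs:", n * n)                          # 21025
```

With this toy 4×6×6 grid and one global token, the mask keeps 2849 of the 21025 pairs that dense attention would compute. Once the feature resolution has been reduced, the window covers most of the grid, so switching to dense attention in the later stages, as the abstract describes, adds comparatively little cost.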

Funder

Jiangsu Carbon Peak Carbon Neutrality Science and Technology Innovation Project of China

Publisher

MDPI AG

Link

https://www.mdpi.com/2076-3417/14/3/1061/pdf

Cited by 1 article.

1. Recognizing human activities with the use of Convolutional Block Attention Module;Egyptian Informatics Journal;2024-09