Motion Vector-Based Self-Attention for Real-Time Human Activity Recognition in Compressed Videos: The MVViT Approach

Published:2024-03-30 Issue:04 Volume:38 Page:
ISSN:0218-0014
Container-title:International Journal of Pattern Recognition and Artificial Intelligence
language:en
Short-container-title:Int. J. Patt. Recogn. Artif. Intell.

Author:

Praveenkumar S. M.¹, Patil Prakashgoud¹, Hiremath P. S.¹

Affiliation:

1. Department of Computer Applications, KLE Technological University, Vidyanagar, Hubballi, Karnataka 580031, India

Abstract

This paper proposes a novel method for real-time human activity recognition in the compressed video domain, based on motion vectors and the self-attention mechanism of vision transformers; it is termed motion vectors and vision transformers (MVViT). Videos in the MPEG-4 and H.264 compression formats are considered in this study. Any video source can be used without prior setup by adapting the proposed method to the corresponding video codec and camera settings. Existing algorithms for human action recognition in compressed video have limitations in this regard: (i) they require keyframes at a fixed interval, (ii) they use P frames only, and (iii) they normally support a single codec. The proposed method overcomes these limitations by allowing arbitrary keyframe intervals, using both P and B frames, and supporting both MPEG-4 and H.264 codecs. Experiments are carried out on the benchmark datasets UCF101, HMDB51, and THUMOS14, and the recognition accuracy in the compressed domain is found to be comparable to that obtained on raw video data, but at reduced computational cost. The proposed MVViT method outperforms other recent methods, with a lower number of parameters (61.0%) and lower Giga Floating Point Operations Per Second (GFLOPS) (63.7%), while significantly improving accuracy by 0.8%, 5.9%, and 16.6% on UCF101, HMDB51, and THUMOS14, respectively. In addition, speed increases by 8% on UCF101 compared with the highest speed reported in the literature for the same dataset. An ablation study is conducted using MVViT variants for different codecs, and performance is analyzed in comparison with state-of-the-art network models.
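
The abstract's core idea, applying self-attention to motion vectors instead of decoded pixels, can be illustrated with a small sketch. This is not the authors' MVViT architecture: the patch size, the token dimension, the single attention head, and the helper names `patchify` and `self_attention` are all assumptions chosen for illustration. A random array stands in for a motion-vector field that a real pipeline would read from the MPEG-4 or H.264 bitstream.

```python
import numpy as np


def patchify(mv_field, patch):
    """Split an (H, W, 2) motion-vector field into flattened, non-overlapping patches."""
    h, w, c = mv_field.shape
    assert h % patch == 0 and w % patch == 0, "field must tile evenly into patches"
    x = mv_field.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)            # (rows, cols, patch, patch, c)
    return x.reshape(-1, patch * patch * c)   # (num_tokens, patch * patch * c)


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])   # pairwise token similarities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)

# Hypothetical stand-in for one frame's decoded (dx, dy) motion vectors.
mv_field = rng.normal(size=(8, 8, 2))

tokens = patchify(mv_field, patch=4)          # 4 tokens, each of dimension 32
d_model = 16                                  # assumed projection width
w_q, w_k, w_v = (
    rng.normal(scale=0.1, size=(tokens.shape[1], d_model)) for _ in range(3)
)

attended, weights = self_attention(tokens, w_q, w_k, w_v)
print(tokens.shape, attended.shape)
```

Because motion vectors are already stored per block in the compressed stream, a token sequence of this kind can be built without fully decoding each frame, which is consistent with the reduced computational cost the abstract reports.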

Publisher

World Scientific Pub Co Pte Ltd

Link

https://www.worldscientific.com/doi/pdf/10.1142/S0218001424500058
