Weakly-Supervised Video Anomaly Detection with MTDA-Net
Published: 2023-11-12
Issue: 22
Volume: 12
Page: 4623
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics
Author:
Wu Huixin 1, Yang Mengfan 1, Wei Fupeng 1, Shi Ge 1, Jiang Wei 1, Qiao Yaqiong 1, Dong Hangcheng 2
Affiliation:
1. School of Information Engineering, North China University of Water Resources and Electric Power, Zhengzhou 450046, China
2. School of Instrumentation Science and Engineering, Harbin Institute of Technology, Harbin 150001, China
Abstract
Weakly supervised anomalous behavior detection is an active research area. Compared with semi-supervised anomalous behavior detection, weakly supervised learning eliminates the need to crop videos and avoids the difficulty that semi-supervised learning has in handling long videos. Previous work has used graph convolution or self-attention mechanisms to model temporal relationships. However, these methods tend to model temporal relationships at a single scale and do not consider how to aggregate different temporal relationships. In this paper, we propose a weakly supervised anomaly detection framework, MTDA-Net, which emphasizes modeling different temporal relationships and enhancing semantic discrimination. To this end, we construct a new plug-and-play module, MTDA, which uses three branches, Multi-headed Attention (MHA), Temporal Shift (TS), and Dilated Aggregation (DA), to extract different temporal relationships. Specifically, the MHA branch models the video information globally and projects the features into different semantic spaces to enhance their expressiveness and discrimination. The DA branch extracts temporal information at different scales via dilated convolution and captures the temporal features of local regions in the video. The TS branch fuses the features of adjacent frames on a local scale and enhances the information flow. MTDA-Net learns the temporal relationships between video segments on the different branches and learns powerful video representations based on these relationships. Experimental results on the XD-Violence dataset show that MTDA-Net can significantly improve the detection accuracy of abnormal behaviors.
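As a rough illustration of the three-branch idea described in the abstract, the following NumPy sketch applies a self-attention branch, a temporal-shift branch, and a dilated-aggregation branch to a sequence of segment features. It is not the authors' implementation: all function names, the `(T, C)` feature layout, the parameter-free (unlearned) branches, the fixed averaging taps that stand in for learned dilated convolutions, and the fusion by a residual average are assumptions made purely for illustration.

```python
import numpy as np


def multi_head_attention(x, num_heads=2):
    """Global branch (assumed form): scaled dot-product self-attention over
    time, computed independently on each channel group ("head").
    No learned projections are used in this sketch."""
    t, c = x.shape
    assert c % num_heads == 0, "channels must divide evenly into heads"
    d = c // num_heads
    heads = x.reshape(t, num_heads, d).transpose(1, 0, 2)         # (H, T, D)
    scores = heads @ heads.transpose(0, 2, 1) / np.sqrt(d)        # (H, T, T)
    scores = scores - scores.max(axis=-1, keepdims=True)          # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return (weights @ heads).transpose(1, 0, 2).reshape(t, c)     # (T, C)


def temporal_shift(x, fold_div=4):
    """Local branch (assumed form): shift one channel group forward in time,
    one group backward, and leave the rest unchanged, zero-padding the ends."""
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)
    out[1:, :fold] = x[:-1, :fold]                    # segment t receives t-1
    out[:-1, fold:2 * fold] = x[1:, fold:2 * fold]    # segment t receives t+1
    out[:, 2 * fold:] = x[:, 2 * fold:]               # untouched channels
    return out


def dilated_aggregation(x, dilations=(1, 2, 4)):
    """Multi-scale branch (assumed form): average a 3-tap window at each
    dilation rate, then average across rates. Fixed taps replace the learned
    dilated convolution kernels."""
    t = x.shape[0]
    outs = []
    for d in dilations:
        p = np.pad(x, ((d, d), (0, 0)))               # zero-pad the time axis
        outs.append((p[:t] + p[d:d + t] + p[2 * d:2 * d + t]) / 3.0)
    return np.mean(outs, axis=0)


def mtda_sketch(x):
    """Fuse the three branches with a residual average (an assumed fusion)."""
    branches = multi_head_attention(x) + temporal_shift(x) + dilated_aggregation(x)
    return x + branches / 3.0


# Usage: 8 video segments, each with a 16-dimensional feature vector.
features = np.random.default_rng(0).normal(size=(8, 16))
fused = mtda_sketch(features)
```

Each branch keeps the `(T, C)` shape, so the module can be dropped between a feature extractor and a classifier without changing the surrounding shapes, which is one plausible reading of "plug-and-play" here.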
Funder
National Natural Science Foundation of China; Key Research Projects of Henan Higher Education Institutions; Open Foundation of Henan Key Laboratory of Cyberspace Situation Awareness
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering