ViolenceNet: Dense Multi-Head Self-Attention with Bidirectional Convolutional LSTM for Detecting Violence

Published:2021-07-03 Issue:13 Volume:10 Page:1601
ISSN:2079-9292
Container-title:Electronics
language:en
Short-container-title:Electronics

Author:

Rendón-Segador Fernando J., Álvarez-García Juan A., Enríquez Fernando, Deniz Oscar

Abstract

Introducing efficient automatic violence detection in video surveillance or audiovisual content monitoring systems would greatly facilitate the work of closed-circuit television (CCTV) operators, rating agencies and those in charge of monitoring social network content. In this paper we present a new deep learning architecture that combines an adapted version of DenseNet for three dimensions, a multi-head self-attention layer and a bidirectional convolutional long short-term memory (LSTM) module. It encodes relevant spatio-temporal features to determine whether a video is violent or not. Furthermore, an ablation study compares two input representations, dense optical flow and adjacent frame subtraction, and measures the influence of the attention layer, showing that the combination of optical flow and the attention mechanism improves results by up to 4.4%. Experiments on four of the most widely used datasets for this problem match or, in some cases, exceed state-of-the-art results while reducing the number of network parameters needed (4.5 million). Test accuracy ranges from 95.6% on the most complex dataset to 100% on the simplest one, and inference takes less than 0.3 s for the longest clips. Finally, to check whether the generated model is able to generalize violence detection, a cross-dataset analysis is performed, which shows the complexity of this approach: when training on three datasets and testing on the remaining one, accuracy drops to 70.08% in the worst case and 81.51% in the best case, which points to future work oriented towards anomaly detection in new datasets.
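
The cross-dataset protocol described in the abstract (train on three datasets, test on the remaining one) can be illustrated with a minimal sketch. The dataset names and clip identifiers below are placeholders, not the datasets used in the paper, and the function `leave_one_dataset_out` is a hypothetical helper introduced only for illustration; the authors' actual evaluation code may differ.

```python
from typing import Dict, Iterator, List, Tuple


def leave_one_dataset_out(
    datasets: Dict[str, List[str]],
) -> Iterator[Tuple[str, List[str], List[str]]]:
    """Yield (held_out_name, train_clips, test_clips) for each dataset in turn."""
    for held_out in datasets:
        # Train on every clip from the other datasets.
        train = [
            clip
            for name, clips in datasets.items()
            if name != held_out
            for clip in clips
        ]
        # Test only on the clips of the held-out dataset.
        test = list(datasets[held_out])
        yield held_out, train, test


# Placeholder dataset names and clip identifiers (assumptions, not from the paper).
datasets = {
    "dataset_a": ["a1.avi", "a2.avi"],
    "dataset_b": ["b1.avi"],
    "dataset_c": ["c1.avi", "c2.avi", "c3.avi"],
    "dataset_d": ["d1.avi"],
}

splits = list(leave_one_dataset_out(datasets))
for held_out, train, test in splits:
    print(f"test on {held_out}: {len(train)} training clips, {len(test)} test clips")
```

With four datasets this yields four train/test splits, one per held-out dataset, and no clip appears on both sides of a split.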

Funder

Ministerio de Economía, Industria y Competitividad, Gobierno de España

Publisher

MDPI AG

Subject

Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering

Link

https://www.mdpi.com/2079-9292/10/13/1601/pdf

Cited by 36 articles.

1. Multimodal fusion: A study on speech-text emotion recognition with the integration of deep learning;Intelligent Systems with Applications;2024-12

2. Elevating urban surveillance: A deep CCTV monitoring system for detection of anomalous events via human action recognition;Sustainable Cities and Society;2024-11

3. Revisiting vision-based violence detection in videos: A critical analysis;Neurocomputing;2024-09

4. Detection and Blurring Bloodstained Violence Scene by Convolutional Neural Network-Based Model for Media Platforms;SN Computer Science;2024-08-28

5. Transformer and Adaptive Threshold Sliding Window for Improving Violence Detection in Videos;Sensors;2024-08-22