A coarse-to-fine temporal action detection method combining light and heavy networks

Published:2022-06-10 Issue:1 Volume:82 Page:879-898
ISSN:1380-7501
Container-title:Multimedia Tools and Applications
language:en
Short-container-title:Multimed Tools Appl

Author:

Zhao Fan, Wang Wen, Wu Yu, Wang Kaixuan, Kang Xiaobing

Abstract

Temporal action detection aims to determine whether a long untrimmed video contains action instances and to locate the start and end time of each one. Although existing action detection methods have shown promising results in recent years with the widespread application of Convolutional Neural Networks (CNNs), accurately locating each action segment while maintaining real-time performance remains challenging. To achieve a good tradeoff between detection efficiency and accuracy, we present a coarse-to-fine hierarchical temporal action detection method based on a multi-scale sliding window mechanism. Since the complexity of the convolution operator is proportional to the number and size of the input video clips, the idea of our method is to first determine candidate action proposals and then perform detection only on these candidates, reducing the overall complexity. Making full use of the spatio-temporal information in video clips, a lightweight 3D-CNN classifier first quickly determines whether a clip is a candidate action proposal, so that the heavyweight deep network does not have to process the large number of non-action clips. A heavyweight detector is then designed to further improve the accuracy of action localization by including both a boundary regression loss and a category loss in the target loss function. In addition, Non-Maximum Suppression (NMS) is performed to eliminate redundant detections among overlapping proposals. With the temporal Intersection-over-Union (tIoU) threshold set to 0.5, the mean Average Precision (mAP) is 40.6%, 51.7%, and 20.4% on the THUMOS14, ActivityNet, and MPII Cooking datasets, respectively. Experimental results show the superior performance of the proposed method on three challenging temporal activity detection datasets while achieving real-time speed. At the same time, our method can generate proposals for unseen action classes with high recall.
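
The abstract names three generic building blocks: multi-scale sliding windows, temporal IoU, and NMS over overlapping proposals. The following is a minimal Python sketch of those pieces, not the authors' implementation. The function names, the window lengths, the 50% window overlap, and the greedy form of NMS are all assumptions made for illustration; the abstract does not specify them.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_time, end_time) in seconds


def sliding_windows(duration: float, lengths: List[float],
                    overlap: float = 0.5) -> List[Segment]:
    """Generate candidate windows at several temporal scales.

    `lengths` and `overlap` are illustrative choices, not values from the paper.
    """
    windows = []
    for length in lengths:
        stride = length * (1.0 - overlap)
        start = 0.0
        while start + length <= duration:
            windows.append((start, start + length))
            start += stride
    return windows


def tiou(a: Segment, b: Segment) -> float:
    """Temporal IoU: overlap duration divided by the duration of the union."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def temporal_nms(segments: List[Segment], scores: List[float],
                 threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the highest-scoring segments, dropping any segment
    whose tIoU with an already kept segment exceeds `threshold`.
    Returns the indices of the kept segments in descending score order."""
    order = sorted(range(len(segments)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(tiou(segments[i], segments[k]) <= threshold for k in keep):
            keep.append(i)
    return keep
```

In a coarse-to-fine pipeline of the kind described, the windows from `sliding_windows` would first be filtered by the lightweight classifier, the surviving proposals refined and scored by the heavyweight detector, and the scored segments finally passed through `temporal_nms`.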

Funder

Natural Science Foundation of Shaanxi Province

National Natural Science Foundation of China

Publisher

Springer Science and Business Media LLC

Subject

Computer Networks and Communications, Hardware and Architecture, Media Technology, Software

Link

https://link.springer.com/content/pdf/10.1007/s11042-022-12720-7.pdf
