A Tracking-Based Two-Stage Framework for Spatio-Temporal Action Detection
Published: 2024-01-23
Volume: 13, Issue: 3, Page: 479
ISSN: 2079-9292
Container-title: Electronics
Short-container-title: Electronics
Language: en
Author:
Luo Jing 1, Yang Yulin 1,2, Liu Rongkai 1, Chen Li 1, Fei Hongxiao 1, Hu Chao 3,4, Shi Ronghua 3, Zou You 5
Affiliation:
1. School of Computer, Central South University, Changsha 410000, China
2. Hunan Hanma Technology Co., Ltd., Changsha 410083, China
3. School of Electronic Information, Central South University, Changsha 410000, China
4. Hunan “the 14th Five-Year Plan” Research Base of Education Sciences (Research on Educational Informatization), Central South University, Changsha 410083, China
5. Information and Networking Center, Central South University, Changsha 410083, China
Abstract
Spatio-temporal action detection (STAD) is a task receiving widespread attention, with numerous application scenarios such as video surveillance and smart education. Current studies follow a localization-based two-stage detection paradigm, which exploits a person detector for action localization and a feature processing model with a classifier for action classification. However, many issues arise from the imbalance between task settings and model complexity in STAD. Firstly, the complexity of heavy offline person detectors adds to the inference overhead. Secondly, frame-level actor proposals are incompatible with the video-level feature aggregation and Region-of-Interest feature pooling used in action classification, which limits detection performance under diverse action motions and results in low detection accuracy. In this paper, we propose a tracking-based two-stage spatio-temporal action detection framework called TrAD. The key idea of TrAD is to build video-level consistency and reduce model complexity by generating action track proposals spanning multiple video frames instead of actor proposals in a single frame. In particular, we utilize tailored tracking to simulate the behavior of human cognitive actions and use the captured motion trajectories as video-level proposals. We then integrate a proposal scaling method and a feature aggregation module into action classification to enhance feature pooling for detected tracks. Evaluations on the AVA dataset demonstrate that TrAD achieves SOTA performance with 29.7 mAP, while also facilitating a 58% reduction in overall computation compared to SlowFast.
Funder
High Performance Computing Center of Central South University; National Natural Science Foundation; Hunan Educational Science; Hunan Social Science Foundation; Central South University Graduate Education Teaching Reform Project; Hunan Provincial Archives Technology Project
Subject
Electrical and Electronic Engineering; Computer Networks and Communications; Hardware and Architecture; Signal Processing; Control and Systems Engineering
References: 43 articles.