MCMNET: Multi-Scale Context Modeling Network for Temporal Action Detection
Author:
Zhang Haiping 1,2, Zhou Fuxing 3, Ma Conghao 3, Wang Dongjing 1, Zhang Wanjun 2
Affiliation:
1. School of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
2. School of Information Engineering, Hangzhou Dianzi University, Hangzhou 310005, China
3. School of Electronics and Information, Hangzhou Dianzi University, Hangzhou 310018, China
Abstract
Temporal action detection is an important and challenging task in video understanding, especially on datasets with large variation in action duration, where the temporal relationships between action instances are complex. For such videos, it is necessary to capture information across as rich a temporal distribution as possible. In this paper, we propose a dual-stream model that models contextual information at multiple temporal scales. The input video is first divided into two resolution streams, followed by a Multi-Resolution Context Aggregation module that captures multi-scale temporal information. An Information Enhancement module is then added after the high-resolution stream to model both long-range and short-range contexts. Finally, the outputs of the two modules are merged to obtain features with rich temporal information for action localization and classification. We evaluated the proposed approach on three datasets: on ActivityNet-v1.3 it achieves an average mAP (mean Average Precision) of 32.83%; on Charades it obtains the best performance, with an average mAP of 27.3%; and on TSU (Toyota Smarthome Untrimmed) it achieves an average mAP of 33.1%.
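The dual-stream idea described above can be sketched in a few lines: keep one stream at the original temporal resolution, downsample a second stream to capture longer-range context, then merge the two back into a single feature sequence. This is a minimal NumPy illustration only; the function names, the pooling/upsampling choices, and the element-wise merge are all assumptions standing in for the paper's learned Multi-Resolution Context Aggregation and Information Enhancement modules.

```python
import numpy as np

def temporal_downsample(features, factor):
    # Average-pool along the temporal axis to build the coarse stream.
    T, C = features.shape
    T_out = T // factor
    return features[:T_out * factor].reshape(T_out, factor, C).mean(axis=1)

def temporal_upsample(features, factor):
    # Nearest-neighbor upsampling back to the fine temporal resolution.
    return np.repeat(features, factor, axis=0)

def dual_stream_fusion(features, factor=2):
    # High-resolution stream keeps the original temporal granularity;
    # the low-resolution stream summarizes longer-range context.
    high = features
    low = temporal_downsample(features, factor)
    low_up = temporal_upsample(low, factor)
    # Element-wise averaging is a placeholder for the paper's learned
    # aggregation; it simply aligns and merges the two streams.
    return (high[:low_up.shape[0]] + low_up) / 2

feats = np.random.rand(16, 8)   # 16 temporal steps, 8-dim features
fused = dual_stream_fusion(feats)
print(fused.shape)  # (16, 8)
```

The merged sequence has the same temporal length as the high-resolution input, so downstream localization and classification heads can operate on it frame by frame.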
Funder
University Research Initiation Fund
Subject
Electrical and Electronic Engineering; Biochemistry; Instrumentation; Atomic and Molecular Physics, and Optics; Analytical Chemistry