Dilated Multi-Temporal Modeling for Action Recognition-Reference-Cited by-同舟云学术

Dilated Multi-Temporal Modeling for Action Recognition

Published:2023-06-08 Issue:12 Volume:13 Page:6934
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Zhang Tao¹^ORCID,Wu Yifan¹,Li Xiaoqiang¹^ORCID

Affiliation:

1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China

Abstract

Action recognition involves capturing temporal information from video clips where the duration varies with videos for the same action. Due to the diverse scale of temporal context, uniform size kernels utilized in convolutional neural networks (CNNs) limit the capability of multiple-scale temporal modeling. In this paper, we propose a novel dilated multi-temporal (DMT) module that provides a solution for modeling multi-temporal information in action recognition. By using dilated convolutions with different dilation rates in different feature map channels, the DMT module captures information at multiple scales without the need for costly multi-branch networks, input-level frame pyramids, or feature map stacking that previous works have usually incurred. Therefore, this approach enables the integration of temporal information from multiple scales. In addition, the DMT module can be integrated into existing 2D CNNs, making it a straightforward and intuitive solution for addressing the challenge of multi-temporal modeling. Our proposed method has demonstrated promising results in performance and has achieved about 2% and 1% accuracy improvement on FineGym99 and SthV1. We conducted an empirical analysis that demonstrates how DMT improves the classification accuracy for action classes with varying durations.

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science

Link

https://www.mdpi.com/2076-3417/13/12/6934/pdf

Reference41 articles.

1. Two-stream convolutional networks for action recognition in videos;Simonyan;Adv. Neural Inf. Process. Syst.,2014

2. 3D convolutional neural networks for human action recognition;Ji;IEEE Trans. Pattern Anal. Mach. Intell.,2012

3. Tran, D., Bourdev, L., Fergus, R., Torresani, L., and Paluri, M. (2015, January 7–13). Learning spatiotemporal features with 3d convolutional networks. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.

4. Qiu, Z., Yao, T., and Mei, T. (2017, January 22–29). Learning spatio-temporal representation with pseudo-3d residual networks. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.

5. Tran, D., Wang, H., Torresani, L., Ray, J., LeCun, Y., and Paluri, M. (2018, January 18–22). A closer look at spatiotemporal convolutions for action recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA.