Affiliation:
1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
Abstract
Action recognition requires capturing temporal information from video clips in which the duration of the same action varies across videos. Because temporal context occurs at diverse scales, the uniform kernel sizes used in convolutional neural networks (CNNs) limit their capacity for multi-scale temporal modeling. In this paper, we propose a novel dilated multi-temporal (DMT) module that models multi-temporal information for action recognition. By applying dilated convolutions with different dilation rates to different groups of feature-map channels, the DMT module captures information at multiple temporal scales without the costly multi-branch networks, input-level frame pyramids, or feature-map stacking that previous works have typically incurred, and it integrates temporal information across those scales. In addition, the DMT module can be inserted into existing 2D CNNs, making it a straightforward and intuitive solution to the challenge of multi-temporal modeling. Our proposed method achieves promising performance, improving accuracy by about 2% on FineGym99 and about 1% on SthV1. We also conduct an empirical analysis showing how DMT improves classification accuracy for action classes of varying durations.
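To make the channel-wise dilation idea concrete, the following is a minimal NumPy sketch of the mechanism the abstract describes: splitting the channels of a temporal feature map into groups and convolving each group along the time axis with a different dilation rate before recombining them. The function names (`dilated_temporal_conv`, `dmt_block`), the choice of dilation rates, and the averaging kernel taps are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def dilated_temporal_conv(x, kernel, dilation):
    """Depthwise 1D convolution along the temporal axis with 'same'
    zero padding.  x: (C, T) feature map, kernel: (C, K) per-channel taps."""
    C, T = x.shape
    K = kernel.shape[1]
    pad = dilation * (K - 1) // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))  # zero-pad the time axis only
    out = np.zeros((C, T), dtype=float)
    for k in range(K):
        # tap k reads the input shifted by k * dilation time steps
        out += kernel[:, k:k + 1] * xp[:, k * dilation:k * dilation + T]
    return out

def dmt_block(x, rates=(1, 2, 3, 4), K=3):
    """Hypothetical DMT-style block: partition channels into len(rates)
    groups, convolve each group with its own dilation rate, and keep the
    groups concatenated in place.  A single pass thus mixes several
    temporal receptive-field sizes without extra branches."""
    C, T = x.shape
    groups = np.array_split(np.arange(C), len(rates))
    kernel = np.full((C, K), 1.0 / K)  # illustrative averaging taps
    out = np.empty((C, T), dtype=float)
    for idx, r in zip(groups, rates):
        out[idx] = dilated_temporal_conv(x[idx], kernel[idx], r)
    return out
```

With kernel size 3, a channel group using dilation rate r sees a temporal window of 2r + 1 frames, so larger rates cover longer actions while the overall parameter count stays that of a single depthwise convolution.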
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science