Affiliation:
1. Department of Electrical Engineering, City University of Hong Kong, Hong Kong, China
2. School of Engineering, University of Warwick, Gibbet Hill Road, Coventry CV4 7AL, UK
Abstract
Most current deep learning models offer little flexibility in their input shape. Computer vision models typically work only on the fixed input shape used during training; otherwise, their performance degrades significantly. For video-related tasks, the length of each video (i.e., the number of frames) can vary widely, so frames are sampled to give every video the same temporal length. This practice introduces drawbacks in both the training and testing phases: a single universal temporal length can damage the features of longer videos and prevents the model from flexibly adapting to variable lengths for on-demand inference. To address this, we propose a simple yet effective training paradigm for 3D convolutional neural networks (3D-CNNs) that enables them to process inputs with variable temporal length, which we call variable-length training (VLT). Compared with the standard video training paradigm, our method introduces three extra operations during training: sampling twice, temporal packing, and subvideo-independent 3D convolution. These operations are efficient and can be integrated into any 3D-CNN. In addition, we introduce a consistency loss to regularize the representation space. After training, the model can process videos of varying temporal length without any modification in the inference phase. Experiments on several popular action recognition datasets demonstrate the superior performance of the proposed method over the conventional training paradigm and other state-of-the-art training paradigms.
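To make the paradigm concrete, what follows is a minimal PyTorch sketch of the training step outlined above. All names here (vlt_step, sample_clip, T_LONG, T_SHORT, the MSE consistency loss, and the r3d_18 backbone) are illustrative assumptions rather than the paper's actual implementation; in particular, the paper packs the two sampled subvideos along the temporal axis and applies subvideo-independent 3D convolution, which this sketch approximates by running two independent forward passes through a shared backbone.

    # Minimal, assumption-laden sketch of variable-length training (VLT):
    # sample each video twice at different temporal lengths, classify both
    # clips, and regularize the two predictions with a consistency loss.
    import torch
    import torch.nn.functional as F
    from torchvision.models.video import r3d_18

    T_LONG, T_SHORT = 16, 8  # two temporal lengths per video (assumed values)

    def sample_clip(video, num_frames):
        """Uniformly sample num_frames frames from a video of shape (C, T, H, W)."""
        idx = torch.linspace(0, video.shape[1] - 1, num_frames).long()
        return video[:, idx]

    def vlt_step(model, videos, labels):
        """One training step on a list of variable-length videos."""
        long_clips = torch.stack([sample_clip(v, T_LONG) for v in videos])
        short_clips = torch.stack([sample_clip(v, T_SHORT) for v in videos])

        # Separate forward passes keep the two subvideos' receptive fields
        # disjoint, mimicking convolution that never crosses subvideo
        # boundaries (the paper instead packs clips temporally).
        logits_long = model(long_clips)
        logits_short = model(short_clips)

        cls_loss = (F.cross_entropy(logits_long, labels)
                    + F.cross_entropy(logits_short, labels))
        # Consistency loss pulling the two views of each video together
        # (MSE between logits is an assumption; a KL term would also fit).
        cons_loss = F.mse_loss(logits_short, logits_long.detach())
        return cls_loss + cons_loss

    model = r3d_18(num_classes=10)  # any 3D-CNN backbone; r3d_18 is an example

Because both clips pass through the same weights, no architectural change is needed at inference time: the trained model simply accepts whatever temporal length the deployment scenario demands.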
Funder
Research Grants Council of the Hong Kong Special Administrative Region, China