Authors:
Chen Jingwen, Pan Yingwei, Li Yehao, Yao Ting, Chao Hongyang, Mei Tao
Abstract
It is widely believed that video captioning is a fundamental but challenging task in both the computer vision and artificial intelligence fields. The prevalent approach is to map an input video to a variable-length output sentence in a sequence-to-sequence manner via a Recurrent Neural Network (RNN). Nevertheless, the training of RNNs still suffers to some degree from the vanishing/exploding gradient problem, which makes optimization difficult. Moreover, the inherently recurrent dependency in RNNs prevents parallelization within a sequence during training and therefore limits computational efficiency. In this paper, we present a novel design, Temporal Deformable Convolutional Encoder-Decoder Networks (dubbed TDConvED), which fully employ convolutions in both the encoder and decoder networks for video captioning. Technically, we exploit convolutional block structures that compute intermediate states over a fixed number of inputs, and we stack several blocks to capture long-term relationships. The structure in the encoder is further equipped with temporal deformable convolution to enable free-form deformation of temporal sampling. Our model also capitalizes on a temporal attention mechanism for sentence generation. Extensive experiments are conducted on both the MSVD and MSR-VTT video captioning datasets, and superior results are reported compared with conventional RNN-based encoder-decoder techniques. More remarkably, TDConvED increases CIDEr-D performance from 58.8% to 67.2% on MSVD.
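The idea of "free-form deformation of temporal sampling" mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one scalar weight per kernel tap, takes the per-tap offsets as a given input (in a trained model such offsets would be predicted from the features), and uses linear interpolation with zero padding to read features at fractional time positions. The names `sample_frame` and `temporal_deformable_conv` are hypothetical.

```python
import math


def sample_frame(frames, pos):
    """Linearly interpolate a feature vector at fractional time `pos`.

    Positions outside the clip contribute zeros (zero padding).
    """
    dim = len(frames[0])
    lo = math.floor(pos)
    frac = pos - lo
    out = [0.0] * dim
    # Blend the two neighbouring frames by their distance to `pos`.
    for idx, weight in ((lo, 1.0 - frac), (lo + 1, frac)):
        if 0 <= idx < len(frames) and weight > 0.0:
            for d in range(dim):
                out[d] += weight * frames[idx][d]
    return out


def temporal_deformable_conv(frames, tap_weights, offsets):
    """1D convolution over time whose sampling positions are shifted.

    frames:      list of per-frame feature vectors (lists of floats)
    tap_weights: one scalar weight per kernel tap (a simplifying assumption)
    offsets:     offsets[t][j] is the temporal shift for tap j at output step t
                 (hypothetical input; all zeros gives an ordinary convolution)
    """
    k = len(tap_weights)
    half = k // 2
    dim = len(frames[0])
    outputs = []
    for t in range(len(frames)):
        acc = [0.0] * dim
        for j in range(k):
            # Regular grid position plus the per-tap offset.
            pos = t + (j - half) + offsets[t][j]
            sampled = sample_frame(frames, pos)
            for d in range(dim):
                acc[d] += tap_weights[j] * sampled[d]
        outputs.append(acc)
    return outputs


if __name__ == "__main__":
    clip = [[1.0], [2.0], [4.0]]
    weights = [0.25, 0.5, 0.25]
    no_shift = [[0.0, 0.0, 0.0] for _ in clip]
    print(temporal_deformable_conv(clip, weights, no_shift))
```

With all offsets set to zero the function reduces to a standard zero-padded temporal convolution; non-zero offsets let each tap read from a position between frames, which is the behaviour the abstract refers to as deformable sampling.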
Publisher
Association for the Advancement of Artificial Intelligence (AAAI)
Cited by
47 articles.
1. Boosting Semi-Supervised Video Captioning via Learning Candidates Adjusters;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-07-11
2. ATCE: Adaptive Temporal Context Exploitation for Weakly-Supervised Temporal Action Localization;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30
3. AI Enhanced Video Sequence Description Generator;2024 International Conference on Advances in Data Engineering and Intelligent Computing Systems (ADICS);2024-04-18
4. Video captioning – a survey;Multimedia Tools and Applications;2024-04-09
5. Joint multi-scale information and long-range dependence for video captioning;International Journal of Multimedia Information Retrieval;2023-11-14