Affiliation:
1. Department of Computer Science and Technology, Tsinghua University
2. Tsinghua Shenzhen International Graduate School
3. ARC, Tencent PCG
Abstract
Recently, Recurrent Neural Network (RNN) based methods and Self-Attention (SA) based methods have achieved promising performance in Video Question Answering (VideoQA). Despite the success of these works, RNN-based methods tend to forget the global semantic contents due to the inherent drawbacks of the recurrent units themselves, while SA-based methods cannot precisely capture the dependencies of the local neighborhood, leading to insufficient modeling for temporal order. To tackle these problems, we propose a novel VideoQA framework which progressively refines the representations of videos and questions from fine to coarse grain in a sequence-sensitive manner. Specifically, our model improves the feature representations via the following two steps: (1) introducing two fine-grained feature-augmented memories to strengthen the information augmentation of video and text which can improve memory capacity by memorizing more relevant and targeted information. (2) appending the self-attention and co-attention module to the memory output thus the module is able to capture global interaction between high-level semantic informations. Experimental results show that our approach achieves state-of-the-art performance on VideoQA benchmark datasets.
Publisher
International Joint Conferences on Artificial Intelligence Organization
Cited by
8 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献