Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering

Published:2019-07-17 Issue: Volume:33 Page:8658-8665
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
language:
Short-container-title:AAAI

Author:

Li Xiangpeng,Song Jingkuan,Gao Lianli,Liu Xianglong,Huang Wenbing,He Xiangnan,Gan Chuang

Abstract

Most recent progress on visual question answering is based on recurrent neural networks (RNNs) with attention. Despite their success, these models are often time-consuming and have difficulty modeling long-range dependencies because of the sequential nature of RNNs. We propose a new architecture, Positional Self-Attention with Co-attention (PSAC), which does not require RNNs for video question answering. Specifically, inspired by the success of self-attention in machine translation, we propose a Positional Self-Attention that calculates the response at each position by attending to all positions within the same sequence and then adds representations of absolute positions. PSAC can therefore exploit the global dependencies of the question and of the temporal information in the video, and it allows question encoding and video encoding to run in parallel. Furthermore, in addition to attending to the video features relevant to the given question (i.e., video attention), we use a co-attention mechanism that simultaneously models “what words to listen to” (question attention). To the best of our knowledge, this is the first work to replace RNNs with self-attention for the task of visual question answering. Experimental results on four tasks of the benchmark dataset show that our model significantly outperforms the state of the art on three tasks and attains a comparable result on the Count task. Our model requires less computation time and achieves better performance than RNN-based methods. An additional ablation study demonstrates the effect of each component of the proposed model.
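The positional self-attention step described in the abstract (each position attends to every position of the same sequence, then an absolute-position representation is added) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the identity query/key/value projections, the scaled dot-product scoring, and the sinusoidal position encoding are all assumptions made for illustration, since this record does not give those details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sinusoidal_position(pos, dim):
    """Assumed absolute-position encoding (sinusoidal); the paper's choice may differ."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def self_attention(seq):
    """Scaled dot-product self-attention with identity projections (an assumption).

    Every position attends to all positions of the same sequence, so no
    step depends on the output of a previous time step as it would in an RNN.
    """
    dim = len(seq[0])
    attended = []
    for query in seq:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in seq
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * value[j] for w, value in zip(weights, seq)) for j in range(dim)]
        )
    return attended


def positional_self_attention(seq):
    """Attend within the sequence, then add absolute-position representations."""
    dim = len(seq[0])
    return [
        [a + p for a, p in zip(row, sinusoidal_position(t, dim))]
        for t, row in enumerate(self_attention(seq))
    ]


# Hypothetical input: three frame (or word) feature vectors of dimension 4.
frames = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
encoded = positional_self_attention(frames)
```

Because each output row is computed only from the input sequence, the rows can be computed independently of one another, which is the property the abstract relies on for parallel question and video encoding.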

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 105 articles.

1. Harnessing Representative Spatial-Temporal Information for Video Question Answering;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-07-05

2. Hierarchical synchronization with structured multi-granularity interaction for video question answering;Neurocomputing;2024-05

3. Semantic Enrichment for Video Question Answering with Gated Graph Neural Networks;ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2024-04-14

4. CAD - Contextual Multi-modal Alignment for Dynamic AVQA;2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV);2024-01-03

5. Multi-Granularity Contrastive Cross-Modal Collaborative Generation for End-to-End Long-Term Video Question Answering;IEEE Transactions on Image Processing;2024