Structured Two-Stream Attention Network for Video Question Answering

Published: 2019-07-17 Volume: 33 Page: 6391-6398
ISSN: 2374-3468
Container-title: Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title: AAAI

Author:

Gao Lianli, Zeng Pengpeng, Song Jingkuan, Li Yuan-Fang, Liu Wu, Mei Tao, Shen Heng Tao

Abstract

To date, visual question answering (VQA) (i.e., image QA and video QA) is still a holy grail in vision and language understanding, especially for video QA. Compared with image QA, which focuses primarily on understanding the associations between image region-level details and corresponding questions, video QA requires a model to jointly reason across both spatial and long-range temporal structures of a video as well as text to provide an accurate answer. In this paper, we specifically tackle the problem of video QA by proposing a Structured Two-stream Attention network, namely STA, to answer a free-form or open-ended natural language question about the content of a given video. First, we infer rich long-range temporal structures in videos using our structured segment component and encode text features. Then, our structured two-stream attention component simultaneously localizes important visual instances, reduces the influence of background video, and focuses on the relevant text. Finally, the structured two-stream fusion component incorporates different segments of query- and video-aware context representations and infers the answers. Experiments on the large-scale video QA dataset TGIF-QA show that our proposed method significantly surpasses the best counterpart (i.e., with one representation for the video input) by 13.0%, 13.5%, 11.0% and 0.3 on the Action, Trans., FrameQA and Count tasks, respectively. It also outperforms the best competitor (i.e., with two representations) on the Action, Trans., and FrameQA tasks by 4.1%, 4.7%, and 5.1%, respectively.
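
The three stages named in the abstract (segmenting the video, attending in two directions, and fusing per-segment context) can be illustrated with a toy sketch. This is not the STA implementation: the abstract gives no equations, so plain dot-product attention, mean pooling, and concatenation are stand-ins chosen for illustration, and every function name below (`split_segments`, `attend`, `two_stream_fuse`) is hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    # Element-wise mean of equal-length vectors.
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def split_segments(frames, num_segments):
    # Assumption: contiguous, near-equal segments; needs len(frames) >= num_segments.
    n = len(frames)
    bounds = [i * n // num_segments for i in range(num_segments + 1)]
    return [frames[bounds[i]:bounds[i + 1]] for i in range(num_segments)]

def attend(query, keys):
    # Dot-product attention (a stand-in, not the paper's formulation):
    # returns the attention-weighted sum of keys and the weights.
    weights = softmax([dot(query, k) for k in keys])
    dim = len(keys[0])
    context = [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
    return context, weights

def two_stream_fuse(frames, words, num_segments=2):
    # Assumption: frame and word features share one dimensionality.
    question = mean(words)
    fused = []
    for segment in split_segments(frames, num_segments):
        video_ctx, _ = attend(question, segment)    # question-guided attention over frames
        text_ctx, _ = attend(mean(segment), words)  # video-guided attention over words
        fused.append(video_ctx + text_ctx)          # list concatenation of the two streams
    return mean(fused)                              # pool the per-segment representations

frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
words = [[1.0, 0.0], [0.0, 1.0]]
representation = two_stream_fuse(frames, words)
```

In a real model the pooled representation would feed an answer decoder, and the features would come from learned video and text encoders rather than hand-written vectors.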

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 25 articles.

1. Harnessing Representative Spatial-Temporal Information for Video Question Answering;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-07-05

2. Hierarchical synchronization with structured multi-granularity interaction for video question answering;Neurocomputing;2024-05

3. Transform-Equivariant Consistency Learning for Temporal Sentence Grounding;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-01-11

4. Object-based Appearance-Motion Heterogeneous Network for Video Question Answering;2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS);2023-12-17

5. Multi-Granularity Interaction and Integration Network for Video Question Answering;IEEE Transactions on Circuits and Systems for Video Technology;2023-12