Divide and Conquer: Question-Guided Spatio-Temporal Contextual Attention for Video Question Answering

Published: 2020-04-03 Issue: 07 Volume: 34 Page: 11101-11108
ISSN: 2374-3468
Container-title: Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title: AAAI

Author:

Jiang Jianwen, Chen Ziqiang, Lin Haojie, Zhao Xibin, Gao Yue

Abstract

Understanding the question and finding clues for the answer are the keys to video question answering. Compared with image question answering, video question answering (Video QA) requires finding clues accurately along both the spatial and temporal dimensions simultaneously, and is thus more challenging. However, most existing Video QA methods still do not make good use of the relationship between spatio-temporal information and the question. To tackle this problem, we propose the Question-Guided Spatio-Temporal Contextual Attention Network (QueST). In QueST, we divide the semantic features generated from the question into two separate parts, a spatial part and a temporal part, which respectively guide the construction of contextual attention along the spatial and temporal dimensions. Under the guidance of the corresponding contextual attention, visual features can be better exploited along both dimensions. To evaluate the effectiveness of the proposed method, we conduct experiments on the TGIF-QA, MSRVTT-QA, and MSVD-QA datasets. Experimental results and comparisons with state-of-the-art methods show that our method achieves superior performance.
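
The divide-and-conquer idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' QueST implementation: the function name `question_guided_attention`, the projection matrices `w_spatial` and `w_temporal`, the scaled dot-product scoring, and all tensor shapes are assumptions chosen for illustration. The sketch only shows one way a single question embedding might be split into a spatial guide and a temporal guide, each driving attention along its own dimension.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def question_guided_attention(video, question, w_spatial, w_temporal):
    """Hypothetical two-stage attention (an assumption, not the paper's model).

    video:      (T, R, D) features for T frames with R regions each
    question:   (Dq,) pooled question embedding
    w_spatial:  (Dq, D) projection producing the spatial guide
    w_temporal: (Dq, D) projection producing the temporal guide
    """
    # Divide: derive two separate guidance vectors from the same question.
    q_spatial = question @ w_spatial      # (D,)
    q_temporal = question @ w_temporal    # (D,)
    scale = np.sqrt(video.shape[-1])

    # Spatial attention: weight the regions inside every frame.
    spatial_weights = softmax(video @ q_spatial / scale, axis=1)       # (T, R)
    frame_feats = (spatial_weights[..., None] * video).sum(axis=1)     # (T, D)

    # Temporal attention: weight the frames of the clip.
    temporal_weights = softmax(frame_feats @ q_temporal / scale, axis=0)  # (T,)
    video_feat = temporal_weights @ frame_feats                        # (D,)
    return video_feat, spatial_weights, temporal_weights


# Usage with random stand-in features (8 frames, 49 regions, 64-dim).
rng = np.random.default_rng(0)
video = rng.normal(size=(8, 49, 64))
question = rng.normal(size=(32,))
w_spatial = rng.normal(size=(32, 64)) * 0.1
w_temporal = rng.normal(size=(32, 64)) * 0.1

video_feat, spatial_weights, temporal_weights = question_guided_attention(
    video, question, w_spatial, w_temporal
)
```

In this sketch the spatial weights sum to one within each frame and the temporal weights sum to one across frames, so `video_feat` is a question-dependent weighted average of the region features. The paper's contextual attention is presumably more elaborate than this single dot-product step.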

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)


Cited by 47 articles.

1. Bottom-Up Hierarchical Propagation Networks with Heterogeneous Graph Modeling for Video Question Answering;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30

2. Hierarchical synchronization with structured multi-granularity interaction for video question answering;Neurocomputing;2024-05

3. Video Q&A based on two-stage deep exploration of temporally-evolving features with enhanced cross-modal attention mechanism;Neural Computing and Applications;2024-02-27

4. A multimodal fusion-based deep learning framework combined with local-global contextual TCNs for continuous emotion recognition from videos;Applied Intelligence;2024-02

5. CAD - Contextual Multi-modal Alignment for Dynamic AVQA;2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV);2024-01-03