Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering

Published: 2022-07
Container-title: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence

Author:

Peng Min¹,², Wang Chongyang³, Gao Yuan⁴, Shi Yu², Zhou Xiang-Dong²

Affiliation:

1. University of Chinese Academy of Sciences

2. Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences

3. University College London

4. Shenzhen Institute of Artificial Intelligence and Robotics for Society

Abstract

Video question answering (VideoQA) is challenging because it combines visual understanding with natural language processing. Most existing approaches ignore visual appearance-motion information at different temporal scales, and it is unknown how to combine the multilevel processing capacity of a deep learning model with such multiscale information. To address these issues, this paper proposes a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA. MHN comprises two modules: Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR). With multiscale sampling, RMI iterates the interaction between the appearance-motion information at each scale and the question embeddings to build multilevel question-guided visual representations. Then, with a shared transformer encoder, PVR infers the visual cues at each level in parallel, to suit different question types that may rely on visual information at different levels. Through extensive experiments on three VideoQA datasets, we demonstrate improved performance over previous state-of-the-art methods and justify the effectiveness of each part of our method.
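The abstract does not say how the multiscale sampling is carried out. As a rough illustration only, the sketch below assumes one possible scheme: at scale s (counting from 0) the video is split into 2**s equal, non-overlapping clips, so coarser scales see long clips and finer scales see short ones. The function name `multiscale_clip_indices`, its parameters, and the power-of-two split are hypothetical choices made for this example and are not taken from the paper.

```python
from typing import List


def multiscale_clip_indices(num_frames: int, num_scales: int) -> List[List[List[int]]]:
    """Return frame indices grouped as scales -> clips -> frames.

    Assumption (not from the abstract): scale s (0-based) splits the
    video into 2**s equal, non-overlapping clips.
    """
    if num_frames < 2 ** (num_scales - 1):
        raise ValueError("too few frames for the requested number of scales")
    scales = []
    for s in range(num_scales):
        num_clips = 2 ** s
        # Integer boundaries that partition [0, num_frames) into num_clips spans.
        bounds = [i * num_frames // num_clips for i in range(num_clips + 1)]
        clips = [list(range(bounds[i], bounds[i + 1])) for i in range(num_clips)]
        scales.append(clips)
    return scales


# Example: an 8-frame video sampled at three scales (1, 2, and 4 clips).
scales = multiscale_clip_indices(num_frames=8, num_scales=3)
for level, clips in enumerate(scales):
    print(level, clips)
```

In a pipeline of the kind the abstract describes, each clip would then be passed to appearance and motion feature extractors before interacting with the question embeddings; that step is omitted here because the abstract gives no detail on it.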

Publisher

International Joint Conferences on Artificial Intelligence Organization

Cited by 12 articles.

1. Bottom-Up Hierarchical Propagation Networks with Heterogeneous Graph Modeling for Video Question Answering;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30

2. Multi-Granularity Contrastive Cross-Modal Collaborative Generation for End-to-End Long-Term Video Question Answering;IEEE Transactions on Image Processing;2024

3. Contrastive Video Question Answering via Video Graph Transformer;IEEE Transactions on Pattern Analysis and Machine Intelligence;2023-11-01

4. A multi-scale self-supervised hypergraph contrastive learning framework for video question answering;Neural Networks;2023-11

5. Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network;Proceedings of the 31st ACM International Conference on Multimedia;2023-10-26