Hierarchical Synergy-Enhanced Multimodal Relational Network for Video Question Answering

Published: 2023-12-11 Issue: 4 Volume: 20 Page: 1-22
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Peng Min¹, Shao Xiaohu², Shi Yu³, Zhou Xiangdong³

Affiliation:

1. Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, China and Chongqing School, University of Chinese Academy of Sciences, China

2. Beijing IDRIVERPLUS Technology Co., Ltd, China

3. Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences, China

Abstract

Video question answering (VideoQA) is challenging as it requires reasoning about natural language and multimodal interactive relations. Most existing methods apply attention mechanisms to extract interactions between the question and the video or to extract effective spatio-temporal relational representations. However, these methods neglect the implication of relations between intra- and inter-modal interactions for multimodal learning, and they fail to fully exploit the synergistic effect of multiscale semantics in answer reasoning. In this article, we propose a novel hierarchical synergy-enhanced multimodal relational network (HMRNet) to address these issues. Specifically, we devise (i) a compact and unified relation-oriented interaction module that explores the relation between intra- and inter-modal interactions to enable effective multimodal learning; and (ii) a hierarchical synergistic memory unit that leverages a memory-based interaction scheme to complement and fuse multimodal semantics at multiple scales to achieve synergistic enhancement of answer reasoning. With careful design of each component, our HMRNet has fewer parameters and is computationally efficient. Extensive experiments and qualitative analyses demonstrate that the HMRNet is superior to previous state-of-the-art methods on eight benchmark datasets. We also demonstrate the effectiveness of the different components of our method.

Funder

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3630101

References (74 articles)

1. Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning

2. VQA: Visual Question Answering

3. Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. 2014. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, 8th Workshop on Syntax, Semantics, and Structure in Statistical Translation. ACL, Doha, Qatar, 103–111.

4. Long Hoang Dang, Thao Minh Le, Vuong Le, and Truyen Tran. 2021. Hierarchical object-oriented spatio-temporal reasoning for video question answering. In Proceedings of the 30th International Joint Conference on Artificial Intelligence, IJCAI-21. IJCAI, Montreal, Canada, 636–642.

5. Heterogeneous Memory Enhanced Multimodal Attention Model for Video Question Answering