Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering

Published: 2020-07
Container-title: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence

Author:

Zhu Zihao¹,²; Yu Jing¹,²; Wang Yujing³; Sun Yajing¹,²; Hu Yue¹,²; Wu Qi⁴

Affiliation:

1. Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China

2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China

3. Microsoft Research Asia, Beijing, China

4. University of Adelaide, Australia

Abstract

Fact-based Visual Question Answering (FVQA) requires external knowledge beyond the visible content to answer questions about an image. This ability is challenging but indispensable for achieving general VQA. One limitation of existing FVQA solutions is that they jointly embed all kinds of information without fine-grained selection, which introduces unexpected noise when reasoning toward the final answer. Capturing question-oriented and information-complementary evidence remains a key challenge. In this paper, we depict an image as a multi-modal heterogeneous graph, which contains multiple layers of information corresponding to the visual, semantic and factual features. On top of the multi-layer graph representations, we propose a modality-aware heterogeneous graph convolutional network to capture the evidence from different layers that is most relevant to the given question. Specifically, the intra-modal graph convolution selects evidence from each modality, and the cross-modal graph convolution aggregates relevant information across different graph layers. By stacking this process multiple times, our model performs iterative reasoning across three modalities and predicts the optimal answer by analyzing all question-oriented evidence. We achieve new state-of-the-art performance on the FVQA task and demonstrate the effectiveness and interpretability of our model with extensive experiments.
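
The two-stage reasoning loop outlined in the abstract can be illustrated with a toy sketch. This is not the paper's formulation: the dot-product attention, the 0.5/0.5 residual mix, the choice of the fact layer as the receiver of cross-modal messages, and every function name below (`intra_modal_step`, `cross_modal_step`, `reason`) are assumptions made only for illustration. All feature vectors are assumed to share one dimension.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def intra_modal_step(nodes, edges, question):
    # One question-guided message-passing step inside a single graph layer.
    # nodes: list of feature vectors; edges: dict {node index: [neighbour indices]}.
    updated = []
    for i, feat in enumerate(nodes):
        neigh = edges.get(i, [])
        if not neigh:
            updated.append(list(feat))  # isolated node keeps its features
            continue
        # Neighbours that match the question get larger weights (assumed scoring).
        weights = softmax([dot(nodes[j], question) for j in neigh])
        msg = [sum(w * nodes[j][d] for w, j in zip(weights, neigh))
               for d in range(len(feat))]
        updated.append([0.5 * a + 0.5 * b for a, b in zip(feat, msg)])
    return updated

def cross_modal_step(target_nodes, source_nodes, question):
    # Each target node attends over all nodes of another layer and adds the context.
    out = []
    for feat in target_nodes:
        weights = softmax([dot(feat, s) + dot(s, question) for s in source_nodes])
        ctx = [sum(w * s[d] for w, s in zip(weights, source_nodes))
               for d in range(len(feat))]
        out.append([a + b for a, b in zip(feat, ctx)])
    return out

def reason(visual, semantic, fact, edges, question, steps=2):
    # Stack the intra-modal and cross-modal steps, then score fact nodes as answers.
    for _ in range(steps):
        visual = intra_modal_step(visual, edges["visual"], question)
        semantic = intra_modal_step(semantic, edges["semantic"], question)
        fact = intra_modal_step(fact, edges["fact"], question)
        fact = cross_modal_step(fact, visual, question)
        fact = cross_modal_step(fact, semantic, question)
    scores = [dot(f, question) for f in fact]
    return scores.index(max(scores))  # index of the highest-scoring fact node
```

Calling `reason` with three small lists of equally sized vectors, an `edges` dict keyed by `"visual"`, `"semantic"` and `"fact"`, and a question vector returns the index of the fact node that scores highest after the stacked steps. A real model would use learned projections and gating instead of these fixed operations.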

Publisher

International Joint Conferences on Artificial Intelligence Organization

Cited by 61 articles.

1. Exploring coherence from heterogeneous representations for OCR image captioning; Multimedia Systems; 2024-09-06

2. Prompting Large Language Models with Knowledge-Injection for Knowledge-Based Visual Question Answering; Big Data Mining and Analytics; 2024-09

3. Caption matters: a new perspective for knowledge-based visual question answering; Knowledge and Information Systems; 2024-07-22

4. Prompting large language model with context and pre-answer for knowledge-based VQA; Pattern Recognition; 2024-07

5. DSAMR: Dual-Stream Attention Multi-hop Reasoning for knowledge-based visual question answering; Expert Systems with Applications; 2024-07