Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph
Published: 2023-03-14
Issue: 6
Volume: 12
Page: 1390
ISSN: 2079-9292
Container-title: Electronics
Short-container-title: Electronics
Language: en
Author: Jiang Lei 1, Meng Zuqiang 1
Affiliation:
1. School of Computer, Electronics and Information, Guangxi University, Nanning 530000, China
Abstract
The field of visual question answering (VQA) has seen a growing trend of integrating external knowledge sources to improve performance. However, owing to the potential incompleteness of external knowledge sources and the inherent mismatch between different forms of data, current knowledge-based visual question answering (KBVQA) techniques still face the challenge of effectively integrating and utilizing multiple heterogeneous data sources. To address this issue, a novel approach centered on a multi-modal semantic graph (MSG) is proposed. The MSG serves as a mechanism for unifying the representation of heterogeneous data and diverse types of knowledge. In addition, a multi-modal semantic graph knowledge reasoning model (MSG-KRM) is introduced to perform reasoning over, and deep fusion of, image–text information and external knowledge sources. Constructing the semantic graph involves extracting keywords from the image object detection information, the question text, and the external knowledge texts, which are represented as symbol nodes. Three semantic graphs are then built with reference to the knowledge graph, one each for the vision, question, and external knowledge text; non-symbol nodes are added to connect these three independent graphs, and all nodes and edges are marked with their respective types. During the inference stage, the multi-modal semantic graph and the image–text information are embedded into the feature semantic graph through three embedding methods, and a type-aware graph attention module is employed for deep reasoning. The final answer prediction blends the output of the pre-trained model, the graph pooling results, and the features of the non-symbol nodes. Experimental results on the OK-VQA dataset show that the MSG-KRM model surpasses existing methods in overall accuracy, achieving a score of 43.58, and improves accuracy on most question subclasses, demonstrating the effectiveness of the proposed method.
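To make the "type-aware graph attention" step of the abstract concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' MSG-KRM implementation. The class name, tensor shapes, and the particular way node-type embeddings and edge-type biases enter the attention scores are illustrative assumptions about how attention over a typed multi-modal semantic graph could be computed.

```python
# Minimal sketch of a type-aware graph attention layer (illustrative only).
# Assumes node features x [N, dim], integer node-type ids [N], integer
# edge-type ids [N, N], and an adjacency mask [N, N] that includes self-loops.
import torch
import torch.nn as nn

class TypeAwareGraphAttention(nn.Module):
    def __init__(self, dim, num_node_types, num_edge_types):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # learned embeddings let attention depend on node and edge types
        self.node_type_emb = nn.Embedding(num_node_types, dim)
        self.edge_type_bias = nn.Embedding(num_edge_types, 1)
        self.scale = dim ** -0.5

    def forward(self, x, node_types, edge_types, adj_mask):
        # inject node-type information before computing attention
        h = x + self.node_type_emb(node_types)
        q, k, v = self.q(h), self.k(h), self.v(h)
        scores = (q @ k.t()) * self.scale                      # [N, N] logits
        scores = scores + self.edge_type_bias(edge_types).squeeze(-1)
        scores = scores.masked_fill(adj_mask == 0, -1e9)       # keep graph edges only
        attn = torch.softmax(scores, dim=-1)
        return x + attn @ v                                    # residual node update

# Toy usage with random symbol/non-symbol node features
layer = TypeAwareGraphAttention(dim=64, num_node_types=4, num_edge_types=5)
x = torch.randn(6, 64)
node_types = torch.randint(0, 4, (6,))
edge_types = torch.randint(0, 5, (6, 6))
adj_mask = (torch.eye(6) + (torch.rand(6, 6) > 0.5).float()).clamp(max=1)
out = layer(x, node_types, edge_types, adj_mask)
```

Conditioning the attention logits on node and edge types is one simple way to let a single graph layer treat vision, question, and external-knowledge nodes differently while still propagating information across the connected multi-modal graph.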
Funder
National Natural Science Foundation of China
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
References (36 articles; 5 shown):
1. Marino, K., Rastegari, M., Farhadi, A., and Mottaghi, R. (2019, January 15–20). OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA.
2. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., and Parikh, D. (2015, January 7–13). VQA: Visual Question Answering. Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile.
3. Kim, J.-H., Jun, J., and Zhang, B.-T. (2018, January 3–8). Bilinear Attention Networks. Proceedings of the 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, QC, Canada.
4. Ben-Younes, H., Cadene, R., Cord, M., and Thome, N. (2017, January 22–29). MUTAN: Multimodal Tucker Fusion for Visual Question Answering. Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy.
5. Xia, Q., Yu, C., Hou, Y., Peng, P., Zheng, Z., and Chen, W. (2022). Multi-Modal Alignment of Visual Question Answering Based on Multi-Hop Attention Mechanism. Electronics, 11.
Cited by: 7 articles.