Visual Question Answering reasoning with external knowledge based on bimodal graph neural network

Published:2023 Issue:4 Volume:31 Page:1948-1965
ISSN:2688-1594
Container-title:Electronic Research Archive
Short-container-title:era

Author:

Yang Zhenyu¹²,Wu Lei³,Wen Peian²,Chen Peng²

Affiliation:

1. Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 314099, China

2. School of Computer and Software Engineering, Xihua University, Chengdu 610039, China

3. School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China

Abstract

Visual Question Answering (VQA) with external knowledge requires combining external knowledge with visual content to answer questions about images. A limitation of existing VQA solutions is that they must identify task-related information in the given images, questions, and knowledge graphs. The information identified in the different modalities must then be fused and embedded properly to reduce noise and ease cross-modality reasoning in VQA models. However, how to integrate information across modalities and reason jointly over it to find the evidence needed to predict the correct answer still deserves further study. This paper proposes a bimodal Graph Neural Network model combining pre-trained Language Models and Knowledge Graphs (BIGNN-LM-KG). We build the concept graph from the image concepts and the question concepts, exploiting the combined reasoning advantages of LM+KG. Specifically, we use the KG to jointly infer the image and question entity concepts and build the concept graph, and we use the LM to compute relevance scores for screening the nodes and paths of the concept graph. Then, we form a visual graph from the visual and spatial features of the filtered image entities. We use an improved GNN to learn representations of the two graphs, and a modality fusion GNN to fuse the information from the two modality graphs and predict the most likely answer. On the common VQA dataset, the proposed model obtains good experimental results. The results also verify the validity of each component of the model and the interpretability of the model.
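The node-screening step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how the LM relevance score is computed, so the sketch substitutes a toy token-overlap score, and the names `overlap_score`, `prune_concept_graph`, and the `keep` parameter are hypothetical. A pre-trained LM scorer could be passed in through `score_fn` instead.

```python
from typing import Callable, Dict, List, Set, Tuple

# (head concept, relation, tail concept); this triple layout is an assumption.
Edge = Tuple[str, str, str]


def overlap_score(context: str, concept: str) -> float:
    """Toy stand-in for an LM relevance score (assumption, not the paper's scorer).

    Returns the fraction of the concept's tokens that occur in the context.
    """
    context_tokens = set(context.lower().split())
    concept_tokens = concept.lower().replace("_", " ").split()
    if not concept_tokens:
        return 0.0
    hits = sum(1 for tok in concept_tokens if tok in context_tokens)
    return hits / len(concept_tokens)


def prune_concept_graph(
    context: str,
    nodes: List[str],
    edges: List[Edge],
    score_fn: Callable[[str, str], float] = overlap_score,
    keep: int = 3,
) -> Tuple[Set[str], List[Edge], Dict[str, float]]:
    """Score every concept node against the context and keep the top `keep` nodes.

    Edges survive only if both endpoints survive, so paths through
    low-relevance concepts are dropped as well.
    """
    scores = {node: score_fn(context, node) for node in nodes}
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(nodes, key=lambda node: (-scores[node], node))
    kept = set(ranked[:keep])
    kept_edges = [e for e in edges if e[0] in kept and e[2] in kept]
    return kept, kept_edges, scores


# Hypothetical example: the question text plus labels of detected image entities.
question = "what is the man riding on the beach"
detected_objects = ["man", "horse", "beach"]
context = question + " " + " ".join(detected_objects)

nodes = ["man", "horse", "beach", "stock_market", "saddle"]
edges = [
    ("man", "rides", "horse"),
    ("horse", "wears", "saddle"),
    ("horse", "at", "beach"),
    ("stock_market", "related_to", "horse"),
]

kept, kept_edges, scores = prune_concept_graph(context, nodes, edges, keep=3)
print(sorted(kept))
print(kept_edges)
```

Keeping only edges whose endpoints both survive is one simple way to screen paths together with nodes. The abstract states that paths are screened as well but does not say how, so this choice is an assumption.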

Publisher

American Institute of Mathematical Sciences (AIMS)

Subject

General Mathematics
