Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data
Published: 2023-05-10
Issue: 10
Volume: 12
Page: 2183
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics
Author:
Zhu He 1, Togo Ren 2, Ogawa Takahiro 2, Haseyama Miki 2
Affiliation:
1. Graduate School of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan
2. Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan
Abstract
As deep learning research continues to advance, interpretability is becoming as important as model performance. Conducting interpretability studies to understand the decision-making processes of deep learning models can improve performance and provide valuable insights for humans. The interpretability of visual question answering (VQA), a crucial task for human–computer interaction, has garnered the attention of researchers due to its wide range of applications. The generation of natural language explanations for VQA that humans can better understand has gradually supplanted heatmap representations as the mainstream focus in the field. Humans typically answer questions by first identifying the primary objects in an image and then referring to various information sources, both within and beyond the image, including prior knowledge. However, previous studies have only considered input images, resulting in insufficient information that can lead to incorrect answers and implausible explanations. To address this issue, we introduce multiple references in addition to the input image. Specifically, we propose a multimodal model that generates natural language explanations for VQA. We introduce outside knowledge using the input image and question and incorporate object information into the model through an object detection module. By increasing the information available during the model generation process, we significantly improve VQA accuracy and the reliability of the generated explanations. Moreover, we employ a simple and effective feature fusion joint vector to combine information from multiple modalities while maximizing information preservation. Qualitative and quantitative evaluation experiments demonstrate that the proposed method can generate more reliable explanations than state-of-the-art methods while maintaining answering accuracy.
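The abstract describes combining the input image, question, retrieved outside knowledge, and detected-object information into a single feature fusion joint vector before generating the answer and explanation. The sketch below is a minimal illustration of that kind of concatenation-based fusion, not the paper's implementation; the module name, feature dimensions, and projection layer are assumptions made for illustration.

```python
# Minimal sketch of a concatenation-based "feature fusion joint vector",
# assuming each modality has already been encoded into a fixed-size vector.
# Dimensions, layer names, and the projection design are illustrative only.
import torch
import torch.nn as nn


class JointFeatureFusion(nn.Module):
    def __init__(self, image_dim=2048, question_dim=768,
                 knowledge_dim=768, object_dim=1024, joint_dim=1024):
        super().__init__()
        # Project the concatenated multimodal features into one joint vector.
        self.fusion = nn.Sequential(
            nn.Linear(image_dim + question_dim + knowledge_dim + object_dim,
                      joint_dim),
            nn.ReLU(),
        )

    def forward(self, image_feat, question_feat, knowledge_feat, object_feat):
        # Concatenate along the feature dimension to preserve all information,
        # then project to the joint representation fed to the decoder.
        joint = torch.cat(
            [image_feat, question_feat, knowledge_feat, object_feat], dim=-1)
        return self.fusion(joint)


# Usage with random placeholder features (batch size 2).
fusion = JointFeatureFusion()
v = fusion(torch.randn(2, 2048), torch.randn(2, 768),
           torch.randn(2, 768), torch.randn(2, 1024))
print(v.shape)  # torch.Size([2, 1024])
```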
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
Cited by
3 articles.