Multimodal Natural Language Explanation Generation for Visual Question Answering Based on Multiple Reference Data
Published: 2023-05-10
Issue: 10
Volume: 12
Page: 2183
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics
Author:
Zhu He 1, Togo Ren 2, Ogawa Takahiro 2, Haseyama Miki 2
Affiliation:
1. Graduate School of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan
2. Faculty of Information Science and Technology, Hokkaido University, N-14, W-9, Kita-ku, Sapporo 060-0814, Japan
Abstract
As deep learning research continues to advance, interpretability is becoming as important as model performance. Conducting interpretability studies to understand the decision-making processes of deep learning models can improve performance and provide valuable insights for humans. The interpretability of visual question answering (VQA), a crucial task for human–computer interaction, has garnered the attention of researchers due to its wide range of applications. The generation of natural language explanations for VQA that humans can better understand has gradually supplanted heatmap representations as the mainstream focus in the field. Humans typically answer questions by first identifying the primary objects in an image and then referring to various information sources, both within and beyond the image, including prior knowledge. However, previous studies have only considered input images, resulting in insufficient information that can lead to incorrect answers and implausible explanations. To address this issue, we introduce multiple references in addition to the input image. Specifically, we propose a multimodal model that generates natural language explanations for VQA. We introduce outside knowledge using the input image and question and incorporate object information into the model through an object detection module. By increasing the information available during the model generation process, we significantly improve VQA accuracy and the reliability of the generated explanations. Moreover, we employ a simple and effective feature fusion joint vector to combine information from multiple modalities while maximizing information preservation. Qualitative and quantitative evaluation experiments demonstrate that the proposed method can generate more reliable explanations than state-of-the-art methods while maintaining answering accuracy.
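The abstract describes combining the input image, question, retrieved outside knowledge, and detected-object information into a single feature fusion joint vector before generating the answer and explanation. The sketch below is a minimal illustration of that kind of concatenation-based fusion, not the paper's implementation; the module name, feature dimensions, and projection layer are assumptions made for illustration.

```python
# Minimal sketch of a concatenation-based "feature fusion joint vector",
# assuming each modality has already been encoded into a fixed-size vector.
# Dimensions, layer names, and the projection design are illustrative only.
import torch
import torch.nn as nn


class JointFeatureFusion(nn.Module):
    def __init__(self, image_dim=2048, question_dim=768,
                 knowledge_dim=768, object_dim=1024, joint_dim=1024):
        super().__init__()
        # Project the concatenated multimodal features into one joint vector.
        self.fusion = nn.Sequential(
            nn.Linear(image_dim + question_dim + knowledge_dim + object_dim,
                      joint_dim),
            nn.ReLU(),
        )

    def forward(self, image_feat, question_feat, knowledge_feat, object_feat):
        # Concatenate along the feature dimension to preserve all information,
        # then project to the joint representation fed to the decoder.
        joint = torch.cat(
            [image_feat, question_feat, knowledge_feat, object_feat], dim=-1)
        return self.fusion(joint)


# Usage with random placeholder features (batch size 2).
fusion = JointFeatureFusion()
v = fusion(torch.randn(2, 2048), torch.randn(2, 768),
           torch.randn(2, 768), torch.randn(2, 1024))
print(v.shape)  # torch.Size([2, 1024])
```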
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
Cited by
3 articles.