MSGeN: Multimodal Selective Generation Network for Grounded Explanations
Published: 2023-12-29
Volume: 13, Issue: 1, Page: 152
ISSN: 2079-9292
Container-title: Electronics
Short-container-title: Electronics
Language: en
Author:
Li Dingbang 1, Chen Wenzhou 2, Lin Xin 1
Affiliation:
1. School of Computer Science and Technology, East China Normal University, Shanghai 200241, China
2. College of Computer Science, Hangzhou Dianzi University, Hangzhou 310018, China
Abstract:
Modern models have shown impressive capabilities in visual reasoning tasks; however, the opacity of their decision-making processes remains a challenge, raising doubts about their reliability. In response, we present the Multimodal Selective Generation Network (MSGeN), a novel approach to enhancing interpretability and transparency in visual reasoning. MSGeN generates explanations that seamlessly integrate information from multiple modalities, providing a comprehensive and intuitive account of its decisions. The model consists of five collaborative components: (1) the Multimodal Encoder, which encodes and fuses the input data; (2) the Reasoner, which generates stepwise inference states; (3) the Selector, which chooses the explanation modality for each step; (4) the Speaker, which produces natural language descriptions; and (5) the Pointer, which produces visual cues. Together, these components generate explanations enriched with both natural language context and visual grounding. Extensive experiments show that MSGeN surpasses existing multimodal explanation generation models across a range of metrics, including BLEU, METEOR, ROUGE, CIDEr, SPICE, and Grounding. We also present detailed visual examples and practical case studies highlighting MSGeN's ability to generate comprehensive and coherent explanations.
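For intuition, the sketch below shows how the five components described in the abstract might interact in a stepwise selective-generation loop. This is a minimal PyTorch illustration built only from the abstract's description; every class, module, and parameter name (MSGeNSketch, d_model, max_steps, etc.) is an assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MSGeNSketch(nn.Module):
    """Illustrative five-component loop; all internals are assumptions."""

    def __init__(self, d_model=512, vocab_size=10000, max_steps=8):
        super().__init__()
        # (1) Multimodal Encoder: fuses pooled image and question features.
        self.encoder = nn.Linear(2 * d_model, d_model)
        # (2) Reasoner: produces a stepwise inference state.
        self.reasoner = nn.GRUCell(d_model, d_model)
        # (3) Selector: picks the explanation modality for the current step.
        self.selector = nn.Linear(d_model, 2)  # 0 = language, 1 = visual
        # (4) Speaker: scores vocabulary tokens for a language step.
        self.speaker = nn.Linear(d_model, vocab_size)
        # (5) Pointer: scores image regions as a visual cue.
        self.pointer = nn.Linear(d_model, d_model)
        self.max_steps = max_steps

    def forward(self, image_feats, question_feats):
        # image_feats: (1, regions, d_model); question_feats: (1, tokens, d_model)
        fused = self.encoder(
            torch.cat([image_feats.mean(1), question_feats.mean(1)], dim=-1)
        )
        state = torch.zeros_like(fused)
        explanation = []
        for _ in range(self.max_steps):
            state = self.reasoner(fused, state)
            modality = self.selector(state).argmax(-1).item()
            if modality == 0:
                step_out = self.speaker(state)  # token logits for a textual step
            else:
                # Dot-product attention over regions grounds the step visually.
                scores = image_feats @ self.pointer(state).unsqueeze(-1)
                step_out = scores.squeeze(-1)   # region logits for a visual cue
            explanation.append((modality, step_out))
        return explanation


# Usage with dummy features (batch size 1):
model = MSGeNSketch()
img = torch.randn(1, 36, 512)   # e.g., 36 detected image regions
qst = torch.randn(1, 14, 512)   # e.g., 14 question tokens
steps = model(img, qst)
```

The key design point the abstract implies is the per-step modality choice: the Selector routes each inference state either to the Speaker (language) or the Pointer (visual grounding), so the final explanation interleaves text and visual cues rather than emitting one fixed modality.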
Funder:
Science and Technology Commission of Shanghai Municipality; National Key Research and Development Program of China
Subject:
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering