Affiliation:
1. School of Computer and Information Engineering, Henan University, Kaifeng 475001, China
Abstract
Language bias is a significant concern in visual question answering (VQA): models tend to rely on spurious correlations between questions and answers when making predictions, which prevents them from generalizing effectively and degrades performance. To address this bias, we propose a novel modality fusion collaborative de-biasing algorithm (CoD). In our approach, bias is viewed as the model's neglect of information from a particular modality during prediction. We employ a collaborative training scheme in which the modalities model one another, achieving efficient feature fusion and enabling the model to fully exploit multimodal knowledge for prediction. Experiments on the VQA-CP v2, VQA v2, and VQA-VS datasets, under different validation strategies, demonstrate the effectiveness of our approach. Notably, with a basic baseline model, our method reaches an accuracy of 60.14% on VQA-CP v2.
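The abstract describes CoD only at a high level, and the paper's actual architecture and loss are not reproduced here. As a rough illustration of the general idea (unimodal branches trained collaboratively alongside a fused branch, so that the fused prediction cannot neglect either modality), below is a minimal PyTorch-style sketch. All names (CoDHead, cod_loss), the choice of binary cross-entropy plus a KL mutual-modeling term, and the weights lambda_q/lambda_v are illustrative assumptions, not the authors' published implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CoDHead(nn.Module):
    """Three answer heads over precomputed question/image features:
    two unimodal branches and one fused branch (illustrative sketch)."""
    def __init__(self, q_dim, v_dim, hidden, num_answers):
        super().__init__()
        self.q_cls = nn.Linear(q_dim, num_answers)   # question-only branch
        self.v_cls = nn.Linear(v_dim, num_answers)   # vision-only branch
        self.fuse = nn.Sequential(                   # joint fusion branch
            nn.Linear(q_dim + v_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_answers),
        )

    def forward(self, q_feat, v_feat):
        logits_q = self.q_cls(q_feat)
        logits_v = self.v_cls(v_feat)
        logits_f = self.fuse(torch.cat([q_feat, v_feat], dim=-1))
        return logits_q, logits_v, logits_f

def cod_loss(logits_q, logits_v, logits_f, target, lambda_q=0.5, lambda_v=0.5):
    # Supervise every branch with the usual multi-label VQA objective
    # (soft answer scores with binary cross-entropy).
    bce = F.binary_cross_entropy_with_logits
    loss = bce(logits_f, target)
    loss = loss + lambda_q * bce(logits_q, target) + lambda_v * bce(logits_v, target)
    # Mutual-modeling term (assumed): pull each unimodal answer
    # distribution toward the fused one, with the fused branch
    # detached so that it acts as the teacher.
    p_fused = F.softmax(logits_f, dim=-1).detach()
    for uni in (logits_q, logits_v):
        loss = loss + F.kl_div(F.log_softmax(uni, dim=-1), p_fused,
                               reduction="batchmean")
    return loss

# Usage sketch (feature and answer-vocabulary sizes are typical values,
# not taken from the paper):
head = CoDHead(q_dim=1024, v_dim=2048, hidden=1024, num_answers=3129)
q, v = torch.randn(32, 1024), torch.randn(32, 2048)
target = torch.rand(32, 3129)  # soft VQA answer scores in [0, 1]
loss = cod_loss(*head(q, v), target)
loss.backward()

Detaching the fused distribution treats it as the teacher for the unimodal heads; reversing the direction of the KL term, or letting gradients flow both ways, gives different collaborative-training variants.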
Funder
Key Scientific and Technological Project of Henan Province of China
Cited by
1 article.