HCCL: Hierarchical Counterfactual Contrastive Learning for Robust Visual Question Answering

Authors:

Hao Dongze¹, Wang Qunbo², Zhu Xinxin², Liu Jing¹

Affiliation:

1. Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China and School of Artificial Intelligence, University of Chinese Academy of Sciences, China

2. Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, China

Abstract

Although most state-of-the-art models achieve impressive performance in visual question answering (VQA), they often rely on dataset biases to answer questions. Recently, some studies have synthesized counterfactual training samples to help models mitigate these biases. However, the synthetic samples require extra annotations and often contain noise. Moreover, these methods simply add the synthetic samples to the training data and train the model with the cross-entropy loss, which does not make the best use of the synthetic samples for bias mitigation. In this paper, to mitigate the biases in VQA more effectively, we propose a Hierarchical Counterfactual Contrastive Learning (HCCL) method. First, to avoid introducing noise and extra annotations, our method automatically masks the unimportant features in the original question-image pair to obtain positive samples and creates mismatched question-image pairs as negative samples. Then, our method uses feature-level and answer-level contrastive learning to pull the original sample toward the positive samples in the feature space, while pushing it away from the negative samples in both the feature and answer spaces. In this way, the VQA model learns robust multi-modal features and attends to both visual and language information when producing the answer. HCCL can be adopted with different baselines, and experimental results on the VQA v2, VQA-CP, and GQA-OOD datasets show that our method effectively mitigates biases in VQA and improves the robustness of VQA models.
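The abstract does not include an implementation, but the feature-level contrastive objective it describes maps naturally onto an InfoNCE-style loss. The sketch below is a minimal, hypothetical illustration, not the authors' code: the anchor is the fused feature of the original question-image pair, the positive is the same pair with unimportant features masked (the masking step itself is omitted here), and the negatives are fused features of mismatched question-image pairs. All names, tensor shapes, and the temperature value are assumptions for illustration only.

```python
# Hypothetical sketch of the feature-level contrastive objective described
# in the abstract; not the authors' implementation. Names, shapes, and the
# temperature are assumptions.
import torch
import torch.nn.functional as F

def feature_level_contrastive_loss(anchor, positive, negatives, temperature=0.07):
    """InfoNCE-style loss: pull the original (anchor) fused feature toward
    its masked-feature positive and push it away from mismatched
    question-image negatives.

    anchor:    (B, D)    fused feature of the original question-image pair
    positive:  (B, D)    fused feature with unimportant features masked out
    negatives: (B, K, D) fused features of K mismatched pairs per anchor
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    # Cosine similarities scaled by the temperature.
    pos_sim = torch.sum(anchor * positive, dim=-1, keepdim=True) / temperature  # (B, 1)
    neg_sim = torch.einsum('bd,bkd->bk', anchor, negatives) / temperature       # (B, K)

    # The positive sits at index 0 of each row of logits, so the target
    # class for every anchor is 0.
    logits = torch.cat([pos_sim, neg_sim], dim=1)                               # (B, 1+K)
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)
```

In the paper's framing, an answer-level term would additionally push the model's answer predictions for the mismatched negatives away from the ground-truth answer, alongside the standard VQA cross-entropy loss; how the two levels are combined and weighted is not specified in the abstract.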

Publisher

Association for Computing Machinery (ACM)

