LCV2: A Universal Pretraining-Free Framework for Grounded Visual Question Answering

Published:2024-05-25 Issue:11 Volume:13 Page:2061
ISSN:2079-9292
Container-title:Electronics
language:en
Short-container-title:Electronics

Author:

Chen Yuhan¹, Su Lumei¹,², Chen Lihua¹, Lin Zhiwei¹

Affiliation:

1. School of Electrical Engineering and Automation, Xiamen University of Technology, Xiamen 361024, China

2. Xiamen Key Laboratory of Frontier Electric Power Equipment and Intelligent Control, Xiamen 361024, China

Abstract

Grounded Visual Question Answering systems rely heavily on substantial computational power and data resources for pretraining. To address this challenge, this paper introduces LCV2, a modular approach that uses a frozen large language model (LLM) to bridge an off-the-shelf generic visual question answering (VQA) module and a generic visual grounding (VG) module. It leverages the generalizable knowledge of these expert models and avoids any large-scale pretraining. Within the LCV2 framework, question and predicted-answer pairs are transformed into descriptive and referring captions, which clarify the visual cues that the question text gives the VG module for grounding. This compensates for the missing intrinsic text–visual coupling in non-end-to-end frameworks. Comprehensive experiments on benchmark datasets, including GQA, CLEVR, and VizWiz-VQA-Grounding, were conducted to evaluate the method's performance and compare it with several baseline methods. In particular, LCV2 achieved an IoU F1 score of 59.6% on the GQA dataset and 37.4% on the CLEVR dataset, surpassing some baseline results and demonstrating competitive performance.
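The three-stage flow described in the abstract (VQA module, LLM caption rewriting, VG module) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `grounded_vqa`, `GroundedAnswer`, and `box_iou`, the callable signatures, and the `(x1, y1, x2, y2)` box format are all assumptions made for the example. The abstract also does not say how the IoU F1 score is computed, so `box_iou` shows only the basic overlap measure between two boxes.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class GroundedAnswer:
    answer: str    # text answer predicted by the VQA module
    caption: str   # referring caption produced by the frozen LLM
    box: Box       # region returned by the VG module


def grounded_vqa(
    image: Any,
    question: str,
    vqa: Callable[[Any, str], str],
    rewrite: Callable[[str, str], str],
    ground: Callable[[Any, str], Box],
) -> GroundedAnswer:
    """Chain three frozen modules; none of them is trained here."""
    answer = vqa(image, question)          # 1. answer the question
    caption = rewrite(question, answer)    # 2. turn the QA pair into a caption
    box = ground(image, caption)           # 3. ground the caption in the image
    return GroundedAnswer(answer, caption, box)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Toy stand-ins for the real modules (hypothetical, for illustration only).
    result = grounded_vqa(
        image=None,
        question="What color is the cup?",
        vqa=lambda img, q: "red",
        rewrite=lambda q, a: f"the {a} cup",
        ground=lambda img, cap: (10.0, 10.0, 50.0, 50.0),
    )
    print(result.caption, result.box)
    print(box_iou(result.box, (30.0, 30.0, 70.0, 70.0)))
```

Passing the modules as plain callables mirrors the modular, pretraining-free idea: any VQA model, LLM, or grounding model could be substituted without changing the glue code.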

Funder

Science and Technology Program of State Grid East China Branch

Publisher

MDPI AG

Link

https://www.mdpi.com/2079-9292/13/11/2061/pdf
