Affiliation:
1. School of Electrical Engineering and Automation, Xiamen University of Technology, Xiamen 361024, China
2. Xiamen Key Laboratory of Frontier Electric Power Equipment and Intelligent Control, Xiamen 361024, China
Abstract
Grounded Visual Question Answering systems place heavy reliance on substantial computational power and data resources in pretraining. In response to this challenge, this paper introduces the LCV2 modular approach, which utilizes a frozen large language model (LLM) to bridge the off-the-shelf generic visual question answering (VQA) module with a generic visual grounding (VG) module. It leverages the generalizable knowledge of these expert models, avoiding the need for any large-scale pretraining. Innovatively, within the LCV2 framework, question and predicted answer pairs are transformed into descriptive and referring captions, enhancing the clarity of the visual cues directed by the question text for the VG module’s grounding. This compensates for the limitations of missing intrinsic text–visual coupling in non-end-to-end frameworks. Comprehensive experiments on benchmark datasets, such as GQA, CLEVR, and VizWiz-VQA-Grounding, were conducted to evaluate the method’s performance and compare it with several baseline methods. In particular, it achieved an IoU F1 score of 59.6% on the GQA dataset and an IoU F1 score of 37.4% on the CLEVR dataset, surpassing some baseline results and demonstrating the LCV2’s competitive performance.
Funder
Science and Technology Program of State Grid East China Branch