Prompt-Enhanced Generation for Multimodal Open Question Answering
Published: 2024-04-10
Issue: 8
Volume: 13
Page: 1434
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics
Author:
Cui Chenhao 1, Li Zhoujun 2
Affiliation:
1. School of Cyber Science and Technology, Beihang University, Beijing 100191, China
2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China
Abstract
Multimodal open question answering involves retrieving relevant information from images and their corresponding texts given a question and then generating the answer. The quality of the generated answer depends heavily on the quality of the retrieved image–text pairs. Existing methods encode and retrieve images and texts, then feed the retrieved results into a language model to generate answers. These methods overlook the semantic alignment of image–text pairs within the information source, which degrades both encoding and retrieval performance. Furthermore, they depend heavily on retrieval quality, and poor retrieval leads to poor generation. To address these issues, we propose a prompt-enhanced generation model, PEG, which generates supplementary descriptions for images to provide richer material for image–text alignment and employs vision–language joint encoding to improve encoding quality and thereby enhance retrieval performance. Contrastive learning is used to strengthen the model's ability to discriminate between relevant and irrelevant information sources. Moreover, we further exploit the knowledge stored in pre-trained model parameters through prefix-tuning to generate background knowledge relevant to the question, offering additional input for answer generation and reducing the model's dependence on retrieval performance. Experiments on the WebQA and MultimodalQA datasets demonstrate that our model outperforms baseline models in both retrieval and generation performance.
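Below is a minimal, hypothetical sketch of the contrastive retrieval objective described in the abstract: a pooled question representation is scored against jointly encoded image–text candidates so that the relevant source ranks above irrelevant ones. This is not the authors' released code; the function name, tensor shapes, and temperature value are assumptions for illustration, and the random tensors stand in for outputs of the vision–language joint encoder only to show the expected shapes.

```python
# Hypothetical InfoNCE-style contrastive loss for discriminating relevant
# from irrelevant image–text sources (a sketch, not the PEG implementation).
import torch
import torch.nn.functional as F


def contrastive_retrieval_loss(question_emb: torch.Tensor,
                               source_embs: torch.Tensor,
                               positive_idx: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """question_emb: (B, D) pooled question representations.
    source_embs:  (B, N, D) pooled joint image–text representations,
                  N candidate sources per question.
    positive_idx: (B,) index of the relevant source for each question.
    """
    q = F.normalize(question_emb, dim=-1)                      # (B, D)
    s = F.normalize(source_embs, dim=-1)                       # (B, N, D)
    # Cosine similarity between each question and its N candidates.
    logits = torch.einsum("bd,bnd->bn", q, s) / temperature    # (B, N)
    # Cross-entropy pushes the relevant source above the irrelevant ones.
    return F.cross_entropy(logits, positive_idx)


# Shape-only usage example with random stand-in embeddings.
B, N, D = 4, 8, 256
q = torch.randn(B, D, requires_grad=True)
s = torch.randn(B, N, D, requires_grad=True)
labels = torch.randint(0, N, (B,))
loss = contrastive_retrieval_loss(q, s, labels)
loss.backward()
print(loss.item())
```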
Funder
National Natural Science Foundation of China; Fund of the State Key Laboratory of Software Development Environment
References (38 articles):
1. Hannan, D., Jain, A., and Bansal, M. (2020, January 7–12). ManyModalQA: Modality disambiguation and QA over diverse inputs. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.
2. Reddy, R.G., Rui, X., Li, M., Lin, X., Wen, H., Cho, J., Huang, L., Bansal, M., Sil, A., and Chang, S.F. (2022, February 22–March 1). MuMuQA: Multimedia multi-hop news question answering via cross-media knowledge extraction and grounding. Proceedings of the AAAI Conference on Artificial Intelligence, Virtually.
3. Chang, Y., Narang, M., Suzuki, H., Cao, G., Gao, J., and Bisk, Y. (2022, June 18–24). WebQA: Multihop and Multimodal QA. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.
4. Talmor, A., Yoran, O., Catav, A., Lahav, D., Wang, Y., Asai, A., Ilharco, G., Hajishirzi, H., and Berant, J. (2021, January 4). MultiModalQA: Complex question answering over text, tables and images. Proceedings of the International Conference on Learning Representations, Vienna, Austria.
5. Zhou et al. Unified Vision-Language Pre-Training for Image Captioning and VQA. Proc. AAAI Conf. Artif. Intell., 2020.