Prompt-Enhanced Generation for Multimodal Open Question Answering

Authors:

Cui Chenhao 1, Li Zhoujun 2

Affiliations:

1. School of Cyber Science and Technology, Beihang University, Beijing 100191, China

2. School of Computer Science and Engineering, Beihang University, Beijing 100191, China

Abstract

Multimodal open question answering requires retrieving relevant information from both images and their corresponding texts for a given question and then generating an answer. The quality of the generated answer therefore depends heavily on the quality of the retrieved image–text pairs. Existing methods encode and retrieve images and texts and feed the retrieved results into a language model to generate answers, but they overlook the semantic alignment between images and texts within the information sources, which degrades encoding and retrieval performance. They are also highly dependent on retrieval quality: when retrieval is poor, generation suffers. To address these issues, we propose a prompt-enhanced generation model, PEG. PEG generates supplementary descriptions for images to provide richer material for image–text alignment and applies vision–language joint encoding to improve representation quality and thereby retrieval performance. Contrastive learning strengthens the model's ability to discriminate between relevant and irrelevant information sources. Moreover, we exploit the knowledge stored in pre-trained model parameters through prefix-tuning to generate background knowledge relevant to the question, providing additional input for answer generation and reducing the model's dependence on retrieval performance. Experiments on the WebQA and MultimodalQA datasets demonstrate that our model outperforms baseline models in both retrieval and generation.
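The abstract names two trainable components: a contrastive objective that teaches the retriever to separate relevant from irrelevant image–text sources, and prefix-tuning that elicits question-relevant background knowledge from a frozen pre-trained generator. The paper's implementation details are not reproduced on this page, so the two sketches below only illustrate common formulations of these ideas; every name in them (contrastive_retrieval_loss, q_emb, src_emb, PrefixKnowledgeGenerator, prefix_len, the t5-small backbone) is an assumption for illustration, not the authors' code. Both sketches assume PyTorch; the second additionally assumes the Hugging Face Transformers library.

A minimal in-batch contrastive (InfoNCE-style) loss between question embeddings and joint image–text source embeddings:

import torch
import torch.nn.functional as F

def contrastive_retrieval_loss(q_emb, src_emb, temperature=0.07):
    # q_emb, src_emb: (batch, dim); row i of src_emb is the positive source
    # for question i, and the other rows in the batch serve as negatives.
    q = F.normalize(q_emb, dim=-1)
    s = F.normalize(src_emb, dim=-1)
    logits = q @ s.t() / temperature              # (batch, batch) similarities
    targets = torch.arange(q.size(0), device=q.device)
    # Symmetric objective: question-to-source and source-to-question.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random vectors standing in for encoder outputs.
print(contrastive_retrieval_loss(torch.randn(8, 256), torch.randn(8, 256)).item())

And a simplified prefix-style generator for background knowledge: trainable prefix embeddings are prepended to the input of a frozen seq2seq backbone, so only the prefix parameters are updated. Full prefix-tuning injects prefixes into every attention layer; this input-level variant only keeps the core idea compact.

import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

class PrefixKnowledgeGenerator(nn.Module):
    def __init__(self, model_name="t5-small", prefix_len=20):
        super().__init__()
        self.backbone = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        for p in self.backbone.parameters():
            p.requires_grad = False               # backbone stays frozen
        hidden = self.backbone.config.d_model
        self.prefix = nn.Parameter(torch.randn(prefix_len, hidden) * 0.02)

    def forward(self, input_ids, attention_mask, labels=None):
        embeds = self.backbone.get_input_embeddings()(input_ids)
        bsz = embeds.size(0)
        prefix = self.prefix.unsqueeze(0).expand(bsz, -1, -1)
        embeds = torch.cat([prefix, embeds], dim=1)
        prefix_mask = attention_mask.new_ones(bsz, self.prefix.size(0))
        mask = torch.cat([prefix_mask, attention_mask], dim=1)
        return self.backbone(inputs_embeds=embeds, attention_mask=mask,
                             labels=labels)

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = PrefixKnowledgeGenerator()
inputs = tokenizer(["question: who designed the Eiffel Tower?"],
                   return_tensors="pt")
targets = tokenizer(["Gustave Eiffel's engineering company designed it."],
                    return_tensors="pt")
print(model(inputs.input_ids, inputs.attention_mask,
            labels=targets.input_ids).loss.item())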

Funder

National Natural Science Foundation of China

Fund of the State Key Laboratory of Software Development Environment

Publisher

MDPI AG

References (38 articles)

1. Hannan, D., Jain, A., and Bansal, M. (2020, February 7–12). ManyModalQA: Modality Disambiguation and QA over Diverse Inputs. Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA.

2. Reddy, R.G., Rui, X., Li, M., Lin, X., Wen, H., Cho, J., Huang, L., Bansal, M., Sil, A., and Chang, S.F. (2022, February 22–March 1). MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, Virtual Event.

3. Chang, Y., Narang, M., Suzuki, H., Cao, G., Gao, J., and Bisk, Y. (2022, June 18–24). WebQA: Multihop and Multimodal QA. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA.

4. Talmor, A., Yoran, O., Catav, A., Lahav, D., Wang, Y., Asai, A., Ilharco, G., Hajishirzi, H., and Berant, J. (2021, May 3–7). MultiModalQA: Complex Question Answering over Text, Tables and Images. Proceedings of the International Conference on Learning Representations, Vienna, Austria.

5. Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J.J., and Gao, J. (2020). Unified Vision-Language Pre-Training for Image Captioning and VQA. Proceedings of the AAAI Conference on Artificial Intelligence.
