Can Linguistic Knowledge Improve Multimodal Alignment in Vision-Language Pretraining?

Authors:

Fei Wang¹, Liang Ding², Jun Rao³, Ye Liu⁴, Li Shen⁵, Changxing Ding⁶

Affiliations:

1. South China University of Technology, China and JD Explore Academy, China

2. University of Sydney, Australia

3. Harbin Institute of Technology, China

4. South China University of Technology, China

5. JD Explore Academy, China

6. School of Electronic and Information Engineering, South China University of Technology, China and Pazhou Lab, China

Abstract

The field of multimedia research has witnessed significant interest in leveraging multimodal pretrained neural network models to perceive and represent the physical world. Among these models, vision-language pretraining (VLP) has emerged as a captivating topic. Currently, the prevalent approach in VLP supervises training with paired image-text data. However, limited effort has been devoted to extracting essential linguistic knowledge, such as semantics and syntax, during VLP, or to understanding its impact on multimodal alignment. In response, our study sheds light on the influence of comprehensive linguistic knowledge, encompassing semantic expression and syntactic structure, on multimodal alignment. To achieve this, we introduce SNARE, a large-scale multimodal alignment probing benchmark designed specifically to detect vital linguistic components, including lexical, semantic, and syntactic knowledge. SNARE offers four distinct tasks: Semantic Structure, Negation Logic, Attribute Ownership, and Relationship Composition. Leveraging SNARE, we conduct holistic analyses of six advanced VLP models (BLIP, CLIP, FLAVA, X-VLM, BLIP-2, and GPT-4), along with human performance, revealing key characteristics of VLP models: i) insensitivity to complex syntactic structures, relying primarily on content words for sentence comprehension; ii) limited comprehension of sentence combinations and negations; iii) difficulty determining actions or spatial relations within visual information, as well as in verifying the correctness of ternary relationships. Based on these findings, we propose the following strategies to enhance multimodal alignment in VLP: 1) use a large generative language model as the language backbone in VLP to facilitate the understanding of complex sentences; 2) build high-quality datasets that emphasize content words and employ simple syntax, such as short-distance semantic composition, to improve multimodal alignment; 3) incorporate more fine-grained visual knowledge, such as spatial relationships, into the pretraining objectives.
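
For intuition, the following is a minimal sketch of the kind of pairwise image-text alignment probing such a benchmark performs, here using CLIP through the HuggingFace transformers library. The image file and the caption pair (an original caption versus a negated distractor) are illustrative placeholders, not SNARE data; the actual task construction and prompts follow the paper.

```python
# Minimal sketch: probe whether a VLP model (here CLIP) scores the correct
# caption above a minimally perturbed distractor (e.g., a negated variant).
# The image path and captions are illustrative placeholders, not SNARE data.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

image = Image.open("example.jpg")  # placeholder image
candidates = [
    "a dog is sitting on the sofa",      # original caption (index 0)
    "a dog is not sitting on the sofa",  # negated distractor (index 1)
]

inputs = processor(text=candidates, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # shape (1, 2): image-text similarity

# Count a "hit" when the model ranks the correct caption higher than the distractor.
predicted = logits_per_image.argmax(dim=-1).item()
print("model prefers:", candidates[predicted])
```

Aggregating such hits over many caption pairs, grouped by perturbation type, is the general shape of the four probing tasks listed in the abstract.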

Publisher

Association for Computing Machinery (ACM)

