Affiliation:
1. South China University of Technology, China and JD Explore Academy, China
2. University of Sydney, Australia
3. Harbin Institute of Technology, China
4. South China University of Technology, China
5. JD Explore Academy, China
6. School of Electronic and Information Engineering, China and Pazhou Lab, China
Abstract
The field of multimedia research has witnessed significant interest in leveraging multimodal pretrained neural network models to perceive and represent the physical world. Among these models, vision-language pretraining (VLP) has emerged as a captivating topic. The prevalent approach in VLP supervises the training process with paired image-text data. However, limited effort has been devoted to exploring whether essential linguistic knowledge, such as semantics and syntax, is extracted during VLP and how it affects multimodal alignment. In response, our study aims to shed light on the influence of comprehensive linguistic knowledge, encompassing semantic expression and syntactic structure, on multimodal alignment. To achieve this, we introduce SNARE, a large-scale multimodal alignment probing benchmark designed specifically for the detection of vital linguistic components, including lexical, semantic, and syntactic knowledge. SNARE offers four distinct tasks: Semantic Structure, Negation Logic, Attribute Ownership, and Relationship Composition. Leveraging SNARE, we conduct holistic analyses of six advanced VLP models (BLIP, CLIP, Flava, X-VLM, BLIP2, and GPT-4), along with human performance, revealing key characteristics of these VLP models:
i) Insensitivity to complex syntactic structures, relying primarily on content words for sentence comprehension.
ii) Limited comprehension of sentence combinations and negations.
iii) Difficulty in determining actions or spatial relations within visual information, as well as in verifying the correctness of ternary relationships.

Based on these findings, we propose the following strategies to enhance multimodal alignment in VLP: 1) Use a large generative language model as the language backbone in VLP to facilitate the understanding of complex sentences. 2) Build high-quality datasets that emphasize content words and use simple syntax, such as short-distance semantic composition, to improve multimodal alignment. 3) Incorporate more fine-grained visual knowledge, such as spatial relationships, into the pretraining objectives.
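As a rough illustration of this style of probing (this is not the paper's released SNARE code; the checkpoint name, image path, and captions below are placeholders), one can score an image against a correct caption and its negation- or attribute-perturbed variants with an off-the-shelf contrastive VLP model and check which caption the model ranks highest:

```python
# Minimal probing sketch using Hugging Face transformers and CLIP.
# Assumptions: `transformers`, `torch`, and `Pillow` are installed; the image
# path and captions are illustrative placeholders, not SNARE data.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image
captions = [
    "a red cup on a wooden table",        # correct caption
    "a red cup that is not on a table",   # negation probe
    "a wooden cup on a red table",        # attribute-swap probe
]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits_per_image = model(**inputs).logits_per_image  # image-text similarity scores

# The caption with the highest similarity is the model's choice.
predicted = logits_per_image.argmax(dim=-1).item()
print("chosen caption:", captions[predicted])
```

A model that is insensitive to negation or attribute ownership will assign similar scores to all three captions, which is the failure mode the Negation Logic and Attribute Ownership tasks are designed to expose.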
Publisher
Association for Computing Machinery (ACM)