TOMGPT: Reliable Text-Only Training Approach for Cost-Effective Multi-modal Large Language Model

Author:

Chen Yunkai (1), Wang Qimeng (2), Wu Shiwei (1), Gao Yan (2), Xu Tong (1), Hu Yao (2)

Affiliation:

1. University of Science and Technology of China, Hefei, China

2. Xiaohongshu Inc., Beijing, China

Abstract

Multi-modal large language models (MLLMs), such as GPT-4, exhibit strong comprehension of human instructions as well as zero-shot ability on new downstream multi-modal tasks. To integrate different modalities within a unified embedding space, previous MLLMs conduct visual instruction tuning with massive, high-quality image-text pair data, which incurs substantial costs in data collection and training resources. In this article, we propose TOMGPT (Text-Only training Multi-modal GPT), a cost-effective MLLM tuned solely on easily accessible text data with far fewer resources. Building on a pre-trained, visual-linguistic coupled modality space (e.g., that of the CLIP and ALIGN models), a text-only training strategy is devised to further project the aligned multi-modal latent space to that of the LLM, endowing the LLM with visual comprehension capabilities in an efficient manner. Instead of the enormous image-text training data required by previous MLLMs, we find that TOMGPT can be well tuned with a smaller yet diverse set of GPT-generated free-form text data, because we establish the semantic connection between the LLM and the pre-trained vision-language model. A quantitative evaluation is conducted on both MME and LVLM, two recently released and extensively used MLLM benchmarks. The experiments reveal that TOMGPT achieves reliable performance compared with numerous models trained on large amounts of image-text pair data. Case studies are also presented, demonstrating TOMGPT’s broad understanding and dialogue capabilities across diverse image categories.
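
The core idea in the abstract, training a projection on text alone and then applying it to images through a shared vision-language space, can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the projector architecture or training objective, so the linear projector, the squared-error loss, the random stand-in embeddings, and the small-noise model of image-text alignment below are all assumptions made purely for illustration.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def train_projector(inputs, targets, lr=0.1, epochs=200):
    """Fit a linear map by SGD on squared error, using text-only pairs.

    ASSUMPTION: a plain linear projector and squared-error loss; the paper's
    actual projector and objective are not described in the abstract.
    """
    d_in, d_out = len(inputs[0]), len(targets[0])
    W = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(epochs):
        for x, y in zip(inputs, targets):
            pred = matvec(W, x)
            for i in range(d_out):
                err = pred[i] - y[i]
                for j in range(d_in):
                    W[i][j] -= lr * err * x[j]
    return W


rng = random.Random(0)
D_SHARED, D_LLM, N_TEXTS = 4, 3, 20

# Hypothetical stand-ins for text embeddings from a CLIP-like shared space.
text_embs = [[rng.uniform(-1, 1) for _ in range(D_SHARED)] for _ in range(N_TEXTS)]

# Hypothetical "ground-truth" map into the LLM space, used only to build toy targets.
A = [[rng.uniform(-1, 1) for _ in range(D_SHARED)] for _ in range(D_LLM)]
llm_targets = [matvec(A, x) for x in text_embs]

# Training sees text embeddings only; no image is involved.
W = train_projector(text_embs, llm_targets)

# At inference, an image embedding is ASSUMED to lie close to the embedding of
# its caption, which is what a pre-aligned vision-language space is meant to provide.
caption_emb = [rng.uniform(-1, 1) for _ in range(D_SHARED)]
image_emb = [c + rng.uniform(-0.01, 0.01) for c in caption_emb]

projected_image = matvec(W, image_emb)
caption_target = matvec(A, caption_emb)
gap = max(abs(p - t) for p, t in zip(projected_image, caption_target))
print(f"max deviation between projected image and caption target: {gap:.4f}")
```

In this toy setting the projector never sees an image, yet the projected image embedding lands near the target of its caption, because the two inputs are close in the shared space. How well this transfers in practice depends on how tightly the pre-trained model actually aligns the two modalities, which the sketch does not model.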

Funder

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Cited by 2 articles.

1. Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval;ACM Transactions on Multimedia Computing, Communications, and Applications;2024-09-12

2. Adversarial attacks and defenses for large language models (LLMs): methods, frameworks & challenges;International Journal of Multimedia Information Retrieval;2024-06-25
