MMICT: Boosting Multi-Modal Fine-Tuning with In-Context Examples

Authors:

Tao Chen¹, Enwei Zhang², Yuting Gao², Ke Li², Xing Sun², Yan Zhang¹, Hui Li¹, Rongrong Ji¹

Affiliation:

1. Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China

2. Tencent Youtu Lab, China

Abstract

Although In-Context Learning (ICL) brings remarkable performance gains to Large Language Models (LLMs), the improvements remain below those of fine-tuning on downstream tasks. This paper introduces Multi-Modal In-Context Tuning (MMICT), a novel fine-tuning paradigm that boosts multi-modal fine-tuning by fully leveraging the promising ICL capability of multi-modal LLMs (MM-LLMs). We propose the Multi-Modal Hub (M-Hub), a unified module that captures various multi-modal features according to different inputs and objectives. Built on M-Hub, MMICT enables MM-LLMs to learn from in-context visual-guided textual features and subsequently to generate outputs conditioned on textual-guided visual features. Moreover, leveraging the flexibility of M-Hub, we design a variety of in-context demonstrations. Extensive experiments on a diverse range of downstream multi-modal tasks show that MMICT significantly outperforms both the traditional fine-tuning strategy and the vanilla ICT method, which directly takes the concatenation of all information from different modalities as input. Our implementation is available at: https://github.com/KDEGroup/MMICT.
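To make the paradigm concrete, below is a minimal, hypothetical PyTorch sketch of the two feature directions the abstract describes: demonstrations contribute visual-guided textual features, while the query contributes textual-guided visual features, and the two are concatenated into one sequence for the LLM. All names here (MHubSketch, build_icl_sequence) and the choice of cross-attention as the fusion mechanism are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

```python
# Hypothetical sketch of multi-modal in-context tuning (not the authors'
# implementation). A single "M-Hub"-like module fuses features in either
# direction; demonstration pairs and the query are then concatenated
# into one soft-prompt sequence for a (typically frozen) LLM.

import torch
import torch.nn as nn


class MHubSketch(nn.Module):
    """Illustrative stand-in for M-Hub: one cross-attention block that
    conditions query-side features on guide-side features."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feats: torch.Tensor, guide_feats: torch.Tensor) -> torch.Tensor:
        # query_feats attends to guide_feats, so the output is
        # "guide-conditioned" query features (e.g. visual-guided text
        # when text attends to vision, and vice versa).
        fused, _ = self.cross_attn(query_feats, guide_feats, guide_feats)
        return fused


def build_icl_sequence(demo_pairs, query_visual, query_text, m_hub):
    """Concatenate visual-guided textual features of each demonstration
    with textual-guided visual features of the query; the result is the
    input sequence (soft prompt) handed to the LLM."""
    parts = [m_hub(txt, vis) for vis, txt in demo_pairs]  # visual-guided text
    parts.append(m_hub(query_visual, query_text))         # textual-guided vision
    return torch.cat(parts, dim=1)


if __name__ == "__main__":
    dim = 512
    m_hub = MHubSketch(dim)
    # Two demonstrations, each a (visual tokens, text tokens) pair, batch size 1.
    demos = [(torch.randn(1, 32, dim), torch.randn(1, 16, dim)) for _ in range(2)]
    query_vis = torch.randn(1, 32, dim)   # query image features
    query_txt = torch.randn(1, 8, dim)    # e.g. instruction-token features
    seq = build_icl_sequence(demos, query_vis, query_txt, m_hub)
    print(seq.shape)  # torch.Size([1, 64, 512]): 2 * 16 demo tokens + 32 query tokens
```

In this reading, each demonstration is compressed into a compact text-side prompt while the query contributes a vision-side prompt, which is what would distinguish the design from the vanilla ICT baseline the abstract mentions, i.e. simply concatenating the raw features of every modality.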

Publisher

Association for Computing Machinery (ACM)

