Efficiently Gluing Pre-trained Language and Vision Models for Image Captioning

Published: 2024-07-29
ISSN: 2157-6904
Container-title: ACM Transactions on Intelligent Systems and Technology
Language: en
Short-container-title: ACM Trans. Intell. Syst. Technol.

Author:

Song Peipei¹, Zhou Yuanen², Yang Xun¹, Liu Daqing³, Hu Zhenzhen⁴, Wang Depeng⁴, Wang Meng⁵

Affiliation:

1. University of Science and Technology of China, China

2. Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China

3. JD Explore Academy, China

4. Hefei University of Technology, China

5. Hefei University of Technology and Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, China

Abstract

Vision-and-language pre-training models have achieved impressive performance on image captioning, but most of them are trained on millions of paired image-text samples and require large memory and computing overhead. To alleviate this, we build on large-scale pre-trained language models (PLMs) and pre-trained vision models (PVMs) and efficiently connect them for image captioning. There are two major challenges: language and vision modalities have different semantic granularity (e.g., a noun may cover many pixels), and a semantic gap still exists between the pre-trained language and vision models. To this end, we design a lightweight and efficient connector to glue the PVM and PLM, which follows a selection-then-transformation criterion. Specifically, in the selection phase, we treat each image as a set of patches instead of pixels. We select salient image patches and cluster them into visual regions to align with text. Then, to effectively reduce the semantic gap, we propose to map the selected image patches into the text space through spatial and channel transformations. Trained on image captioning datasets, the connector learns to bridge the semantic granularity and semantic gap via backpropagation, so that the PLM can generate descriptions. Experimental results on the MSCOCO and Flickr30k datasets demonstrate that our method yields performance comparable to existing works. By training only the small connector, we achieve a CIDEr score of 132.2% on the MSCOCO Karpathy test split. Moreover, our findings reveal that fine-tuning the PLM can further improve performance, resulting in a CIDEr score of 140.6%. Code and models are available at https://github.com/YuanEZhou/PrefixCap.
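
The selection-then-transformation idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). Everything below is an assumption made for illustration: the function names, the random saliency scores standing in for whatever signal the PVM provides, plain k-means as the clustering step, and the two random matrices `w_spatial` and `w_channel` standing in for transformations that the real connector would learn via backpropagation.

```python
import numpy as np

def select_salient(patches, saliency, k):
    """Keep the k patches with the highest saliency score (assumed scoring)."""
    idx = np.argsort(saliency)[::-1][:k]
    return patches[idx]

def cluster_regions(patches, n_regions, n_iter=10):
    """Group patches into n_regions mean-pooled regions with plain k-means.

    k-means is a stand-in here; the abstract does not say which
    clustering method the connector uses.
    """
    centers = patches[:n_regions].copy()  # deterministic init: first n_regions patches
    for _ in range(n_iter):
        # Squared distances between every patch and every center.
        dist = ((patches[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dist.argmin(axis=1)
        for r in range(n_regions):
            members = patches[assign == r]
            if len(members) > 0:
                centers[r] = members.mean(axis=0)
    return centers

def transform(regions, w_spatial, w_channel):
    """Spatial transform mixes across regions; channel transform maps
    visual features to the (assumed) PLM embedding width."""
    mixed = w_spatial @ regions   # (n_prefix, n_regions) @ (n_regions, d_vis)
    return mixed @ w_channel      # (n_prefix, d_vis) @ (d_vis, d_txt)

# Hypothetical sizes, chosen only for the demonstration.
rng = np.random.default_rng(0)
num_patches, d_vis, d_txt = 196, 64, 32
k, n_regions, n_prefix = 49, 8, 4

patches = rng.normal(size=(num_patches, d_vis))     # stand-in PVM patch features
saliency = rng.random(num_patches)                  # stand-in saliency scores
w_spatial = rng.normal(size=(n_prefix, n_regions))  # would be learned in practice
w_channel = rng.normal(size=(d_vis, d_txt))         # would be learned in practice

selected = select_salient(patches, saliency, k)
regions = cluster_regions(selected, n_regions)
prefix = transform(regions, w_spatial, w_channel)
print(selected.shape, regions.shape, prefix.shape)
```

Under these assumptions, the resulting `prefix` array has one row per output token and one column per text-embedding dimension, which is the shape a language model would need if the visual tokens were fed to it alongside word embeddings.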

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3682067
