Towards Retrieval-Augmented Architectures for Image Captioning

Published: 2024-06-12 Issue: 8 Volume: 20 Page: 1-22
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Sarto Sara¹, Cornia Marcella², Baraldi Lorenzo¹, Nicolosi Alessandro³, Cucchiara Rita¹

Affiliation:

1. Department of Engineering "Enzo Ferrari", University of Modena and Reggio Emilia, Modena, Italy

2. Department of Education and Humanities, University of Modena and Reggio Emilia, Modena, Italy

3. Leonardo SpA, Roma, Italy

Abstract

The objective of image captioning models is to bridge the gap between the visual and linguistic modalities by generating natural language descriptions that accurately reflect the content of input images. In recent years, researchers have leveraged deep learning-based models and made advances in the extraction of visual features and the design of multimodal connections to tackle this task. This work presents a novel approach toward developing image captioning models that utilize an external kNN memory to improve the generation process. Specifically, we propose two model variants that incorporate a knowledge retriever component that is based on visual similarities, a differentiable encoder to represent input images, and a kNN-augmented language model to predict tokens based on contextual cues and text retrieved from the external memory. We experimentally validate our approach on COCO and nocaps datasets and demonstrate that incorporating an explicit external memory can significantly enhance the quality of captions, especially with a larger retrieval corpus. This work provides valuable insights into retrieval-augmented captioning models and opens up new avenues for improving image captioning at a larger scale.
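
As a rough illustration of the retrieval step mentioned in the abstract (a knowledge retriever based on visual similarities that returns text from an external memory), the following minimal sketch ranks stored captions by the cosine similarity of their image embeddings to a query embedding. It is not the authors' implementation: the toy three-dimensional embeddings, the captions, and the names `cosine_similarity` and `retrieve_captions` are all assumptions made for this example, and the sketch does not cover how the retrieved text is consumed by the kNN-augmented language model.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical external memory: (image embedding, caption) pairs.
# Real systems would use high-dimensional features from a visual encoder.
Memory = List[Tuple[Sequence[float], str]]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_captions(query: Sequence[float], memory: Memory, k: int = 2) -> List[str]:
    """Return the captions of the k memory entries most similar to the query."""
    ranked = sorted(
        memory,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    return [caption for _, caption in ranked[:k]]


# Toy memory and query embedding (illustrative values only).
memory: Memory = [
    ([1.0, 0.0, 0.0], "a dog running on grass"),
    ([0.8, 0.2, 0.0], "a puppy playing in a park"),
    ([0.0, 1.0, 0.0], "a plate of pasta on a table"),
    ([0.0, 0.0, 1.0], "a plane flying in the sky"),
]
query = [0.95, 0.05, 0.0]

retrieved = retrieve_captions(query, memory, k=2)
print(retrieved)
```

In a retrieval-augmented captioner, text retrieved this way would be passed to the language model as additional context. An exhaustive sort like the one above scales linearly with memory size, so a larger retrieval corpus would typically call for an approximate nearest-neighbor index instead.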

Funder

PNRR-M4C2

FAIR - Future Artificial Intelligence Research

European Commission

CREATIVE: CRoss-modal understanding and gEnerATIon of Visual and tExtual content

Italian Ministry of University and Research

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3663667
