A Review of Transformer-Based Approaches for Image Captioning

Published:2023-10-09 Issue:19 Volume:13 Page:11103
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Ondeng Oscar¹, Ouma Heywood¹, Akuon Peter¹

Affiliation:

1. Department of Electrical and Information Engineering, University of Nairobi, Nairobi P.O. Box 30197-00100, Kenya

Abstract

Visual understanding is a research area that bridges the gap between computer vision and natural language processing. Image captioning is a visual understanding task in which natural language descriptions of images are automatically generated using vision-language models. The transformer architecture was initially developed in the context of natural language processing and quickly found application in the domain of computer vision. Its recent application to the task of image captioning has resulted in markedly improved performance. In this paper, we briefly look at the transformer architecture and its genesis in attention mechanisms. We more extensively review a number of transformer-based image captioning models, including those employing vision-language pre-training, which has resulted in several state-of-the-art models. We give a brief presentation of the commonly used datasets for image captioning and also carry out an analysis and comparison of the transformer-based captioning models. We conclude by giving some insights into challenges as well as future directions for research in this area.

Funder

African Development Bank

National Research Fund (NRF) Kenya-South Africa

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/13/19/11103/pdf

Cited by 2 articles.

1. Image captioning by diffusion models: A survey;Engineering Applications of Artificial Intelligence;2024-12

2. DIC-Transformer: interpretation of plant disease classification results using image caption generation technology;Frontiers in Plant Science;2024-01-25