Affiliation:
1. McMaster University, Canada
2. Ferdowsi University of Mashhad, Iran
Abstract
Image captioning is a research area of immense importance, aiming to generate natural language descriptions for visual content in the form of still images. The advent of deep learning and, more recently, vision-language pre-training techniques has revolutionized the field, leading to more sophisticated methods and improved performance. In this survey article, we provide a structured review of deep learning methods in image captioning by presenting a comprehensive taxonomy and discussing each method category in detail. Additionally, we examine the datasets commonly employed in image captioning research, as well as the evaluation metrics used to assess the performance of different captioning models. We address the challenges faced in this field, emphasizing issues such as object hallucination, missing context, illumination conditions, contextual understanding, and referring expressions. We rank the performance of different deep learning methods according to widely used evaluation metrics, giving insight into the current state of the art. Furthermore, we identify several potential future directions for research in this area, including tackling the information misalignment problem between the image and text modalities, mitigating dataset bias, incorporating vision-language pre-training methods to enhance caption generation, and developing improved evaluation tools to accurately measure the quality of image captions.
Publisher
Association for Computing Machinery (ACM)
Subject
General Computer Science, Theoretical Computer Science
Cited by
30 articles.