A Systematic Literature Review on Using the Encoder-Decoder Models for Image Captioning in English and Arabic Languages
Published: 2023-09-30
Container-title: Applied Sciences
Volume: 13, Issue: 19, Page: 10894
ISSN: 2076-3417
Language: en
Authors:
Alsayed, Ashwaq (1); Arif, Muhammad (1); Qadah, Thamir M. (1); Alotaibi, Saud (2)
Affiliations:
1. Computer Science Department, Umm Al-Qura University, Makkah 24230, Saudi Arabia
2. Information Systems Department, Umm Al-Qura University, Makkah 24230, Saudi Arabia
Abstract
With the explosion of visual content on the Internet, creating captions for images has become a necessary task and an exciting topic for many researchers. Furthermore, image captioning is becoming increasingly important as the number of people using social media platforms grows. While there is extensive research on English image captioning (EIC), studies focusing on image captioning in other languages, especially Arabic, are limited, and no attempt has yet been made to survey Arabic image captioning (AIC) systematically. This research aims to systematically survey encoder-decoder EIC while considering the following aspects: visual model, language model, loss functions, datasets, evaluation metrics, model comparison, and adaptability to the Arabic language. A systematic review of the literature on EIC and AIC approaches published in the past nine years (2015–2023) in well-known databases (Google Scholar, ScienceDirect, IEEE Xplore) is undertaken. We identified 52 primary English and Arabic studies relevant to our objectives (11 of these address Arabic captioning; the remainder address English). The literature review shows that applying English-specific models to the Arabic language is possible, provided that a high-quality Arabic dataset is used and appropriate preprocessing is applied. Moreover, we discuss remaining limitations and ideas for addressing them as future directions.
Subject
Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science
References: 114 articles.
Cited by: 1 article.