Abstract
Image caption generation has emerged as a remarkable development that bridges the gap between Natural Language Processing (NLP) and Computer Vision (CV). It lies at the intersection of these fields and presents unique challenges, particularly when dealing with low-resource languages such as Urdu. The limited research on basic Urdu language understanding necessitates further exploration in this domain. In this study, we propose three Seq2Seq-based architectures specifically tailored for Urdu image caption generation. Our approach leverages transformer models to generate captions in Urdu, a significantly more challenging task than in English. To facilitate the training and evaluation of our models, we created an Urdu-translated subset of the Flickr8k dataset, which contains images featuring dogs in action accompanied by corresponding Urdu captions. We designed a deep learning-based approach using three different architectures: a Convolutional Neural Network (CNN) + Long Short-Term Memory (LSTM) model with soft attention employing Word2Vec embeddings, a CNN+Transformer model, and a ViT+RoBERTa model. Experimental results demonstrate that our proposed model outperforms existing state-of-the-art approaches, achieving a BLEU-1 score of 86 and a BERT-F1 score of 90. The generated Urdu image captions exhibit syntactic, contextual, and semantic correctness. Our study highlights the inherent challenges of retraining models on low-resource languages, and our findings underscore the potential of pre-trained models for facilitating the development of NLP and CV applications in low-resource language settings.
Publisher
Springer Science and Business Media LLC