Abstract
Image caption generation has emerged as a remarkable development that bridges the gap between Natural Language Processing (NLP) and Computer Vision (CV). It lies at the intersection of these fields and presents unique challenges, particularly when dealing with low-resource languages such as Urdu. The limited research on basic Urdu language understanding necessitates further exploration in this domain. In this study, we propose three Seq2Seq-based architectures specifically tailored for Urdu image caption generation. Our approach leverages transformer models to generate captions in Urdu, a significantly more challenging task than in English. To facilitate the training and evaluation of our models, we created an Urdu-translated subset of the Flickr8k dataset, which contains images featuring dogs in action accompanied by corresponding Urdu captions. We designed a deep learning-based approach using three different architectures: a Convolutional Neural Network (CNN) + Long Short-Term Memory (LSTM) model with soft attention employing Word2Vec embeddings, a CNN+Transformer model, and a ViT+RoBERTa model. Experimental results demonstrate that our proposed model outperforms existing state-of-the-art approaches, achieving a BLEU-1 score of 86 and a BERT-F1 score of 90. The generated Urdu image captions exhibit syntactic, contextual, and semantic correctness. Our study highlights the inherent challenges of retraining models on low-resource languages, and our findings underscore the potential of pre-trained models for facilitating the development of NLP and CV applications in low-resource language settings.
Publisher
Springer Science and Business Media LLC