Authors:
Subedi Nabaraj, Paudel Nirajan, Chhetri Manish, Acharya Sudarshan, Lamichhane Nabin
Abstract
The advent of deep neural networks has made image captioning, the task of generating text that describes the different parts of an image, far more feasible. Most work on this task has targeted English, with very little effort devoted to other languages, particularly Nepali. Research in Nepali is harder still because of its complex grammatical structure and vast linguistic domain. Moreover, the little work done in Nepali generates only a single sentence, whereas the proposed work emphasizes generating coherent, paragraph-length descriptions. The Stanford image paragraph dataset, translated into Nepali using the Google Translate API, is used in the proposed work, together with a manually curated dataset of 800 images of cultural sites of Nepal paired with Nepali captions. These two datasets were combined to train a deep learning model built on the transformer architecture. Image features were extracted using a pretrained Inception V3 model and, after positional encoding, fed into the encoder, while embedded caption tokens were fed into the decoder. The generated captions were evaluated using BLEU scores, showing high accuracy and strong BLEU scores on the test images.
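The pipeline described in the abstract (Inception V3 features, positional encoding, transformer encoder, caption-token decoder) can be sketched in TensorFlow/Keras. This is a minimal illustration only; the vocabulary size, model width, head count, and token ids below are assumed placeholders, not the paper's actual hyperparameters, and `weights=None` is used so the sketch runs without downloading ImageNet weights (the paper uses a pretrained model).

```python
import tensorflow as tf

VOCAB_SIZE = 5000   # assumed Nepali vocabulary size (illustrative)
D_MODEL = 256       # assumed embedding width (illustrative)

# 1. Image features from Inception V3 (no classification head).
cnn = tf.keras.applications.InceptionV3(include_top=False, weights=None)
image = tf.random.uniform((1, 299, 299, 3))        # stand-in for a real image
features = cnn(image)                              # (1, 8, 8, 2048)
features = tf.reshape(features, (1, -1, 2048))     # (1, 64, 2048) patch sequence
features = tf.keras.layers.Dense(D_MODEL)(features)

# 2. Add positional encodings before the encoder (learned positions here).
positions = tf.range(features.shape[1])
features += tf.keras.layers.Embedding(64, D_MODEL)(positions)

# 3. Encoder: self-attention over the image-patch sequence.
enc = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=D_MODEL)
memory = enc(features, features)                   # (1, 64, D_MODEL)

# 4. Decoder: embedded caption tokens cross-attend to the encoder output.
tokens = tf.constant([[2, 17, 45]])                # assumed token ids
tok_emb = tf.keras.layers.Embedding(VOCAB_SIZE, D_MODEL)(tokens)
dec = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=D_MODEL)
out = dec(tok_emb, memory)                         # (1, 3, D_MODEL)
logits = tf.keras.layers.Dense(VOCAB_SIZE)(out)    # per-position token scores
```

A full model would stack several such attention blocks with feed-forward layers, residual connections, and a causal mask on the decoder's self-attention; the sketch shows only one encoder and one cross-attention step to make the data flow concrete.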
Publisher
Inventive Research Organization