Improved Arabic image captioning model using feature concatenation with pre-trained word embedding

Published:2023-06-17 Issue:26 Volume:35 Page:19051-19067
ISSN:0941-0643
Container-title:Neural Computing and Applications
language:en
Short-container-title:Neural Comput & Applic

Author:

Elbedwehy Samar, Medhat T.

Abstract

Automatic image captioning helps identify features of multimedia content and detect interesting patterns, trends, and occurrences. English image captioning has recently made remarkable progress; Arabic image captioning, however, still lags behind, and Arabic caption generation remains a difficult machine learning problem. This paper presents a more accurate model for Arabic image captioning that uses transformer models in both phases: transformer-based vision models as image feature extractors in the encoder phase, and a pre-trained word embedding model in the decoder phase. All models are implemented, trained, and tested on the Arabic Flickr8k dataset. For the image feature extraction subsystem, we compared three vision models (SWIN, XCIT, and ConvNexT), individually and in concatenation, to find the most expressive image feature vector. For the caption generation subsystem, we tested four pre-trained language embedding models (ARABERT, ARAELECTRA, MARBERTv2, and CamelBERT) to select the most accurate one. Our experiments showed that concatenating the three transformer-based models (ConvNexT, SWIN, and XCIT) as the image feature extractor, combined with the CamelBERT language embedding model, gives the best BLEU-1 score, 0.5980, while ConvNexT concatenated with SWIN and combined with the ARAELECTRA language embedding model gives the best BLEU-4 score, 0.1664. Both are higher than the previously reported values of 0.443 and 0.157, respectively.
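
The feature concatenation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `concatenate_features`, the toy vector sizes, and the values are all assumptions made for illustration, and the real extractors would produce much longer vectors. The sketch only shows the general idea of joining per-model feature vectors end to end into one image representation.

```python
from typing import Dict, List, Sequence


def concatenate_features(
    features: Dict[str, Sequence[float]], order: Sequence[str]
) -> List[float]:
    """Join per-model feature vectors end to end in a fixed order."""
    fused: List[float] = []
    for name in order:
        fused.extend(features[name])
    return fused


# Toy stand-ins for the pooled outputs of three vision backbones.
# Sizes and values are illustrative assumptions, not taken from the paper.
toy_features = {
    "convnext": [0.1, 0.2, 0.3, 0.4],
    "swin": [0.5, 0.6, 0.7],
    "xcit": [0.8, 0.9],
}

fused = concatenate_features(toy_features, order=["convnext", "swin", "xcit"])
print(len(fused))  # 9 = 4 + 3 + 2
```

The fused vector's length is the sum of the individual lengths, so a decoder consuming it would need an input size matching the chosen combination of extractors.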

Funder

Kafr El Shiekh University

Publisher

Springer Science and Business Media LLC

Subject

Artificial Intelligence,Software

Link

https://link.springer.com/content/pdf/10.1007/s00521-023-08744-1.pdf

Reference: 42 articles.

1. Amirkhani A, Barshooi AH (2022) DeepCar 5.0: vehicle make and model recognition under challenging conditions. IEEE Trans Intell Transp Syst. https://doi.org/10.1109/TITS.2022.3212921

2. Barshooi AH, Amirkhani A (2022) A novel data augmentation based on Gabor filter and convolutional deep learning for improving the classification of COVID-19 chest X-Ray images. Biomed Signal Process Control 72:103326

3. ElJundi O, Dhaybi M, Mokadam K, Hajj HM, Asmar DC (2020) Resources and end-to-end neural network models for Arabic image captioning. In: VISIGRAPP (5: VISAPP), pp. 233–241

4. Attai A, Elnagar A (2020) A survey on Arabic image captioning systems using deep learning models. In: 14th international conference on innovations in information technology (IIT), pp. 114–119

5. Monaf S (2021) Arabic image captioning using deep learning with attention. University of Georgia, Georgia.