Abstract
Image captioning is the process of describing the content of an image, and caption generation typically relies on object detection to produce single-line descriptions. To improve the quality of the generated caption, object detection features are exploited: in this proposed work, features are extracted with an improved YOLO V5 model, which enhances the performance of the object detection process. An Xception V3 model is then applied to generate the word sequence from the predicted object features. Finally, the caption generated by Xception V3 is delivered as both text and speech in any selected language. The Flickr8k, Flickr30k, and MS COCO data sets are used to evaluate the proposed method. Natural Language Processing (NLP) techniques are used to understand the description of an image, making this method particularly useful for visually impaired people. The results show that the proposed method achieves 99.5% accuracy, 99.1% precision, 99.3% recall, and a 99.4% F1 score on the MS COCO data set using the improved YOLO V5 and Xception V3 models. Compared with existing techniques, the proposed method improves accuracy by 11–15%.
Publisher
Research Square Platform LLC