Abstract
Image captioning is the process of understanding the visual content of an image and expressing it as a meaningful phrase or sentence. Neuroscience research has only recently clarified the connection between human vision and language formation. Although many image-captioning methods exist, including content retrieval and template filling, the current trend is toward deep learning-based approaches: an image encoder produces feature vectors from the image, and a language decoder converts these feature vectors into a sequence of words. A plain encoder-decoder design, or one with simple attention, has not yielded sufficiently strong results, so the proposed model employs a double-awareness (dual-attention) mechanism. The primary goal of this study is to extract visual features from regions of interest (RoIs) of an image, together with text features obtained through GloVe embeddings. The Inception-ResNet variant of a convolutional neural network (CNN) is used as the encoder, and a gated recurrent unit (GRU) as the decoder. The proposed model is evaluated on the Flickr8k dataset, and the results show that the double-awareness mechanism is highly effective.
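To make the encoder-decoder interaction concrete, the sketch below shows one attention step of the kind such a model performs at each decoding position: the GRU decoder's hidden state is scored against the CNN's RoI feature vectors, and a weighted context vector is formed. This is a minimal NumPy illustration of generic additive (Bahdanau-style) attention, not the paper's exact double-awareness formulation; all dimensions, weight matrices, and function names here are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D score vector.
    e = np.exp(x - np.max(x))
    return e / e.sum()

def additive_attention(features, hidden, W1, W2, v):
    # features: (num_regions, d_feat) RoI feature vectors from the CNN encoder
    # hidden:   (d_hid,) current GRU decoder hidden state
    # W1, W2, v: learned projection parameters (randomly initialized here)
    scores = np.tanh(features @ W1 + hidden @ W2) @ v   # (num_regions,)
    weights = softmax(scores)                           # attention over regions
    context = weights @ features                        # (d_feat,) weighted sum
    return context, weights

# Toy example with assumed dimensions.
rng = np.random.default_rng(0)
num_regions, d_feat, d_hid, d_att = 8, 16, 12, 10
features = rng.normal(size=(num_regions, d_feat))
hidden = rng.normal(size=(d_hid,))
W1 = rng.normal(size=(d_feat, d_att))
W2 = rng.normal(size=(d_hid, d_att))
v = rng.normal(size=(d_att,))

context, weights = additive_attention(features, hidden, W1, W2, v)
```

In a full captioning model, the context vector would be concatenated with the current word embedding (e.g. a GloVe vector) and fed to the GRU to predict the next word; a dual-attention design repeats a step like this over both visual and textual features.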
Subject
Applied Mathematics, Algebra and Number Theory, Analysis