Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction-Reference-Cited by-同舟云学术

Image caption generation using Visual Attention Prediction and Contextual Spatial Relation Extraction

Published:2023-02-08 Issue:1 Volume:10 Page:
ISSN:2196-1115
Container-title:Journal of Big Data
language:en
Short-container-title:J Big Data

Author:

Sasibhooshan Reshmi,Kumaraswamy Suresh,Sasidharan Santhoshkumar

Abstract

AbstractAutomatic caption generation with attention mechanisms aims at generating more descriptive captions containing coarser to finer semantic contents in the image. In this work, we use an encoder-decoder framework employing Wavelet transform based Convolutional Neural Network (WCNN) with two level discrete wavelet decomposition for extracting the visual feature maps highlighting the spatial, spectral and semantic details from the image. The Visual Attention Prediction Network (VAPN) computes both channel and spatial attention for obtaining visually attentive features. In addition to these, local features are also taken into account by considering the contextual spatial relationship between the different objects. The probability of the appropriate word prediction is achieved by combining the aforementioned architecture with Long Short Term Memory (LSTM) decoder network. Experiments are conducted on three benchmark datasets—Flickr8K, Flickr30K and MSCOCO datasets and the evaluation results prove the improved performance of the proposed model with CIDEr score of 124.2.

Publisher

Springer Science and Business Media LLC

Subject

Information Systems and Management,Computer Networks and Communications,Hardware and Architecture,Information Systems

Link

https://link.springer.com/content/pdf/10.1186/s40537-023-00693-9.pdf

Reference62 articles.

1. Li S, Kulkarni G, Berg TL, Berg AC, Choi Y. Composing simple image descriptions using web-scale n-grams. In: Proceedings of the Fifteenth Conference on Computational Natural Language Learning, 2011, pp. 220–228

2. Lin D. An information-theoretic definition of similarity. In: Proceedings of the Fifteenth International Conference on Machine Learning, 1998, pp. 296–304

3. Vinyals O, Toshev A, Bengio S, Erhan D. Show and tell: A neural image caption generator. In: IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.

4. Jing Z, Kangkang L, Zhe W. Parallel-fusion lstm with synchronous semantic and visual information for image captioning. J Vis Commun Image Represent. 2021;75(8): 103044.

5. Jia X, Gavves E, Fernando B, Tuytelaars T. Guiding the long-short term memory model for image caption generation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015

Cited by 17 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. A novel image captioning model with visual-semantic similarities and visual representations re-weighting;Journal of King Saud University - Computer and Information Sciences;2024-09

2. ICEAP: An advanced fine-grained image captioning network with enhanced attribute predictor;Displays;2024-09

3. Optimizing image captioning: The effectiveness of vision transformers and VGG networks for remote sensing;Big Data Research;2024-08

4. Knowledge Enhancement and Optimization Strategies for Remote Sensing Image Captioning Using Contrastive Language Image Pre-training and Large Language Models;2024 International Conference on Machine Intelligence and Digital Applications;2024-05-30

5. Attribute guided fusion network for obtaining fine-grained image captions;Multimedia Tools and Applications;2024-05-27