Image Caption Generation Using Multi-Level Semantic Context Information

Published: 2021-06-30 Issue: 7 Volume: 13 Page: 1184
ISSN: 2073-8994
Container-title: Symmetry
Language: en
Short-container-title: Symmetry

Author:

Tian Peng, Mo Hongwei, Jiang Laihao

Abstract

Object detection, visual relationship detection, and image captioning, the three main visual tasks in scene understanding, are highly correlated and correspond to different semantic levels of a scene image. However, existing captioning methods convert the extracted image features into description text, and the results are not satisfactory. In this work, we propose a Multi-level Semantic Context Information (MSCI) network with an overall symmetrical structure that leverages the mutual connections across the three semantic layers and extracts the context information between them, jointly solving the three vision tasks to achieve an accurate and comprehensive description of the scene image. The model uses a feature refining structure to establish mutual connections and iteratively update the different semantic features of the image. A context information extraction network then extracts the context information between the three semantic layers, and an attention mechanism is introduced to improve the accuracy of image captioning, while the context information between the different semantic layers is used to improve the accuracy of object detection and relationship detection. Experiments on the VRD and COCO datasets demonstrate that the proposed model can leverage the context information between semantic layers to improve accuracy on these visual tasks.
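
The abstract does not state which form of attention MSCI uses. As a rough illustration only, the sketch below shows generic dot-product soft attention over one feature vector per semantic level (object, relationship, caption). The function names, the toy feature vectors, and the `decoder_state` query are all hypothetical and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Generic dot-product soft attention (an assumed form, not MSCI's).

    Scores each feature vector by its dot product with the query,
    normalizes the scores with softmax, and returns the weights
    together with the weighted sum of the feature vectors.
    """
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return weights, context


# Hypothetical toy features, one per semantic level named in the abstract.
level_features = {
    "object": [1.0, 0.0, 0.0],
    "relationship": [0.0, 1.0, 0.0],
    "caption": [0.0, 0.0, 1.0],
}
decoder_state = [2.0, 0.5, 0.1]  # hypothetical query vector

weights, context = attend(decoder_state, list(level_features.values()))
for name, weight in zip(level_features, weights):
    print(f"{name}: {weight:.3f}")
```

In this toy setup the query is most similar to the object-level vector, so that level receives the largest weight; a real captioning model would learn both the features and the query rather than fix them by hand.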

Publisher

MDPI AG

Subject

Physics and Astronomy (miscellaneous), General Mathematics, Chemistry (miscellaneous), Computer Science (miscellaneous)

Link

https://www.mdpi.com/2073-8994/13/7/1184/pdf

References: 56 articles.

1. Deep learning for visual understanding: A review; Yan; Neurocomputing, 2016

2. Multimodal object description network for dense captioning

Cited by 8 articles.

1. MSAM: Deep Semantic Interaction Network for Visual Question Answering; Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering; 2024

2. Deep image captioning: A review of methods, trends and future challenges; Neurocomputing; 2023-08

3. Supervised Deep Learning Techniques for Image Description: A Systematic Review; Entropy; 2023-03-23

4. Generating Human-Like Descriptions for the Given Image Using Deep Learning; ITM Web of Conferences; 2023

5. Image captioning with residual swin transformer and Actor-Critic; Neural Computing and Applications; 2022-10-05