Text Embedding Bank for Detailed Image Paragraph Captioning

Published:2021-05-18 Issue:18 Volume:35 Page:15791-15792
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

Gupta Arjun, Shen Zengming, Huang Thomas

Abstract

Existing deep learning-based models for image captioning typically consist of an image encoder to extract visual features and a language model decoder, an architecture that has shown promising results in single high-level sentence generation. However, only the word-level guiding signal is available when the image encoder is optimized to extract visual features. The inconsistency between the parallel extraction of visual features and sequential text supervision limits the success of this architecture when the generated text is long (more than 50 words). We propose a new module, called the Text Embedding Bank (TEB), to address this problem for image paragraph captioning. This module uses the paragraph vector model to learn fixed-length feature representations from a variable-length paragraph. We refer to the fixed-length feature as the TEB. This TEB module plays two roles to benefit paragraph captioning performance. First, it acts as a form of global and coherent deep supervision to regularize visual feature extraction in the image encoder. Second, it acts as a distributed memory to provide features of the whole paragraph to the language model, which alleviates the long-term dependency problem. Adding this module to two existing state-of-the-art methods achieves a new state-of-the-art result on the Stanford Visual Genome paragraph captioning dataset.
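
The two roles described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify the loss or how the TEB is fed to the language model, so the mean-squared-error auxiliary loss, the concatenation of the vector onto each word embedding, the linear projection, and all names and dimensions below are assumptions made purely for illustration. The target vector is random here and merely stands in for a paragraph vector learned from the ground-truth paragraph.

```python
import random

def linear(weights, vec):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def teb_supervision(visual_feat, teb_target, proj):
    """Role 1 (assumed form): map visual features into the TEB space and
    score the result against the paragraph's fixed-length embedding."""
    predicted = linear(proj, visual_feat)
    return predicted, mse(predicted, teb_target)

def decoder_inputs(word_embs, teb_vec):
    """Role 2 (assumed form): append the same paragraph-level vector to
    every word embedding, so each decoding step sees global context."""
    return [emb + teb_vec for emb in word_embs]

rng = random.Random(0)
VIS_DIM, TEB_DIM, WORD_DIM, STEPS = 8, 4, 3, 5  # toy sizes, hypothetical

visual_feat = [rng.uniform(-1, 1) for _ in range(VIS_DIM)]
teb_target = [rng.uniform(-1, 1) for _ in range(TEB_DIM)]  # stand-in for a paragraph vector
proj = [[rng.uniform(-1, 1) for _ in range(VIS_DIM)] for _ in range(TEB_DIM)]
word_embs = [[rng.uniform(-1, 1) for _ in range(WORD_DIM)] for _ in range(STEPS)]

# Auxiliary loss that would be added to the usual word-level captioning loss.
predicted, aux_loss = teb_supervision(visual_feat, teb_target, proj)

# The predicted vector is used here because the true paragraph is unknown at test time.
inputs = decoder_inputs(word_embs, predicted)

print(len(predicted), len(inputs), len(inputs[0]))
```

In this sketch the auxiliary loss gives the image encoder a single paragraph-level target alongside the per-word signal, and the concatenated vector keeps whole-paragraph information available at every decoding step regardless of how far generation has progressed.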

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 2 articles.

1. Look and Review, Then Tell: Generate More Coherent Paragraphs from Images by Fusing Visual and Textual Information;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30

2. Crowded pose-guided multi-task learning for instance-level human parsing;Machine Vision and Applications;2023-05-05