TCIC: Theme Concepts Learning Cross Language and Vision for Image Captioning

Published: 2021-08
Container-title: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence

Author:

Fan Zhihao¹, Wei Zhongyu¹, Wang Siyuan¹, Wang Ruize¹, Li Zejun¹, Shan Haijun², Huang Xuanjing¹

Affiliation:

1. Fudan University

2. Zhejiang Lab

Abstract

Existing research on image captioning usually represents an image as a scene graph of low-level facts (objects and relations) and fails to capture high-level semantics. In this paper, we propose a Theme Concepts extended Image Captioning (TCIC) framework that incorporates theme concepts to represent high-level cross-modality semantics. In practice, we model theme concepts as memory vectors and propose a Transformer with Theme Nodes (TTN) to incorporate those vectors for image captioning. Because theme concepts can be learned from both images and captions, we propose two TTN-based settings for learning their representations. On the vision side, TTN takes both scene-graph-based features and theme concepts as input for visual representation learning. On the language side, TTN takes both captions and theme concepts as input for text representation reconstruction. Both settings generate target captions with the same transformer-based decoder. During training, we further align the theme concept representations learned from images with those learned from the corresponding captions to enforce cross-modality learning. Experimental results on MS COCO show the effectiveness of our approach compared to some state-of-the-art models.
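
The abstract's central idea (theme concepts as memory vectors that join the input of an attention encoder, with the theme outputs from the vision and language sides aligned during training) can be illustrated with a toy sketch. This is not the authors' implementation. Everything below is an assumption made for illustration: the function names, the single-head attention with identity projections, the use of one memory shared by both sides, the random stand-in features, and the mean-squared-error form of the alignment term, none of which the abstract specifies.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(nodes):
    """Single-head scaled dot-product self-attention.

    Simplification (assumption): queries, keys, and values are the node
    vectors themselves, with no learned projections.
    """
    dim = len(nodes[0])
    out = []
    for q in nodes:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in nodes]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, nodes)) for i in range(dim)])
    return out


def encode_with_theme_nodes(theme_memory, inputs):
    """Prepend theme vectors to the inputs, attend over all nodes jointly,
    and return (theme node outputs, input node outputs)."""
    encoded = self_attention(theme_memory + inputs)
    n = len(theme_memory)
    return encoded[:n], encoded[n:]


def alignment_loss(theme_a, theme_b):
    """Mean squared difference between corresponding theme node outputs.

    Hypothetical choice: the abstract does not state which alignment
    objective the paper uses.
    """
    total, count = 0.0, 0
    for va, vb in zip(theme_a, theme_b):
        for a, b in zip(va, vb):
            total += (a - b) ** 2
            count += 1
    return total / count


rng = random.Random(0)
DIM, NUM_THEMES = 4, 3


def rand_vecs(n):
    """Random vectors used as stand-ins for learned features."""
    return [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(n)]


theme_memory = rand_vecs(NUM_THEMES)  # assumed to be shared by both sides
visual_feats = rand_vecs(5)           # stand-in for scene-graph node features
caption_feats = rand_vecs(6)          # stand-in for caption token embeddings

theme_vis, vis_out = encode_with_theme_nodes(theme_memory, visual_feats)
theme_txt, txt_out = encode_with_theme_nodes(theme_memory, caption_feats)
loss = alignment_loss(theme_vis, theme_txt)
```

In this sketch, `vis_out` and `txt_out` would each feed the same caption decoder, and `loss` would be added to the captioning objective so that the theme nodes carry similar content whether they were computed from the image or from its caption.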

Publisher

International Joint Conferences on Artificial Intelligence Organization

Cited by 9 articles.

1. HIST: Hierarchical and sequential transformer for image captioning;IET Computer Vision;2024-08-15

2. MGTANet: Multi-Scale Guided Token Attention Network for Image Captioning;Proceedings of the 2024 3rd International Conference on Cyber Security, Artificial Intelligence and Digital Economy;2024-03

3. Image Captioning With Controllable and Adaptive Length Levels;IEEE Transactions on Pattern Analysis and Machine Intelligence;2024-02

4. Embedded Heterogeneous Attention Transformer for Cross-Lingual Image Captioning;IEEE Transactions on Multimedia;2024

5. Integrating grid features and geometric coordinates for enhanced image captioning;Applied Intelligence;2023-12-07