S2 Transformer for Image Captioning

Published: 2022-07
Container-title: Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence

Author:

Zeng Pengpeng¹, Zhang Haonan¹, Song Jingkuan¹, Gao Lianli¹

Affiliation:

1. University of Electronic Science and Technology of China

Abstract

Transformer-based architectures with grid features represent the state of the art in visual and language reasoning tasks, such as visual question answering and image-text matching. However, directly applying them to image captioning may lose spatial and fine-grained semantic information, and their applicability to image captioning remains largely under-explored. To address this, we propose a simple yet effective method, the Spatial- and Scale-aware Transformer (S2 Transformer), for image captioning. Specifically, we first propose a Spatial-aware Pseudo-supervised (SP) module, which uses feature clustering to help preserve spatial information for grid features. Next, to maintain the model size while producing superior results, we build a simple weighted residual connection, named the Scale-wise Reinforcement (SR) module, to simultaneously explore both low- and high-level encoded features with rich semantics. Extensive experiments on the MSCOCO benchmark demonstrate that our method achieves new state-of-the-art performance without adding excessive parameters compared with the vanilla transformer. The source code is available at https://github.com/zchoi/S2-Transformer
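
The abstract describes the SR module only as "a simple weighted residual connection" over low- and high-level encoded features. The sketch below shows one plausible reading of that phrase: the top encoder layer's grid features plus a weighted mean of all layers' features. It is not taken from the paper or its repository. The function name `scale_wise_residual`, the `weight` parameter, and the mean-based mixing rule are all assumptions made for illustration.

```python
from typing import List

Vector = List[float]


def scale_wise_residual(layer_outputs: List[List[Vector]], weight: float = 0.2) -> List[Vector]:
    """Hypothetical weighted residual over multi-level features.

    layer_outputs[l][p] is the feature vector of grid position p at encoder
    layer l (low-level layers first). The result is the top layer's features
    plus `weight` times the mean over all layers. The mixing rule is an
    assumption, not the authors' formulation.
    """
    if not layer_outputs:
        raise ValueError("need at least one encoder layer output")

    top = layer_outputs[-1]
    num_layers = len(layer_outputs)
    fused: List[Vector] = []
    for pos in range(len(top)):
        row: Vector = []
        for dim in range(len(top[pos])):
            # Average this feature dimension across all encoder layers.
            mean_all = sum(layer[pos][dim] for layer in layer_outputs) / num_layers
            # Residual: keep the top-layer feature, add the weighted mean.
            row.append(top[pos][dim] + weight * mean_all)
        fused.append(row)
    return fused
```

Because the combination adds no learned projection beyond a scalar weight, it illustrates how low-level features could be reused without growing the parameter count, which is the property the abstract emphasizes.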

Publisher

International Joint Conferences on Artificial Intelligence Organization

Cited by 17 articles.

1. Unsupervised disease tags for automatic radiology report generation;Biomedical Signal Processing and Control;2024-03

2. Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning;2023 IEEE International Conference on Big Data (BigData);2023-12-15

3. A Doctors Behavior Aware and Domain Knowledge Driven Model for Medical Report Generation;2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM);2023-12-05

4. PSNet: position-shift alignment network for image caption;International Journal of Multimedia Information Retrieval;2023-11-27

5. CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning;Proceedings of the 31st ACM International Conference on Multimedia;2023-10-26