Attention-Aligned Transformer for Image Captioning

Published:2022-06-28 Issue:1 Volume:36 Page:607-615
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

Fei Zhengcong

Abstract

Attention-based image captioning models, which are expected to ground the correct image regions for each generated word, have recently achieved remarkable performance. However, some researchers have pointed out a “deviated focus” problem: existing attention mechanisms do not reliably identify the effective and influential image features. In this paper, we present A2, an attention-aligned Transformer for image captioning, which guides attention learning in a perturbation-based, self-supervised manner without any annotation overhead. Specifically, we apply a mask operation to image regions through a learnable network to estimate their true function in the final description generation. We hypothesize that the necessary image region features, for which a small disturbance causes an obvious performance degradation, deserve more attention weight. We then propose four alignment strategies that use this information to refine the attention weight distribution. Under this scheme, image regions are attended to correctly with respect to the output words. Extensive experiments on the MS COCO dataset demonstrate that the proposed A2 Transformer consistently outperforms baselines in both automatic metrics and human evaluation. Trained models and code for reproducing the experiments are publicly available.
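The perturbation idea in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the abstract describes a learnable mask network and four alignment strategies, whereas the sketch below substitutes plain occlusion (zeroing one region at a time) and a single KL-divergence penalty. The names `toy_score`, `perturbation_importance`, and `alignment_loss` are hypothetical, and `toy_score` is only a stand-in for a captioning model's score for a target word.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def perturbation_importance(regions, score_fn):
    """Score drop observed when each region is zeroed out in turn.

    Assumption: simple occlusion replaces the learnable mask network
    mentioned in the abstract.
    """
    base = score_fn(regions)
    drops = []
    for i in range(len(regions)):
        masked = [
            [0.0] * len(r) if j == i else r
            for j, r in enumerate(regions)
        ]
        drops.append(base - score_fn(masked))
    return drops


def alignment_loss(attention, target, eps=1e-12):
    """KL(target || attention): penalizes attention that deviates
    from the perturbation-derived target distribution."""
    return sum(
        t * math.log((t + eps) / (a + eps))
        for t, a in zip(target, attention)
    )


def toy_score(regions):
    """Hypothetical scorer: the second region matters most."""
    weights = [0.1, 2.0, 0.3]
    return sum(w * sum(r) for w, r in zip(weights, regions))


# Three toy "image regions", each a 2-dimensional feature vector.
regions = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]

drops = perturbation_importance(regions, toy_score)
target = softmax(drops)  # regions with larger drops get more weight
uniform_attention = [1.0 / 3] * 3

print("score drops:", drops)
print("target attention:", target)
print("loss for uniform attention:", alignment_loss(uniform_attention, target))
print("loss for aligned attention:", alignment_loss(target, target))
```

In this toy setting, the region whose removal hurts the score most receives the largest target weight, and the loss falls to zero once the attention distribution matches the target. In a real model the loss would be added to the captioning objective so that attention is trained toward the perturbation-derived distribution.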

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 13 articles.

1. HIST: Hierarchical and sequential transformer for image captioning;IET Computer Vision;2024-08-15

2. ControlCap: Controllable Captioning via No-Fuss Lexicon;ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP);2024-04-14

3. Impact of Language-Specific Training on Image Caption Synthesis: A Case Study on Low-Resource Assamese Language;International Journal of Asian Language Processing;2024-03

4. Image Captioning with Visual Positional Embedding and Bi-linear Pooling;Communications in Computer and Information Science;2024

5. Evaluate The Image Captioning Technique Using State-of-the-art, Attention And Non-Attention Models To Generate Human Like Captions;2023 16th International Conference on Developments in eSystems Engineering (DeSE);2023-12-18