Adaptive Text Denoising Network for Image Caption Editing

Published:2023-02-03 Issue:1s Volume:19 Page:1-18
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Yuan Mengqi¹, Bao Bing-Kun¹, Tan Zhiyi¹, Xu Changsheng²

Affiliation:

1. Nanjing University of Posts and Telecommunications, Nanjing, China

2. Peng Cheng Laboratory; University of Chinese Academy of Sciences; NLPR, Institute of Automation, CAS, Beijing, China

Abstract

Image caption editing, which aims at editing inaccurate descriptions of images, is an interdisciplinary task spanning computer vision and natural language processing. As the task requires encoding the image and its corresponding inaccurate caption simultaneously and decoding to generate an accurate image caption, the encoder-decoder framework is widely adopted for image caption editing. However, existing methods mostly focus on the decoder and ignore a major challenge on the encoder side: the semantic inconsistency between image and caption. To this end, we propose a novel Adaptive Text Denoising Network (ATD-Net) to filter out noise at the word level and improve the model’s robustness at the sentence level. Specifically, at the word level, we design a cross-attention mechanism called the Textual Attention Mechanism (TAM) to distinguish the misdescriptive words. The TAM is designed to encode the inaccurate caption word by word based on the content of both image and caption. At the sentence level, in order to minimize the influence of misdescriptive words on the semantics of an entire caption, we introduce a Bidirectional Encoder to extract the correct semantic representation from the raw caption. The Bidirectional Encoder is able to model the global semantics of the raw caption, which enhances the robustness of the framework. We extensively evaluate our proposals on the MS-COCO image captioning dataset and demonstrate the effectiveness of our method compared with state-of-the-art methods.
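
The abstract names a cross-attention mechanism but gives no equations for it. As a rough illustration only, the following sketch shows generic scaled dot-product cross-attention in which each caption word queries a set of image region features. It is not the paper's TAM: the function names, the toy vectors, and the choice of scaled dot-product scoring are all assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(word_vecs, region_vecs):
    """Generic scaled dot-product cross-attention (NOT the paper's TAM).

    word_vecs:   one query vector per caption word (hypothetical embeddings)
    region_vecs: one key/value vector per image region (hypothetical features)
    Returns the attended visual context per word and the attention weights.
    """
    dim = len(region_vecs[0])
    contexts, weights = [], []
    for query in word_vecs:
        # Similarity of this word to every image region, scaled by sqrt(dim).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in region_vecs]
        attn = softmax(scores)
        # Weighted sum of region features: the visual context for this word.
        context = [sum(a * region[i] for a, region in zip(attn, region_vecs))
                   for i in range(dim)]
        contexts.append(context)
        weights.append(attn)
    return contexts, weights

# Toy data (assumed): two caption words, three image regions, 2-d features.
words = [[1.0, 0.0], [0.0, 1.0]]
regions = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
contexts, weights = cross_attention(words, regions)
```

Each row of `weights` is a probability distribution over image regions for one caption word. How ATD-Net turns such word-to-image interactions into a decision about misdescriptive words is specified in the paper itself, not in this sketch.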

Funder

National Key Research and Development Project

National Natural Science Foundation of China

Natural Science Foundation of Jiangsu Province

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3532627

References: 58 articles.

1. Show and tell: A neural image caption generator

2. A. Karpathy and L. Fei-Fei. 2017. Deep visual-semantic alignments for generating image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 39, 4 (2017), 664–676.

3. K. Xu et al. 2015. Show, attend and tell: Neural image caption generation with visual attention. In Proceedings of the International Conference on Machine Learning. 2048–2057.

4. Self-Critical Sequence Training for Image Captioning

5. Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering

Cited by 2 articles.

1. Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 2024-01-22.

2. Sentiment-Oriented Transformer-Based Variational Autoencoder Network for Live Video Commenting. ACM Transactions on Multimedia Computing, Communications, and Applications, 2024-01-11.