Affiliation:
1. College of Computer Science, Nankai University, Tianjin, China
2. IBM Research - China, Beijing, China
Abstract
Image captioning for low-resource languages has attracted much attention recently. Researchers have proposed augmenting the low-resource caption dataset into (image, rich-resource-language caption, low-resource-language caption) triplets and developing a dual attention mechanism that exploits these triplets during training to improve performance. However, datasets in triplet form are usually small because they are costly to collect. On the other hand, many large-scale datasets already exist that contain one pair from the triplet, such as caption datasets in the rich-resource language and translation datasets from the rich-resource language to the low-resource language. In this article, we revisit the caption-translation pipeline of the translation-based approach to utilize not only the triplet dataset but also the large-scale paired datasets in training. The pipeline is composed of two models: a caption model for the rich-resource language and a translation model from the rich-resource language to the low-resource language. Unfortunately, it is not trivial to fully benefit from incorporating both the triplet dataset and the paired datasets into the pipeline, owing to the gap between the training and testing phases and the instability of the training process. We propose to jointly optimize the two models of the pipeline in an end-to-end manner to bridge the training-testing gap, and we introduce two auxiliary training objectives to stabilize training. Experimental results show that the proposed method improves significantly over state-of-the-art methods.
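The sketch below illustrates, at a toy scale, the kind of end-to-end caption-translation pipeline the abstract describes: a caption model for the rich-resource language feeding a translation model for the low-resource language, optimized jointly. The module names, the soft-token bridge between the two models, the specific choice of auxiliary objectives (per-model supervision on paired data), and the loss weights are all illustrative assumptions, not the authors' implementation.

```python
# Minimal PyTorch sketch of a jointly trained caption-translation pipeline.
# All names, dimensions, and loss choices are hypothetical placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CaptionModel(nn.Module):
    """Toy stand-in: image features -> rich-resource-language caption logits."""
    def __init__(self, feat_dim=2048, hidden=512, vocab_rich=10000):
        super().__init__()
        self.proj = nn.Linear(feat_dim, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_rich)

    def forward(self, image_feats, steps=20):
        h = torch.tanh(self.proj(image_feats)).unsqueeze(0)      # (1, B, H)
        inp = h.transpose(0, 1).expand(-1, steps, -1)            # (B, T, H)
        dec, _ = self.decoder(inp, h)
        return self.out(dec)                                     # (B, T, V_rich)


class TranslationModel(nn.Module):
    """Toy stand-in: soft rich-language token distributions -> low-language logits."""
    def __init__(self, vocab_rich=10000, vocab_low=10000, hidden=512):
        super().__init__()
        self.embed = nn.Linear(vocab_rich, hidden)   # accepts soft inputs, keeping the pipeline differentiable
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_low)

    def forward(self, soft_tokens):
        enc, h = self.encoder(self.embed(soft_tokens))
        dec, _ = self.decoder(enc, h)
        return self.out(dec)                                     # (B, T, V_low)


captioner, translator = CaptionModel(), TranslationModel()
optimizer = torch.optim.Adam(
    list(captioner.parameters()) + list(translator.parameters()), lr=1e-4)
ce = nn.CrossEntropyLoss()

# Random tensors stand in for one triplet batch and one translation-pair batch.
image_feats = torch.randn(4, 2048)
rich_caption = torch.randint(0, 10000, (4, 20))   # rich-resource reference caption
low_caption = torch.randint(0, 10000, (4, 20))    # low-resource reference caption
trans_src = torch.randint(0, 10000, (4, 20))      # paired translation data (source)
trans_tgt = torch.randint(0, 10000, (4, 20))      # paired translation data (target)

# Main end-to-end objective: image -> rich caption logits -> low-resource caption.
rich_logits = captioner(image_feats)
low_logits = translator(rich_logits.softmax(-1))
loss_main = ce(low_logits.reshape(-1, low_logits.size(-1)), low_caption.reshape(-1))

# Two auxiliary objectives (assumed here: supervise each sub-model on its own
# paired dataset) to stabilize the joint training process.
loss_aux_cap = ce(rich_logits.reshape(-1, rich_logits.size(-1)), rich_caption.reshape(-1))
trans_logits = translator(F.one_hot(trans_src, 10000).float())
loss_aux_trans = ce(trans_logits.reshape(-1, trans_logits.size(-1)), trans_tgt.reshape(-1))

loss = loss_main + 0.5 * loss_aux_cap + 0.5 * loss_aux_trans
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Passing the caption model's softmax output (rather than decoded discrete tokens) into the translation model is one plausible way to keep the two-model pipeline differentiable end to end; the paper itself does not specify this mechanism here, so treat it as an assumption of the sketch.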
Funder
Chinese Scientific and Technical Innovation Project 2030
NSFC-Xinjiang Joint Fund
NSFC-General Technology Joint Fund for Basic Research
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications, Hardware and Architecture