Modality-Invariant Image-Text Embedding for Image-Sentence Matching

Published:2019-02-28 Issue:1 Volume:15 Page:1-19
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Liu Ruoyu¹, Zhao Yao¹, Wei Shikui¹, Zheng Liang², Yang Yi³

Affiliation:

1. Beijing Jiaotong University, Beijing, P. R. China

2. Australian National University, Australia

3. University of Technology Sydney, Ultimo NSW, Australia

Abstract

Performing direct matching among different modalities (such as image and text) can benefit many tasks in computer vision, multimedia, information retrieval, and information fusion. Most existing works focus on class-level image-text matching, called cross-modal retrieval, which attempts to propose a uniform model for matching images with all types of texts, for example, tags, sentences, and articles (long texts). Although cross-modal retrieval alleviates the heterogeneous gap between visual and textual information, it can provide only a rough correspondence between the two modalities. In this article, we propose a more precise image-text embedding method, image-sentence matching, which can provide heterogeneous matching at the instance level. The key issue for image-text embedding is how to make the distributions of the two modalities consistent in the embedding space. To address this problem, some previous works on the cross-modal retrieval task have attempted to pull their distributions close together by employing adversarial learning. However, the effectiveness of adversarial learning for image-sentence matching has not been demonstrated, and an effective method is still lacking. Inspired by previous works, we propose to learn a modality-invariant image-text embedding for image-sentence matching by incorporating adversarial learning. On top of the triplet-loss-based baseline, we design a modality classification network with an adversarial loss, which classifies an embedding into either the image or text modality. In addition, the multi-stage training procedure is carefully designed so that the proposed network not only imposes the image-text similarity constraints through ground-truth labels but also enforces the image and text embedding distributions to be similar through adversarial learning. Experiments on two public datasets (Flickr30k and MSCOCO) demonstrate that our method yields stable accuracy improvement over the baseline model and that our results compare favorably to the state-of-the-art methods.
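
The two loss terms named in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact formulation, so everything below is an assumption made for illustration: the cosine similarity, the margin value, the logistic modality classifier, the weighting factor `lam`, and all function names are hypothetical. The multi-stage training schedule is not modeled at all.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def triplet_loss(img_embs, txt_embs, margin=0.2):
    """Bidirectional hinge-based triplet loss (assumed form).

    img_embs[i] and txt_embs[i] are a ground-truth matching pair;
    every other pairing in the batch is treated as a negative.
    """
    n = len(img_embs)
    total = 0.0
    for i in range(n):
        pos = cosine(img_embs[i], txt_embs[i])
        for j in range(n):
            if j == i:
                continue
            # Image i as anchor, non-matching sentence j as negative.
            total += max(0.0, margin - pos + cosine(img_embs[i], txt_embs[j]))
            # Sentence i as anchor, non-matching image j as negative.
            total += max(0.0, margin - pos + cosine(img_embs[j], txt_embs[i]))
    return total

def modality_prob(emb, w, b):
    """Hypothetical linear modality classifier: P(emb is an image)."""
    z = sum(wi * xi for wi, xi in zip(w, emb)) + b
    return 1.0 / (1.0 + math.exp(-z))

def modality_classification_loss(img_embs, txt_embs, w, b):
    """Mean binary cross-entropy with label 1 = image, 0 = text."""
    eps = 1e-12  # guards against log(0)
    loss = 0.0
    for e in img_embs:
        loss -= math.log(modality_prob(e, w, b) + eps)
    for e in txt_embs:
        loss -= math.log(1.0 - modality_prob(e, w, b) + eps)
    return loss / (len(img_embs) + len(txt_embs))

def embedding_objective(img_embs, txt_embs, w, b, margin=0.2, lam=0.1):
    """Objective for the embedding networks in this sketch.

    The embeddings are trained to reduce the triplet loss while
    increasing the classifier's loss, so that the classifier can no
    longer tell the two modalities apart.
    """
    return (triplet_loss(img_embs, txt_embs, margin)
            - lam * modality_classification_loss(img_embs, txt_embs, w, b))
```

In this sketch the classifier parameters `w` and `b` would be updated to minimize `modality_classification_loss`, while the embedding networks would be updated to minimize `embedding_objective`; alternating the two updates is one common way to set up adversarial training, and is not necessarily the procedure used in the paper.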

Funder

Fundamental Research Funds for the Central Universities

Natural Science Foundation of China

Joint Fund of Ministry of Education of China and China Mobile

National Key Research and Development of China

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3300939

References: 61 articles.

Cited by 28 articles.

1. MIT-FRNet: Modality-invariant temporal representation learning-based feature reconstruction network for missing modalities;Expert Systems with Applications;2024-09

2. Collaborative group: Composed image retrieval via consensus learning from noisy annotations;Knowledge-Based Systems;2024-09

3. Multi-space channel representation learning for mono-to-binaural conversion based audio deepfake detection;Information Fusion;2024-05

4. Unsupervised Color Segmentation with Reconstructed Spatial Weighted Gaussian Mixture Model and Random Color Histogram;Computers, Materials & Continua;2024

5. Collaborative Group: Composed Image Retrieval Via Consensus Learning from Noisy Annotations;2024