Dual-path Convolutional Image-Text Embeddings with Instance Loss

Published:2020-05-31 Issue:2 Volume:16 Page:1-23
ISSN:1551-6857
Container-title:ACM Transactions on Multimedia Computing, Communications, and Applications
language:en
Short-container-title:ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Zheng Zhedong¹, Zheng Liang², Garrett Michael³, Yang Yi¹, Xu Mingliang⁴, Shen Yi-Dong⁵

Affiliation:

1. University of Technology Sydney, Ultimo NSW, Australia

2. The Australian National University, Australia

3. CingleVue International Australia and Edith Cowan University, Joondalup WA, Australia

4. Zhengzhou University, Zhengzhou, Henan, China

5. State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, Beijing, China

Abstract

Matching images and sentences demands a fine understanding of both modalities. In this article, we propose a new system to discriminatively embed the image and text into a shared visual-textual space. In this field, most existing works apply the ranking loss to pull the positive image/text pairs close and push the negative pairs apart from each other. However, directly deploying the ranking loss on heterogeneous features (i.e., text and image features) is less effective, because it is hard to find appropriate triplets at the beginning. So the naive way of using the ranking loss may prevent the network from learning the inter-modal relationship. To address this problem, we propose the instance loss, which explicitly considers the intra-modal data distribution. It is based on an unsupervised assumption that each image/text group can be viewed as a class. So the network can learn the fine granularity from every image/text group. Experiments show that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. Besides, existing works usually apply off-the-shelf features, i.e., word2vec and fixed visual features. So, as a minor contribution, this article constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to learn directly from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language-based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
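
The two objectives named in the abstract can be illustrated with a small sketch. This is not the authors' released code. It assumes that the instance loss is a softmax cross-entropy in which each image/text group has its own class index, that a single classifier (`shared_weights`, a hypothetical name) scores both modalities, and that the ranking loss is a hinge on cosine similarity with an illustrative margin of 0.2. None of these details are stated in the abstract.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax_cross_entropy(logits, label):
    """Cross-entropy of softmax(logits) against an integer class label."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def instance_loss(image_feat, text_feat, shared_weights, group_id):
    """Treat each image/text group as one class (assumed formulation).

    Both modalities are scored by the same classifier rows, so an image
    and its sentences are pushed toward the same class index.
    """
    image_logits = [dot(w, image_feat) for w in shared_weights]
    text_logits = [dot(w, text_feat) for w in shared_weights]
    return (softmax_cross_entropy(image_logits, group_id)
            + softmax_cross_entropy(text_logits, group_id))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))


def ranking_loss(anchor, positive, negative, margin=0.2):
    """Hinge ranking loss: the positive should beat the negative by `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


# Toy data: two groups, 2-D embeddings, one classifier row per group.
shared_weights = [[1.0, 0.0], [0.0, 1.0]]
image_feat = [1.0, 0.0]
text_feat = [1.0, 0.0]

loss_right_group = instance_loss(image_feat, text_feat, shared_weights, 0)
loss_wrong_group = instance_loss(image_feat, text_feat, shared_weights, 1)
easy_triplet = ranking_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
hard_triplet = ranking_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In this toy setup the instance loss is lower when the features agree with their own group's classifier row, and the ranking loss is zero once the positive pair already beats the negative pair by the margin. A training schedule consistent with the abstract would optimize the instance loss first and add the ranking loss afterwards, although the exact schedule is not given here.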

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications, Hardware and Architecture

Link

https://dl.acm.org/doi/pdf/10.1145/3383184

References: 85 articles.

1. Learning Aligned Cross-Modal Representations from Weakly Aligned Data

2. Event Extraction via Dynamic Multi-Pooling Convolutional Neural Networks

3. Paying More Attention to Saliency

4. Discriminative Dictionary Learning With Common Label Alignment for Cross-Modal Retrieval

Cited by 287 articles.

1. GADNet: Improving image–text matching via graph-based aggregation and disentanglement;Pattern Recognition;2025-01

2. Cross-modal semantic aligning and neighbor-aware completing for robust text–image person retrieval;Information Fusion;2024-12

3. Full-view salient feature mining and alignment for text-based person search;Expert Systems with Applications;2024-10

4. Enhanced taxonomic identification of fusulinid fossils through image–text integration using transformer;Computers & Geosciences;2024-10

5. Parallel weight control based on policy gradient of relation refinement for cross-modal retrieval;Engineering Applications of Artificial Intelligence;2024-10