Cross-Modal Attention Preservation with Self-Contrastive Learning for Composed Query-Based Image Retrieval

Published: 2024-03-08 Issue: 6 Volume: 20 Page: 1-22
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Li Shenshen¹, Xu Xing¹, Jiang Xun¹, Shen Fumin¹, Sun Zhe², Cichocki Andrzej³

Affiliation:

1. School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China

2. Juntendo University, Tokyo, Japan

3. Systems Research Institute of Polish Academy of Science, Warszawa, Poland and Tensor Learning Lab, Riken AIP, Tokyo, Japan

Abstract

In this article, we study the challenging cross-modal image retrieval task, Composed Query-Based Image Retrieval (CQBIR), in which the query is not a single text query but a composed query, i.e., a reference image and a modification text. Compared with the conventional cross-modal image-text retrieval task, CQBIR is more challenging, as it requires properly preserving and modifying specific image regions according to the multi-level semantic information learned from the multi-modal query. Most recent works focus on extracting preserved and modified information and composing it into a unified representation. However, we observe that the preserved regions learned by existing methods contain redundant modified information, inevitably degrading the overall retrieval performance. To this end, we propose a novel method termed Cross-Modal Attention Preservation (CMAP). Specifically, we first leverage cross-level interaction to fully account for multi-granular semantic information, which aims to supplement the high-level semantics for effective image retrieval. Furthermore, different from conventional contrastive learning, our method introduces self-contrastive learning into learning preserved information, to prevent the model from confusing the attention for the preserved part with that for the modified part. Extensive experiments on three widely used CQBIR datasets, i.e., FashionIQ, Shoes, and Fashion200k, demonstrate that our proposed CMAP method significantly outperforms the current state-of-the-art methods on all the datasets. The anonymous implementation code of our CMAP method is available at https://github.com/CFM-MSG/Code_CMAP.

Funder

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3639469

References (52 articles)

1. Compositional Learning of Image-Text Query for Image Retrieval

2. Effective conditioned and composed image retrieval combining CLIP-based features

3. Automatic Attribute Discovery and Characterization from Noisy Web Data

4. Image Retrieval for Complex Queries Using Knowledge Embedding

5. Image Search With Text Feedback by Visiolinguistic Attention Learning