SPIRIT: Style-guided Patch Interaction for Fashion Image Retrieval with Text Feedback

Published: 2024-03-08
Volume: 20
Issue: 6
Pages: 1-17
ISSN: 1551-6857
Container-title: ACM Transactions on Multimedia Computing, Communications, and Applications
Language: en
Short-container-title: ACM Trans. Multimedia Comput. Commun. Appl.

Author:

Chen Yanzhe¹, Zhou Jiahuan¹, Peng Yuxin²

Affiliation:

1. Wangxuan Institute of Computer Technology, Peking University, Beijing, China

2. Wangxuan Institute of Computer Technology, Peking University, Beijing, China and Peng Cheng Laboratory, China

Abstract

Fashion image retrieval with text feedback aims to find the target image according to the reference image and the modification from the user. This is a challenging task, as it requires not only the synergistic understanding of both visual and textual modalities but also the ability to model a wide variety of styles that fashion images contain. Hence, the crucial aspect of addressing this problem lies in exploiting the abundant semantic information inherent in fashion images and correlating it with the textual description of style. Recognizing that style is generally situated at the local level, we explicitly define style as the commonalities and differences between local areas of fashion images. Building upon this, we propose a Style-guided Patch InteRaction approach for fashion Image retrieval with Text feedback (SPIRIT), which focuses on the decisive influence of local details of fashion images on their style. Three networks are designed accordingly. The Patch-level Style Commonality network is introduced to fully leverage the semantic information among patches and compute their average as the style commonality. Subsequently, the Patch-level Style Difference network employs a graph reasoning network to model the patch-level difference and filter out insignificant patches. By considering the above two networks, mutual information about style is obtained from the interaction between patches. Finally, the Visual Textual Fusion network is utilized to integrate visual features with rich semantic information and textual features. Experimental results on four benchmark datasets demonstrate that our proposed SPIRIT achieves state-of-the-art performance. Source code is available at https://github.com/PKU-ICST-MIPL/SPIRIT_TOMM2024.
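The abstract describes the pipeline only at a high level, so the following pure-Python sketch is a rough illustration of the data flow and not the authors' implementation (that lives in the linked repository). It assumes patch features are plain float vectors. All function names here (`mean_vector`, `top_k_distinct_patches`, `compose_query`, `rank_candidates`) are hypothetical. Only the averaging step follows the abstract directly. Euclidean distance to the mean stands in for the graph reasoning network, an element-wise sum stands in for the Visual Textual Fusion network, and ranking by cosine similarity is likewise an assumption.

```python
import math


def mean_vector(patches):
    """Element-wise average of patch features (the 'style commonality')."""
    n = len(patches)
    dim = len(patches[0])
    return [sum(p[i] for p in patches) / n for i in range(dim)]


def top_k_distinct_patches(patches, commonality, k):
    """Return indices of the k patches farthest from the commonality vector.

    Assumption: distance to the mean is a stand-in for the paper's graph
    reasoning network, which is not reproduced here.
    """
    def dist(p):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, commonality)))

    ranked = sorted(range(len(patches)), key=lambda i: dist(patches[i]),
                    reverse=True)
    return sorted(ranked[:k])


def compose_query(visual_feature, text_feature):
    """Hypothetical fusion: element-wise sum of visual and text features."""
    return [a + b for a, b in zip(visual_feature, text_feature)]


def cosine(u, v):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_candidates(query, candidates):
    """Sort candidate names by cosine similarity to the query, best first."""
    return sorted(candidates, key=lambda name: cosine(query, candidates[name]),
                  reverse=True)


# Toy data: four 2-D patch features from a reference image.
patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
commonality = mean_vector(patches)
kept = top_k_distinct_patches(patches, commonality, k=2)
query = compose_query(commonality, [0.0, 1.0])  # toy text feature
ranking = rank_candidates(query, {"a": [1.0, 2.1], "b": [2.0, 0.1]})
print(commonality, kept, ranking)
```

In the toy data, patches 2 and 3 deviate most from the average, so they are the ones a difference-based filter would keep. A learned model would replace each of these hand-written steps with trained networks.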

Funder

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3640345

References: 64 articles.

1. Image2StyleGAN: How to Embed Images Into the StyleGAN Latent Space?

2. Compositional Learning of Image-Text Query for Image Retrieval

3. Alberto Baldrati, Lorenzo Agnolucci, Marco Bertini, and Alberto Del Bimbo. 2023. Zero-shot composed image retrieval with textual inversion. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 15338–15347.

4. Alberto Baldrati, Marco Bertini, Tiberio Uricchio, and Alberto Del Bimbo. 2022. Conditioned and composed image retrieval combining and partially fine-tuning CLIP-based features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4959–4968.

5. Effective conditioned and composed image retrieval combining CLIP-based features

Cited by 1 article.

1. Simple but Effective Raw-Data Level Multimodal Fusion for Composed Image Retrieval. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2024-07-10.