The Contemporary Art of Image Search: Iterative User Intent Expansion via Vision-Language Model

Published: 2024-04-17 Issue: CSCW1 Volume: 8 Pages: 1-31
ISSN: 2573-0142
Container title: Proceedings of the ACM on Human-Computer Interaction
Language: en
Short container title: Proc. ACM Hum.-Comput. Interact.

Author:

Ye Yilin¹, Zhu Qian², Xiao Shishi³, Zhang Kang¹, Zeng Wei¹

Affiliation:

1. The Hong Kong University of Science and Technology (Guangzhou) & The Hong Kong University of Science and Technology, Guangzhou, China

2. The Hong Kong University of Science and Technology, Hong Kong SAR, China

3. The Hong Kong University of Science and Technology (Guangzhou), Guangzhou, China

Abstract

Image search is an essential and user-friendly way to explore vast galleries of digital images. However, existing image search methods rely heavily on proximity measures such as tag matching or image similarity, and therefore require precise user inputs to return satisfactory results. To meet the growing demand for a contemporary image search engine that accurately comprehends users' search intentions, we introduce a user intent expansion framework. The framework leverages vision-language models to parse and compose multi-modal user inputs and so provide more accurate and satisfying results. It comprises two stages: 1) a parsing stage, which combines a language parsing module that uses large language models to improve comprehension of textual inputs with a visual parsing module that integrates interactive segmentation to quickly identify detailed visual elements within images; and 2) a logic composition stage, which combines multiple user search intents into a unified logic expression to support more sophisticated operations in complex search scenarios. Moreover, the framework lets users interact flexibly and in context with the search results to iteratively specify or adjust their search intents. We implemented the framework in an image search system for NFT (non-fungible token) search and conducted a user study to evaluate its usability and novel properties. The results indicate that the proposed framework significantly improves users' image search experience. In particular, the parsing and contextualized interactions help users express their search intents more accurately and make iterative search more enjoyable.
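
The logic composition stage described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, purely for illustration, that each image has already been reduced to a set of tags by some parsing step, and the names Intent, has, and search are hypothetical. It only shows how several individual intents might be combined into one logic expression with AND, OR, and NOT.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Set

# Assumption for illustration: an image is represented only by its set of tags.
Image = FrozenSet[str]


@dataclass(frozen=True)
class Intent:
    """One search intent: a predicate over an image plus a readable label."""

    test: Callable[[Image], bool]
    label: str

    def __and__(self, other: Intent) -> Intent:
        # Both intents must hold.
        return Intent(lambda img: self.test(img) and other.test(img),
                      f"({self.label} AND {other.label})")

    def __or__(self, other: Intent) -> Intent:
        # At least one intent must hold.
        return Intent(lambda img: self.test(img) or other.test(img),
                      f"({self.label} OR {other.label})")

    def __invert__(self) -> Intent:
        # The intent must not hold.
        return Intent(lambda img: not self.test(img), f"(NOT {self.label})")


def has(tag: str) -> Intent:
    """Hypothetical atomic intent: the image carries the given tag."""
    return Intent(lambda img: tag in img, tag)


def search(gallery: Dict[str, Set[str]], intent: Intent) -> List[str]:
    """Return the names of all images that satisfy the composed intent."""
    return [name for name, tags in gallery.items()
            if intent.test(frozenset(tags))]


# Toy gallery with made-up tags.
gallery = {
    "a": {"cat", "hat", "blue"},
    "b": {"cat", "red"},
    "c": {"dog", "hat"},
}

# "A cat with a hat or something red, but nothing blue."
query = has("cat") & (has("hat") | has("red")) & ~has("blue")

print(query.label)
print(search(gallery, query))
```

Overloading the &, | and ~ operators keeps a composed query readable as a single expression, which mirrors the idea of a unified logic expression; a real system would evaluate such an expression against model-derived features rather than literal tags.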

Funder

Guangzhou Basic and Applied Basic Research Foundation

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3641019
