Bridging Modalities: A Survey of Cross-Modal Image-Text Retrieval

Published: 2024-06-12 Issue: 1 Volume: 1 Pages: 79-92
ISSN: 2998-3371
Container title: Chinese Journal of Information Fusion
Language: en
Short container title: Chin. j. inf. fusion

Author:

Li Tieying¹, Kong Lingdu¹, Yang Xiaochun¹, Wang Bin¹, Xu Jiaxing²

Affiliation:

1. School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China

2. School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798, Singapore

Abstract

The rapid advancement of Internet technology, driven by social media and e-commerce platforms, has facilitated the generation and sharing of multimodal data, leading to increased interest in efficient cross-modal retrieval systems. Cross-modal image-text retrieval, encompassing tasks such as image query text (IqT) retrieval and text query image (TqI) retrieval, plays a crucial role in semantic searches across modalities. This paper presents a comprehensive survey of cross-modal image-text retrieval, addressing the limitations of previous studies that focused on single perspectives such as subspace learning or deep learning models. We categorize existing models into single-tower, dual-tower, real-value representation, and binary representation models based on their structure and feature representation. Additionally, we explore the impact of multimodal Large Language Models (MLLMs) on cross-modal retrieval. Our study also provides a detailed overview of common datasets, evaluation metrics, and performance comparisons of representative methods. Finally, we identify current challenges and propose future research directions to advance the field of cross-modal image-text retrieval.

Funder

National Natural Science Foundation of China

Publisher

Institute of Emerging and Computer Engineers Inc

Link

https://www.iece.org/filebob/uploads/storage/CJIF_k8UiIafD4xkk4zR.pdf

References (39 articles)

1. Li, J., Li, D., Savarese, S., & Hoi, S. (2023, July). BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning (pp. 19730-19742). PMLR.

2. Zhang, P., Dong, X., Wang, B., Cao, Y., Xu, C., Ouyang, L., Zhao, Z., ... & Wang, J. (2023). InternLM-XComposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112.

3. Zhu, H., Huang, J. H., Rudinac, S., & Kanoulas, E. (2024). Enhancing interactive image retrieval with query rewriting using large language models and vision language models. arXiv preprint arXiv:2404.18746.

4. Li, Y., Wang, W., Qu, L., Nie, L., Li, W., & Chua, T. S. (2024). Generative cross-modal retrieval: Memorizing images in multimodal language models for retrieval and beyond. arXiv preprint arXiv:2402.10805.

5. Levy, M., Ben-Ari, R., Darshan, N., & Lischinski, D. (2024). Chatting makes perfect: Chat-based image retrieval. Advances in Neural Information Processing Systems, 36.