Straddling Coarse And Fine Granularity: Mixing Auxiliary Cross-Modal Image-Text Retrieval

Published: 2024-07-01

Author:

Lu Zheng¹

Affiliation:

1. Sichuan University

Abstract

In the era of multimedia big data, cross-modal retrieval has become an increasingly important research topic. This paper proposes a novel approach, the "Mixing Auxiliary Cross-Modal Embedding method" (MACME). It straddles the coarse granularity of global approaches and the fine granularity of local approaches, and aims to bridge the gap between the image and text modalities. Our method creates two new representations, IMAGEMIX and TEXTMIX, which are generated by replacing image regions with semantically similar text tokens and vice versa. Through extensive experiments on benchmark datasets, we demonstrate that MACME significantly improves retrieval accuracy over state-of-the-art methods. The source code and pre-trained models are available at https://github.com/nulixuesuanfa/MACME.
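The abstract describes the mixed representations only at a high level. The sketch below illustrates one plausible reading of the idea in plain Python. It is not the MACME implementation (the linked repository is the authority for that). It assumes that image-region features and text-token embeddings have already been projected into a shared embedding space of the same dimension, uses cosine similarity as the measure of "semantically similar", and replaces a fixed fraction `ratio` of the most confidently matched elements. The names `mix_sequence`, `image_mix`, `text_mix`, and `ratio` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must share the same (assumed common) embedding dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mix_sequence(source: Sequence[Vector], other: Sequence[Vector], ratio: float) -> List[Vector]:
    """Hypothetical mixing rule: for the `ratio` fraction of `source` elements whose
    best match in `other` is most similar, substitute that best match."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    mixed = [list(v) for v in source]  # copy, so the caller's features are untouched
    if not other:
        return mixed
    # For every source element, record (similarity, source index, best-match index).
    matches = []
    for i, s in enumerate(source):
        scores = [cosine(s, o) for o in other]
        j = max(range(len(other)), key=scores.__getitem__)
        matches.append((scores[j], i, j))
    # Replace the k most confident matches with their cross-modal counterparts.
    k = round(ratio * len(source))
    for _, i, j in sorted(matches, reverse=True)[:k]:
        mixed[i] = list(other[j])
    return mixed


def image_mix(regions: Sequence[Vector], tokens: Sequence[Vector], ratio: float = 0.25) -> List[Vector]:
    """Hypothetical IMAGEMIX: image regions, some replaced by similar text tokens."""
    return mix_sequence(regions, tokens, ratio)


def text_mix(tokens: Sequence[Vector], regions: Sequence[Vector], ratio: float = 0.25) -> List[Vector]:
    """Hypothetical TEXTMIX: text tokens, some replaced by similar image regions."""
    return mix_sequence(tokens, regions, ratio)


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]  # toy region features in a shared 2-D space
    tokens = [[0.0, 2.0]]               # toy token embedding in the same space
    print(image_mix(regions, tokens, ratio=0.5))  # [[1.0, 0.0], [0.0, 2.0]]
    print(text_mix(tokens, regions, ratio=1.0))   # [[0.0, 1.0]]
```

Replacing only the most confident matches is one simple way to keep the substituted elements semantically aligned with their neighbours; the actual selection rule, replacement ratio, and feature extractors used by MACME are not specified in this abstract.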

Publisher

Springer Science and Business Media LLC
