Abstract
In the era of multimedia big data, cross-modal retrieval has become an increasingly important research topic. This paper proposes a novel approach, the ``Mixing Auxiliary Cross-Modal Embedding'' method (MACME), which combines the coarse granularity of global approaches with the fine granularity of local approaches, aiming to bridge the modality gap between images and text. Our method creates two new representations: IMAGEMIX, generated by replacing image regions with semantically similar text tokens, and TEXTMIX, generated by replacing text tokens with semantically similar image regions. Through extensive experiments on benchmark datasets, we demonstrate that MACME significantly improves retrieval accuracy over state-of-the-art methods. The source code and pre-trained models are available at https://github.com/nulixuesuanfa/MACME.
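For concreteness, the following is a minimal sketch of the kind of cross-modal swapping the abstract describes, assuming PyTorch and image-region and text-token embeddings that already live in a shared space; the function name `mix_embeddings` and the `mix_ratio` parameter are illustrative, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def mix_embeddings(image_regions, text_tokens, mix_ratio=0.25):
    """Hypothetical sketch: build IMAGEMIX by swapping the image regions
    that align most strongly with a text token for that token's embedding,
    and TEXTMIX by the symmetric operation on the text side.

    image_regions: (n_regions, d) region embeddings in a shared space
    text_tokens:   (n_tokens, d)  token embeddings in the same space
    """
    # Cosine similarity between every region and every token.
    sim = F.normalize(image_regions, dim=-1) @ F.normalize(text_tokens, dim=-1).T

    # IMAGEMIX: replace the most alignable regions with their
    # best-matching token embeddings.
    n_swap = max(1, int(mix_ratio * image_regions.size(0)))
    best_sim, best_tok = sim.max(dim=1)       # best token per region
    swap_idx = best_sim.topk(n_swap).indices  # regions with strongest matches
    imagemix = image_regions.clone()
    imagemix[swap_idx] = text_tokens[best_tok[swap_idx]]

    # TEXTMIX: replace the most alignable tokens with their
    # best-matching region embeddings.
    n_swap_t = max(1, int(mix_ratio * text_tokens.size(0)))
    best_sim_t, best_reg = sim.max(dim=0)     # best region per token
    swap_idx_t = best_sim_t.topk(n_swap_t).indices
    textmix = text_tokens.clone()
    textmix[swap_idx_t] = image_regions[best_reg[swap_idx_t]]

    return imagemix, textmix
```

The mixed representations would then be fed to the retrieval model alongside the originals; the actual selection criterion and mixing ratio used by MACME are specified in the paper, and this sketch only illustrates the general region-for-token exchange.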