Image–Text Cross-Modal Retrieval with Instance Contrastive Embedding
Published: 2024-01-09
Volume: 13
Issue: 2
Page: 300
ISSN: 2079-9292
Container-title: Electronics
Short-container-title: Electronics
Language: en
Author:
Zeng Ruigeng 1, Ma Wentao 2, Wu Xiaoqian 2, Liu Wei 3, Liu Jie 1
Affiliation:
1. Science and Technology on Parallel and Distributed Processing Laboratory, National University of Defense Technology, Changsha 410073, China
2. School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei 230036, China
3. School of Management Science and Engineering, Anhui University of Finance & Economics, Bengbu 233030, China
Abstract
Image–text cross-modal retrieval aims to bridge the semantic gap between different modalities, allowing images to be retrieved from textual descriptions and vice versa. Existing efforts in this field concentrate on coarse-grained feature representation and then apply a pairwise ranking loss to pull positive image–text pairs closer while pushing negative ones apart. However, applying pairwise ranking loss directly to coarse-grained representations is unreliable, as it disregards fine-grained information and thus struggles to narrow the semantic gap between image and text. To this end, we propose an Instance Contrastive Embedding (IConE) method for image–text cross-modal retrieval. Specifically, we first transfer a multi-modal pre-training model to the cross-modal retrieval task to leverage the interactive information between image and text, thereby enhancing the model’s representational capabilities. Then, to comprehensively account for the feature distributions within and across modalities, we design a novel two-stage training strategy that combines instance loss and contrastive loss, dedicated to extracting fine-grained representations within instances and bridging the semantic gap between modalities. Extensive experiments on two public benchmark datasets, Flickr30k and MS-COCO, demonstrate that our IConE outperforms several state-of-the-art (SoTA) baseline methods and achieves competitive performance.
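The abstract does not give the loss formulas, so the following is only an illustrative sketch of how an instance loss (cross-entropy treating each image–text instance as its own class) might be combined with a symmetric InfoNCE-style contrastive loss over a batch of embeddings. All function names, the temperature `tau`, and the weighting `lam` are assumptions, not IConE's actual implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE: matched (i, i) image-text pairs are positives,
    all other pairings in the batch serve as negatives."""
    n = len(img_emb)
    sims = [[cosine(img_emb[i], txt_emb[j]) / tau for j in range(n)]
            for i in range(n)]
    # image-to-text direction: row i should peak at column i
    i2t = -sum(math.log(softmax(sims[i])[i]) for i in range(n)) / n
    # text-to-image direction: column j should peak at row j
    t2i = -sum(math.log(softmax([sims[i][j] for i in range(n)])[j])
               for j in range(n)) / n
    return (i2t + t2i) / 2

def instance_loss(logits, instance_ids):
    """Cross-entropy where each image-text instance is its own class,
    encouraging fine-grained, instance-discriminative features."""
    n = len(logits)
    return -sum(math.log(softmax(logits[k])[instance_ids[k]])
                for k in range(n)) / n

def combined_loss(img_emb, txt_emb, logits, instance_ids, lam=1.0, tau=0.07):
    """Hypothetical weighted sum of the two objectives (lam is illustrative)."""
    return contrastive_loss(img_emb, txt_emb, tau) + lam * instance_loss(logits, instance_ids)
```

For example, with perfectly aligned embeddings `imgs = txts = [[1, 0], [0, 1]]` the contrastive loss is near zero, and it grows when the pairings are shuffled, which is the behavior the two-stage training relies on.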
Funder
National Natural Science Foundation of China; National Key Research and Development Program of China
Subject
Electrical and Electronic Engineering; Computer Networks and Communications; Hardware and Architecture; Signal Processing; Control and Systems Engineering
References: 49 articles