Soft Contrastive Cross-Modal Retrieval
Published: 2024-02-27
Issue: 5
Volume: 14
Page: 1944
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences
Authors:
Song Jiayu 1, Hu Yuxuan 1, Zhu Lei 2, Zhang Chengyuan 3, Zhang Jian 1, Zhang Shichao 1
Affiliations:
1. School of Computer Science and Engineering, Central South University, Changsha 410083, China
2. College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
3. College of Computer Science and Electronic Engineering, Hunan University, Changsha 410082, China
Abstract
Cross-modal retrieval plays a key role in Natural Language Processing; it aims to efficiently retrieve items of one modality using a query from another. Despite the notable achievements of existing cross-modal retrieval methods, the complexity of the embedding space grows with more complex models, leading to less interpretable and potentially overfitted representations. Moreover, most existing methods achieve strong results only on clean, noise-free datasets, an idealized setting that yields trained models lacking robustness. To address these problems, in this paper we propose a novel approach, Soft Contrastive Cross-Modal Retrieval (SCCMR), which integrates a deep cross-modal model with soft contrastive learning and smoothed-label cross-entropy learning to improve common-subspace embedding and to strengthen the generalizability and robustness of the model. To validate the performance and effectiveness of SCCMR, we conduct extensive experiments against 12 state-of-the-art methods on three multi-modal datasets, using image–text retrieval as a showcase. The experimental results show that our proposed method outperforms the baselines.
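The abstract describes combining soft contrastive learning with smoothed-label cross-entropy for common-subspace embedding. As a rough illustration only, the minimal Python/PyTorch sketch below shows one common way to soften a symmetric image–text contrastive objective with label smoothing; the function name soft_contrastive_loss, the temperature and smoothing defaults, and the symmetric two-direction design are assumptions for illustration, not the authors' actual SCCMR formulation.

import torch
import torch.nn.functional as F

def soft_contrastive_loss(img_emb, txt_emb, temperature=0.07, smoothing=0.1):
    # Hypothetical sketch of a soft (label-smoothed) contrastive loss for
    # image-text retrieval; not the SCCMR authors' implementation.
    # img_emb, txt_emb: (B, D) embeddings of B paired images and texts, B > 1.
    img_emb = F.normalize(img_emb, dim=-1)  # unit-norm so dot products are cosines
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B) similarity matrix

    n = logits.size(0)
    # Soft targets: matched pairs get 1 - smoothing instead of a hard 1;
    # the remaining probability mass is spread uniformly over mismatched pairs.
    targets = torch.full_like(logits, smoothing / (n - 1))
    targets.fill_diagonal_(1.0 - smoothing)

    # Symmetric smoothed cross-entropy over both retrieval directions.
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

With smoothing = 0 this reduces to a standard symmetric InfoNCE objective; the softened targets trade a small amount of fit on clean pairs for tolerance to mislabeled or noisy pairs, which matches the robustness motivation stated in the abstract.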
Funders:
National Natural Science Foundation of China; Natural Science Foundation of Hunan Province; Scientific Research Project of Hunan Provincial Department of Education