Breaking Through the Noisy Correspondence: A Robust Model for Image-Text Matching

Published: 2024-08-19 Issue: 6 Volume: 42 Page: 1-26
ISSN: 1046-8188
Container-title: ACM Transactions on Information Systems
Language: en
Short-container-title: ACM Trans. Inf. Syst.

Author:

Shi Haitao¹, Liu Meng², Mu Xiaoxuan¹, Song Xuemeng³, Hu Yupeng¹, Nie Liqiang⁴

Affiliation:

1. Shandong University, Jinan, China

2. Shandong Jianzhu University, Jinan, China

3. Shandong University, Qingdao, China

4. Harbin Institute of Technology (Shenzhen), Shenzhen, China

Abstract

Noisy correspondence hampers the use of image-text matching in real-world applications. Manually curating high-quality datasets is expensive and time-consuming, and datasets generated with diffusion models are not sufficiently well aligned. The most promising approach is to collect image-text pairs from the Internet, but this inevitably introduces noisy correspondence. To reduce its negative impact, we propose a novel model that first transforms the noisy-correspondence filtering problem into a similarity-distribution modeling problem by exploiting the capabilities of pre-trained models. Specifically, we fit a Gaussian mixture model to the similarities computed by CLIP, modeling them as a clean distribution and a noisy distribution, and use it to filter out most of the noisy correspondence in the dataset. We then fine-tune the model on the relatively clean data. To further reduce the negative impact of the noisy correspondence that remains unfiltered, i.e., the small portion where the two distributions overlap, we propose a distribution-sensitive dynamic margin ranking loss for the fine-tuning process, which further increases the distance between the two distributions. Through continued iteration, the noisy correspondence gradually decreases and model performance gradually improves. Extensive experiments demonstrate the effectiveness and robustness of our model, even under high noise rates.
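
The filtering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the similarity scores below are synthetic stand-ins for CLIP image-text similarities, the function names (`fit_two_component_gmm`, `clean_posterior`) are hypothetical, and the 0.5 posterior threshold is an assumption made for illustration. The sketch fits a two-component one-dimensional Gaussian mixture with plain EM and keeps the pairs that the higher-mean ("clean") component most likely generated.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of a 1-D normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(scores, n_iter=100, min_var=1e-6):
    """Fit a two-component 1-D Gaussian mixture to `scores` with plain EM.

    Returns (weights, means, variances). Index 1 is the higher-mean
    component, which this sketch treats as the "clean" distribution.
    """
    n = len(scores)
    lo, hi = min(scores), max(scores)
    overall_mean = sum(scores) / n
    overall_var = sum((s - overall_mean) ** 2 for s in scores) / n
    means = [lo + 0.25 * (hi - lo), lo + 0.75 * (hi - lo)]
    variances = [max(overall_var, min_var)] * 2
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: responsibility of each component for each score.
        resp = []
        for s in scores:
            p = [weights[k] * gaussian_pdf(s, means[k], variances[k]) for k in range(2)]
            total = (p[0] + p[1]) or 1e-300  # guard against underflow
            resp.append((p[0] / total, p[1] / total))
        # M-step: re-estimate weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp) or 1e-300
            weights[k] = nk / n
            means[k] = sum(r[k] * s for r, s in zip(resp, scores)) / nk
            var_k = sum(r[k] * (s - means[k]) ** 2 for r, s in zip(resp, scores)) / nk
            variances[k] = max(var_k, min_var)

    if means[0] > means[1]:  # keep the higher-mean component at index 1
        weights.reverse()
        means.reverse()
        variances.reverse()
    return weights, means, variances


def clean_posterior(score, weights, means, variances):
    """Posterior probability that `score` came from the higher-mean component."""
    p = [weights[k] * gaussian_pdf(score, means[k], variances[k]) for k in range(2)]
    total = (p[0] + p[1]) or 1e-300
    return p[1] / total


# Synthetic similarities (assumed values, not real CLIP outputs):
# mismatched pairs cluster near 0.15, matched pairs near 0.32.
random.seed(0)
pairs = [(random.gauss(0.15, 0.04), False) for _ in range(400)]
pairs += [(random.gauss(0.32, 0.04), True) for _ in range(600)]
scores = [s for s, _ in pairs]

weights, means, variances = fit_two_component_gmm(scores)

# Keep pairs whose clean posterior exceeds an assumed threshold of 0.5.
kept = [(s, is_clean) for s, is_clean in pairs
        if clean_posterior(s, weights, means, variances) > 0.5]
precision = sum(1 for _, is_clean in kept if is_clean) / len(kept)
print(f"component means: {means[0]:.3f}, {means[1]:.3f}")
print(f"kept {len(kept)} of {len(pairs)} pairs; fraction truly clean: {precision:.3f}")
```

In this sketch, pairs that fall where the two fitted components overlap are the ones most likely to be misclassified. That overlap is the residual noise the abstract says the dynamic margin ranking loss is meant to address during fine-tuning.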

Funder

National Natural Science Foundation of China

Shandong Provincial Natural Science Foundation

Science and Technology Innovation Program for Distinguished Young Scholars of Shandong Province Higher Education Institutions

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/3662732

References (61 articles)

1. Eric Arazo, Diego Ortego, Paul Albert, Noel E. O’Connor, and Kevin McGuinness. 2019. Unsupervised Label Noise Modeling and Loss Correction. In Proceedings of the International Conference on Machine Learning, Vol. 97. 312–321.

2. Global Relation-Aware Attention Network for Image-Text Retrieval

3. IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval

4. Cross-modal Graph Matching Network for Image-text Retrieval

5. Junyoung Chung, Caglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv:1412.3555. Retrieved from https://arxiv.org/abs/1412.3555