Revisiting Hard Negative Mining in Contrastive Learning for Visual Understanding
Published: 2023-12-04
Issue: 23
Volume: 12
Page: 4884
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics
Author:
Zhang Hao 1, Li Zheng 1, Yang Jiahui 1, Wang Xin 1, Guo Caili 1, Feng Chunyan 1
Affiliation:
1. Beijing Key Laboratory of Network System Architecture and Convergence, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract
Efficiently mining and distinguishing hard negatives is key to Contrastive Learning (CL) in various visual understanding tasks. By properly emphasizing the penalty on hard negatives, Hard Negative Mining (HNM) can improve CL performance. However, there is no method to quantitatively analyze the penalty strength applied to hard negatives, which can make training difficult to converge. In this paper, we propose a method for measuring and controlling the penalty strength. We first define a penalty strength metric that provides a quantitative analysis tool for HNM. We then propose a Triplet loss with Penalty Strength Control (T-PSC), which balances the penalty strength on hard negatives against the difficulty of model optimization. To verify the effectiveness of the proposed T-PSC method across modalities, we apply it to two visual understanding tasks: Image–Text Retrieval (ITR) for multi-modal processing, and Temporal Action Localization (TAL) for video processing. T-PSC can be applied to existing ITR and TAL models in a plug-and-play manner without modifying the models themselves. Experiments with existing models show that reasonable control of the penalty strength can speed up training and improve performance on these higher-level tasks.
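The abstract does not give the exact T-PSC formulation. As a rough illustration only, the sketch below shows a generic triplet-style loss in which the penalty applied to margin-violating (hard) negatives is scaled by an explicit strength parameter; the names `margin` and `penalty_strength` are assumed, illustrative hyperparameters, not the paper's notation.

```python
# Illustrative sketch (assumed formulation, not the paper's exact T-PSC loss):
# a triplet loss over a batch of negatives in which hard negatives, i.e. those
# violating the margin, receive an extra penalty weight.
import torch
import torch.nn.functional as F


def weighted_triplet_loss(anchor, positive, negatives,
                          margin=0.2, penalty_strength=2.0):
    """anchor, positive: (B, D) embeddings; negatives: (B, K, D) embeddings."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)                # (B,)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1)  # (B, K)

    # Margin violation per negative; positive values indicate hard negatives.
    violation = (margin + neg_sim - pos_sim.unsqueeze(1)).clamp(min=0)

    # Scale the penalty on hard negatives; easy negatives contribute zero loss.
    weight = torch.where(violation > 0,
                         torch.full_like(violation, penalty_strength),
                         torch.ones_like(violation))
    return (weight * violation).mean()
```

In this toy form, `penalty_strength` plays the role of the controllable penalty on hard negatives: larger values emphasize them more strongly, while a value of 1 recovers a plain margin-based triplet loss.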
Funder
Key Program of National Natural Science Foundation of China
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering