Revisiting Hard Negative Mining in Contrastive Learning for Visual Understanding
Published: 2023-12-04
Issue: 23
Volume: 12
Page: 4884
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics
Author:
Zhang Hao 1, Li Zheng 1, Yang Jiahui 1, Wang Xin 1, Guo Caili 1, Feng Chunyan 1
Affiliation:
1. Beijing Key Laboratory of Network System Architecture and Convergence, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
Abstract
Efficiently mining and distinguishing hard negatives is key to Contrastive Learning (CL) in various visual understanding tasks. By properly emphasizing the penalty on hard negatives, Hard Negative Mining (HNM) can improve CL performance. However, there is no method to quantitatively analyze the penalty strength applied to hard negatives, which can make training difficult to converge. In this paper, we propose a method for measuring and controlling the penalty strength. We first define a penalty strength metric that provides a quantitative analysis tool for HNM. We then propose a Triplet loss with Penalty Strength Control (T-PSC), which balances the penalty strength on hard negatives against the difficulty of model optimization. To verify the effectiveness of the proposed T-PSC method across modalities, we apply it to two visual understanding tasks: Image–Text Retrieval (ITR) for multi-modal processing, and Temporal Action Localization (TAL) for video processing. T-PSC can be applied to existing ITR and TAL models in a plug-and-play manner without modifying the models themselves. Experiments with existing models show that reasonable control of the penalty strength can speed up training and improve performance on these higher-level tasks.
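The abstract does not give the exact T-PSC formulation. As a rough illustration only, the sketch below shows a generic triplet-style loss in which the penalty applied to margin-violating (hard) negatives is scaled by an explicit strength parameter; the names `margin` and `penalty_strength` are assumed, illustrative hyperparameters, not the paper's notation.

```python
# Illustrative sketch (assumed formulation, not the paper's exact T-PSC loss):
# a triplet loss over a batch of negatives in which hard negatives, i.e. those
# violating the margin, receive an extra penalty weight.
import torch
import torch.nn.functional as F


def weighted_triplet_loss(anchor, positive, negatives,
                          margin=0.2, penalty_strength=2.0):
    """anchor, positive: (B, D) embeddings; negatives: (B, K, D) embeddings."""
    pos_sim = F.cosine_similarity(anchor, positive, dim=-1)                # (B,)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1)  # (B, K)

    # Margin violation per negative; positive values indicate hard negatives.
    violation = (margin + neg_sim - pos_sim.unsqueeze(1)).clamp(min=0)

    # Scale the penalty on hard negatives; easy negatives contribute zero loss.
    weight = torch.where(violation > 0,
                         torch.full_like(violation, penalty_strength),
                         torch.ones_like(violation))
    return (weight * violation).mean()
```

In this toy form, `penalty_strength` plays the role of the controllable penalty on hard negatives: larger values emphasize them more strongly, while a value of 1 recovers a plain margin-based triplet loss.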
Funder
Key Program of National Natural Science Foundation of China
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering