Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation

Published: 2021-08
Container-title: Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence

Author:

Kim Taehyeon¹, Oh Jaehoon², Kim Nak Yil¹, Cho Sangwook¹, Yun Se-Young¹

Affiliation:

1. Graduate School of Artificial Intelligence, KAIST

2. Graduate School of Knowledge Service Engineering, KAIST

Abstract

Knowledge distillation (KD), which transfers knowledge from a cumbersome teacher model to a lightweight student model, has been investigated as a way to design efficient neural architectures. The KD objective is generally the Kullback-Leibler (KL) divergence loss between the softened probability distributions of the teacher and student models, with a temperature-scaling hyperparameter τ. Despite its widespread use, few studies have discussed how such softening influences generalization. Here, we show theoretically that the KL divergence loss focuses on logit matching as τ increases and on label matching as τ goes to 0, and we show empirically that logit matching is, in general, positively correlated with performance improvement. From this observation, we consider an intuitive KD loss function, the mean squared error (MSE) between the logit vectors, so that the student model can directly learn the logits of the teacher model. The MSE loss outperforms the KL divergence loss, which we explain by the difference in penultimate-layer representations between the two losses. Furthermore, we show that sequential distillation can improve performance and that KD, particularly with the KL divergence loss and a small τ, mitigates label noise. The code to reproduce the experiments is publicly available at https://github.com/jhoon-oh/kd_data/.
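The two losses contrasted in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation (their code is at the URL above). The function names, the example logits, and the τ² scaling of the KL term (a common convention in KD) are assumptions made here for illustration. The helper `centered_logit_gap` is likewise hypothetical: it computes the quantity that τ² · KL approaches for large τ under a second-order expansion of the softmax, which is one way to see the "logit matching" behaviour described above.

```python
import math


def log_softmax(logits, tau=1.0):
    """Log of the temperature-scaled softmax, computed stably."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max before exponentiating
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_norm for s in scaled]


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax; larger tau gives a softer distribution."""
    return [math.exp(lp) for lp in log_softmax(logits, tau)]


def kd_kl_loss(teacher_logits, student_logits, tau):
    """tau^2 * KL(teacher || student) on softened distributions.

    The tau^2 factor is an assumed convention, not taken from the abstract.
    """
    log_pt = log_softmax(teacher_logits, tau)
    log_ps = log_softmax(student_logits, tau)
    kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_pt, log_ps))
    return tau ** 2 * kl


def logit_mse_loss(teacher_logits, student_logits):
    """Mean squared error taken directly between the two logit vectors."""
    diffs = [t - s for t, s in zip(teacher_logits, student_logits)]
    return sum(d * d for d in diffs) / len(diffs)


def centered_logit_gap(teacher_logits, student_logits):
    """(1 / 2K) * sum of squared, mean-centered logit differences.

    Hypothetical helper: the large-tau limit of kd_kl_loss in this sketch.
    """
    diffs = [t - s for t, s in zip(teacher_logits, student_logits)]
    mean = sum(diffs) / len(diffs)
    return sum((d - mean) ** 2 for d in diffs) / (2 * len(diffs))


# Illustrative logits only; they do not come from the paper.
teacher = [2.0, 1.0, 0.1]
student = [1.5, 0.5, 1.0]

for tau in (1.0, 10.0, 1000.0):
    print(f"tau={tau}: kd_kl_loss={kd_kl_loss(teacher, student, tau):.6f}")
print("large-tau limit:", centered_logit_gap(teacher, student))
print("logit MSE:", logit_mse_loss(teacher, student))
print("teacher soft targets at tau=0.01:", softmax(teacher, 0.01))
```

In this sketch, raising τ moves the scaled KL loss toward a squared distance between mean-centered logits, while a very small τ turns the teacher's soft targets into a near one-hot vector at its top class, so the student is effectively trained on the teacher's predicted label. The plain logit MSE differs from the large-τ KL limit in that it also penalizes a constant offset between the two logit vectors.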

Publisher

International Joint Conferences on Artificial Intelligence Organization

Cited by 74 articles.

1. A comprehensive overview of graph neural network-based approaches to clustering for spatial transcriptomics;Computational and Structural Biotechnology Journal;2024-12

2. SeDPGK: Semi-supervised software defect prediction with graph representation learning and knowledge distillation;Information and Software Technology;2024-10

3. Backward induction-based deep image search;PLOS ONE;2024-09-09

4. PanDa: Prompt Transfer Meets Knowledge Distillation for Efficient Model Adaptation;IEEE Transactions on Knowledge and Data Engineering;2024-09

5. Lightweight Brain Tumor Diagnosis via Knowledge Distillation;2024 International Conference on Multimedia Analysis and Pattern Recognition (MAPR);2024-08-15