Optimal Knowledge Distillation through Non-Heuristic Control of Dark Knowledge

Authors:

Onchis Darian ¹, Istin Codruta ², Samuila Ioan ¹

Affiliations:

1. Department of Computer Science, West University of Timisoara, 300223 Timisoara, Romania

2. Department of Computer and Information Technology, Politehnica University of Timisoara, 300006 Timisoara, Romania

Abstract

In this paper, a method is introduced to control the dark knowledge values, also known as soft targets, with the purpose of improving training by knowledge distillation for multi-class classification tasks. Knowledge distillation effectively transfers knowledge from a larger model to a smaller model to achieve efficient, fast, and generalizable performance while retaining much of the original accuracy. The majority of deep neural models used for classification tasks append a SoftMax layer to generate output probabilities, and it is usual to take the highest score as the model's inference while the remaining probability values are generally ignored. The focus is on those probabilities as carriers of dark knowledge, and our aim is to quantify the relevance of dark knowledge not heuristically, as done in the literature so far, but with an inductive proof on the SoftMax operational limits. These limits are further pushed by using an incremental decision tree with an information-gain split. The user can set a desired precision and accuracy level to obtain a maximal temperature setting for a continual classification process. Moreover, by fitting both the hard targets and the soft targets, one obtains an optimal knowledge distillation effect that better mitigates catastrophic forgetting. The strengths of our method lie in the ability to control the amount of distillation transferred non-heuristically and in the agnostic application of this model-independent study.
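To make the mechanism referenced in the abstract concrete, below is a minimal sketch, assuming PyTorch, of the standard temperature-scaled distillation loss of Hinton et al. (2015), in which the student fits both hard targets and temperature-softened soft targets. The function name and the temperature and alpha values are illustrative assumptions, not values from this paper, whose contribution is precisely to select the maximal temperature non-heuristically rather than by such fixed choices.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, hard_labels,
                          temperature=4.0, alpha=0.5):
        # Soft targets: the teacher's temperature-softened probabilities
        # carry the "dark knowledge" held in the non-argmax classes.
        soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
        log_student = F.log_softmax(student_logits / temperature, dim=-1)
        # The T^2 factor keeps soft-target gradient magnitudes comparable
        # across temperature settings (Hinton et al., 2015).
        soft_loss = F.kl_div(log_student, soft_targets,
                             reduction="batchmean") * temperature ** 2
        # Hard targets: ordinary cross-entropy on the ground-truth labels.
        hard_loss = F.cross_entropy(student_logits, hard_labels)
        # Fitting both terms, as the abstract describes, yields the
        # combined distillation objective.
        return alpha * soft_loss + (1.0 - alpha) * hard_loss

A higher temperature flattens the teacher's distribution and gives more weight to the secondary class probabilities; the paper's method replaces the heuristic choice of that temperature with a bound derived from the SoftMax operational limits.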

Publisher

MDPI AG

