DE-MKD: Decoupled Multi-Teacher Knowledge Distillation Based on Entropy
Published: 2024-05-27
Issue: 11
Volume: 12
Page: 1672
ISSN: 2227-7390
Container-title: Mathematics
Language: en
Author:
Cheng Xin (1), Zhang Zhiqiang (2), Weng Wei (3), Yu Wenxin (2), Zhou Jinjia (1)
Affiliation:
1. Graduate School of Science and Engineering, Hosei University, Tokyo 184-8584, Japan
2. School of Science and Technology, Southwest University of Science and Technology, Mianyang 621010, China
3. Institute of Liberal Arts and Science, Kanazawa University, Kanazawa City 920-1192, Japan
Abstract
The complexity of deep neural network models (DNNs) severely limits their deployment on devices with limited computing and storage resources. Knowledge distillation (KD) is an attractive model compression technique that can effectively alleviate this problem. Multi-teacher knowledge distillation (MKD) aims to leverage the valuable and diverse knowledge distilled from multiple teacher networks to improve the performance of the student network. Existing approaches typically fuse the distilled knowledge by simply averaging the prediction logits or by applying sub-optimal weighting strategies. Such techniques cannot fully reflect the relative importance of each teacher and may even mislead the student's learning. To address this issue, we propose a novel Decoupled Multi-Teacher Knowledge Distillation based on Entropy (DE-MKD). DE-MKD decouples the vanilla knowledge distillation loss and assigns each teacher an adaptive weight, derived from the entropy of its predictions, to reflect its importance. Furthermore, we extend the proposed approach to distill intermediate features from multiple powerful but cumbersome teachers to further improve the lightweight student network. Extensive experiments on the publicly available CIFAR-100 image classification benchmark with various teacher-student network pairs demonstrated the effectiveness and flexibility of our approach. For instance, the VGG8 and ShuffleNetV2 students trained with DE-MKD reached 75.25% and 78.86% top-1 accuracy when taught by VGG13 and WRN40-2, respectively, setting new performance records. In addition, and somewhat surprisingly, the distilled student outperformed its teacher in both teacher-student network pairs.
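The central idea described above, weighting several teachers' soft targets by the entropy of their predictions, can be illustrated with a minimal sketch. The sketch below assumes a PyTorch setting; the function name entropy_weighted_mkd_loss, the softmax-over-negative-entropy weighting, and the plain (non-decoupled) KL term are illustrative assumptions and do not reproduce the paper's exact decoupled loss or its feature-level extension.

import torch
import torch.nn.functional as F

def entropy_weighted_mkd_loss(student_logits, teacher_logits_list, temperature=4.0):
    # Illustrative multi-teacher KD loss: each teacher's soft-target term is
    # weighted by a score derived from the entropy of its predictions
    # (lower entropy = more confident teacher = larger weight).
    entropies = []
    for t_logits in teacher_logits_list:
        p = F.softmax(t_logits / temperature, dim=1)
        entropies.append(-(p * torch.log(p + 1e-8)).sum(dim=1).mean())
    entropies = torch.stack(entropies)

    # Adaptive weights: softmax over negative entropies (an assumption made
    # here for illustration, not necessarily the exact weighting in DE-MKD).
    weights = F.softmax(-entropies, dim=0)

    # Weighted sum of per-teacher soft-target KL terms.
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    loss = student_logits.new_zeros(())
    for w, t_logits in zip(weights, teacher_logits_list):
        p_teacher = F.softmax(t_logits / temperature, dim=1)
        kl = F.kl_div(log_p_student, p_teacher, reduction="batchmean")
        loss = loss + w * kl * (temperature ** 2)
    return loss

# Example usage with random logits for a batch of 8 samples and 100 classes.
student = torch.randn(8, 100)
teachers = [torch.randn(8, 100) for _ in range(3)]
print(entropy_weighted_mkd_loss(student, teachers))

In a full training loop this distillation term would typically be combined with the cross-entropy loss on ground-truth labels and, as the abstract notes, extended to intermediate feature maps of the teachers.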