Knowledge Distillation from Internal Representations

Published:2020-04-03 Issue:05 Volume:34 Page:7350-7357
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

Aguilar Gustavo, Ling Yuan, Zhang Yu, Yao Benjamin, Fan Xing, Guo Chenlei

Abstract

Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.
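The distinction the abstract draws between soft-label distillation and distillation from internal representations can be illustrated with a minimal sketch. The abstract does not state how the two internal-representation objectives are formulated, so everything below is an assumption made for illustration: the function names, the temperature, the weighting factor `alpha`, and the choice of cosine distance between hidden vectors are all hypothetical and are not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_label_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy of the student's distribution against the teacher's soft labels."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))

def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def internal_loss(student_hidden, teacher_hidden):
    """Mean cosine distance over paired layers (an assumed stand-in objective)."""
    pairs = list(zip(student_hidden, teacher_hidden))
    return sum(cosine_distance(s, t) for s, t in pairs) / len(pairs)

def combined_loss(student_logits, teacher_logits,
                  student_hidden, teacher_hidden, alpha=0.5):
    """Soft-label loss plus a weighted internal-representation term."""
    return (soft_label_loss(student_logits, teacher_logits)
            + alpha * internal_loss(student_hidden, teacher_hidden))

# Toy case: the student reproduces the teacher's logits exactly,
# yet its hidden vectors point in different directions.
teacher_logits = [2.0, 0.5, -1.0]
student_logits = [2.0, 0.5, -1.0]
teacher_hidden = [[1.0, 0.0], [0.0, 1.0]]  # one vector per matched layer
student_hidden = [[0.0, 1.0], [1.0, 0.0]]

soft_only = soft_label_loss(student_logits, teacher_logits)
mismatch = internal_loss(student_hidden, teacher_hidden)
total = combined_loss(student_logits, teacher_logits,
                      student_hidden, teacher_hidden)
```

In this toy case the soft-label term is already at its minimum because the output distributions are identical, while the internal term remains positive. This is the kind of mismatch the abstract describes: matching the soft labels alone places no constraint on the student's hidden representations.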

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 54 articles.

1. Computer Vision Model Compression Techniques for Embedded Systems: A Survey;Computers & Graphics;2024-10

2. Multi-label category enhancement fusion distillation based on variational estimation;Knowledge-Based Systems;2024-09

3. Relation-Based Multi-Teacher Knowledge Distillation;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30

4. Adaptive Cross-Architecture Mutual Knowledge Distillation;2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG);2024-05-27

5. Cross-Architecture Knowledge Distillation;International Journal of Computer Vision;2024-02-19