Bi-Level Orthogonal Multi-Teacher Distillation
Published: 2024-08-22
Volume: 13, Issue: 16
Page: 3345
ISSN: 2079-9292
Container-title: Electronics
Language: en
Authors:
Gong Shuyue 1, Wen Weigang 1
Affiliation:
1. School of Mechanical, Electronic and Control Engineering, Beijing Jiaotong University, Beijing 100044, China
Abstract
Multi-teacher knowledge distillation is a powerful technique that leverages diverse information sources from multiple pre-trained teachers to enhance student model performance. However, existing methods often overlook the challenge of effectively transferring knowledge to weaker student models. To address this limitation, we propose BOMD (Bi-level Optimization for Multi-teacher Distillation), a novel approach that combines bi-level optimization with multiple orthogonal projections. Our method employs orthogonal projections to align teacher feature representations with the student’s feature space while preserving structural properties. This alignment is further reinforced through a dedicated feature alignment loss. Additionally, we utilize bi-level optimization to learn optimal weighting factors for combining knowledge from heterogeneous teachers, treating the weights as upper-level variables and the student’s parameters as lower-level variables. Extensive experiments on multiple benchmark datasets demonstrate the effectiveness and flexibility of BOMD. Our method achieves state-of-the-art performance on the CIFAR-100 benchmark for multi-teacher knowledge distillation across diverse scenarios, consistently outperforming existing approaches. BOMD shows significant improvements for both homogeneous and heterogeneous teacher ensembles, even when distilling to compact student models.
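The following is a minimal, self-contained sketch of the two ingredients described in the abstract: (i) orthogonal projections that map each teacher's features into the student's feature space, and (ii) an alternating, first-order approximation of the bi-level optimization over teacher weighting factors. The dimensions, module names, toy data, and the use of PyTorch's orthogonal parametrization are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils.parametrizations import orthogonal

STUDENT_DIM, TEACHER_DIMS, NUM_CLASSES = 64, [128, 256], 10  # assumed sizes


class Student(nn.Module):
    """Toy student network that exposes its penultimate features."""
    def __init__(self, in_dim=32):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, STUDENT_DIM), nn.ReLU())
        self.head = nn.Linear(STUDENT_DIM, NUM_CLASSES)

    def forward(self, x):
        feat = self.backbone(x)
        return feat, self.head(feat)


# One (semi-)orthogonal projector per teacher: maps that teacher's features
# into the student's feature space while preserving structural properties.
projectors = nn.ModuleList(
    orthogonal(nn.Linear(d, STUDENT_DIM, bias=False)) for d in TEACHER_DIMS
)

student = Student()
teacher_weights = nn.Parameter(torch.zeros(len(TEACHER_DIMS)))  # upper-level variables

lower_opt = torch.optim.SGD(
    list(student.parameters()) + list(projectors.parameters()), lr=0.1
)
upper_opt = torch.optim.Adam([teacher_weights], lr=0.01)


def distill_loss(x, y, teacher_feats):
    """Task loss plus a teacher-weighted feature alignment loss."""
    feat, logits = student(x)
    w = torch.softmax(teacher_weights, dim=0)
    align = sum(
        w[i] * F.mse_loss(feat, projectors[i](tf)) for i, tf in enumerate(teacher_feats)
    )
    return F.cross_entropy(logits, y) + align


# Toy batches standing in for train/validation splits and frozen teacher features.
x_tr, y_tr = torch.randn(16, 32), torch.randint(0, NUM_CLASSES, (16,))
x_val, y_val = torch.randn(16, 32), torch.randint(0, NUM_CLASSES, (16,))
feats_tr = [torch.randn(16, d) for d in TEACHER_DIMS]
feats_val = [torch.randn(16, d) for d in TEACHER_DIMS]

for step in range(5):
    # Lower level: update student and projectors with teacher weights held fixed.
    lower_opt.zero_grad()
    distill_loss(x_tr, y_tr, feats_tr).backward()
    lower_opt.step()

    # Upper level: update teacher weights on held-out data (a first-order
    # approximation of the bi-level problem; the paper's exact procedure may differ).
    upper_opt.zero_grad()
    distill_loss(x_val, y_val, feats_val).backward()
    upper_opt.step()
```

The alternating update above is only one common way to approximate a bi-level objective; exact solutions typically require hypergradients through the lower-level optimization.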