1. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is All you Need. In: Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA; 2017. p. 5998–6008.
2. Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: NAACL-HLT (1). Association for Computational Linguistics; 2019. p. 4171–4186.
3. Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, et al. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J Mach Learn Res. 2020;21(140):1–67.
4. Scao TL, Fan A, Akiki C, Pavlick E, Ilic S, Hesslow D, et al. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. CoRR. 2022;abs/2211.05100.
5. Wang Z, Li M, Xu R, Zhou L, Lei J, Lin X, et al. Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners. CoRR. 2022;abs/2205.10747.