Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT

Published:2020-04-03 Issue:05 Volume:34 Page:8815-8821
ISSN:2374-3468
Container-title:Proceedings of the AAAI Conference on Artificial Intelligence
Short-container-title:AAAI

Author:

Shen Sheng,Dong Zhen,Ye Jiayu,Ma Linjian,Yao Zhewei,Gholami Amir,Mahoney Michael W.,Keutzer Kurt

Abstract

Transformer-based architectures have become the de facto models for a range of Natural Language Processing tasks. In particular, BERT-based models have achieved significant accuracy gains on GLUE tasks, CoNLL-03, and SQuAD. However, BERT-based models have a prohibitive memory footprint and latency. As a result, deploying BERT-based models in resource-constrained environments has become a challenging task. In this work, we perform an extensive analysis of fine-tuned BERT models using second-order Hessian information, and we use our results to propose a novel method for quantizing BERT models to ultra-low precision. In particular, we propose a new group-wise quantization scheme, and we use a Hessian-based mixed-precision method to compress the model further. We extensively test our proposed method on the BERT downstream tasks of SST-2, MNLI, CoNLL-03, and SQuAD. We achieve performance comparable to the baseline, with at most 2.3% performance degradation, even with ultra-low precision quantization down to 2 bits, corresponding to up to 13× compression of the model parameters and up to 4× compression of the embedding table as well as activations. Among all tasks, we observed the highest performance loss for BERT fine-tuned on SQuAD. By probing into the Hessian-based analysis as well as visualization, we show that this is related to the fact that the current training/fine-tuning strategy of BERT does not converge for SQuAD.
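The abstract names a group-wise quantization scheme but does not describe its details. The sketch below illustrates only the general idea, under stated assumptions: groups are taken as contiguous blocks of rows, each group gets its own min/max range, and values are rounded to uniformly spaced levels. The function name `quantize_groupwise` and its parameters are hypothetical and are not taken from the paper.

```python
import numpy as np

def quantize_groupwise(weights, num_groups, bits):
    """Simulate uniform quantization with one value range per row group.

    Assumption for illustration: groups are contiguous row blocks and
    num_groups does not exceed the number of rows.
    """
    levels = 2 ** bits - 1                       # highest integer code
    out = np.empty(weights.shape, dtype=np.float64)
    for rows in np.array_split(np.arange(weights.shape[0]), num_groups):
        block = weights[rows]
        lo, hi = block.min(), block.max()
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes = np.round((block - lo) / scale)   # integer codes in [0, levels]
        out[rows] = codes * scale + lo           # map codes back to real values
    return out

# Two rows with very different value ranges.
w = np.array([[0.0, 0.1, 0.2, 0.3],
              [0.0, 10.0, 20.0, 30.0]])

single = quantize_groupwise(w, num_groups=1, bits=2)     # one shared range
per_group = quantize_groupwise(w, num_groups=2, bits=2)  # one range per row

err_single = np.abs(w - single).max()
err_group = np.abs(w - per_group).max()
```

In this toy input, a single shared range collapses the small-valued row onto one level, while per-group ranges keep both rows distinguishable at the same 2-bit width. This shows one plausible reason for using per-group ranges at very low precision; it is not a reproduction of the paper's results.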

Publisher

Association for the Advancement of Artificial Intelligence (AAAI)

Subject

General Medicine

Cited by 104 articles.

1. Trainable pruned ternary quantization for medical signal classification models;Neurocomputing;2024-10

2. LiRank: Industrial Large Scale Ranking Models at LinkedIn;Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining;2024-08-24

3. Layerwised multimodal knowledge distillation for vision-language pretrained model;Neural Networks;2024-07

4. From Static to Dynamic: A Deeper, Faster, and Adaptive Language Modeling Approach;2024 International Joint Conference on Neural Networks (IJCNN);2024-06-30

5. Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29