Large Language Model Inference Acceleration Based on Hybrid Model Branch Prediction

Published: 2024-04-05 Issue: 7 Volume: 13 Page: 1376
ISSN: 2079-9292
Container-title: Electronics
Language: en
Short-container-title: Electronics

Author:

Duan Gaoxiang¹², Chen Jiajie¹², Zhou Yueying¹², Zheng Xiaoying¹², Zhu Yongxin¹²

Affiliation:

1. Shanghai Advanced Research Institute, Chinese Academy of Sciences, Shanghai 201210, China

2. University of Chinese Academy of Sciences, Beijing 100049, China

Abstract

As deep learning models grow, longer inference time has become a significant obstacle to the efficiency and practicality of autoregressive models. This work introduces a hybrid-model acceleration strategy based on branch prediction, which accelerates autoregressive inference without retraining and keeps the output consistent with the original model. Specifically, the algorithm employs two models of different parameter sizes targeting the same task. The smaller model generates a series of candidate tokens, which the larger model then validates in parallel to decide whether to accept them. By orchestrating the workflow of the large and small models through a branch-prediction strategy, the algorithm hides the validation time of the larger model when predictions are successful, thereby accelerating inference. We propose a binomial distribution-based prediction function that blends theoretical principles with empirical evidence, designed specifically for accelerating inference within a hybrid-model framework. The entire algorithm was designed and implemented on the LLaMA model for text generation and translation tasks. The proposed algorithm achieves a 1.2× to 3.4× increase in inference speed over the original model and consistently outperforms the speculative sampling inference acceleration algorithm.
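
The draft-then-validate idea in the abstract can be illustrated with a minimal Python sketch. It covers only the generic scheme under greedy decoding: a small model proposes a few tokens, and the large model's own choice decides each position. It does not reproduce the paper's branch-prediction scheduling or its binomial prediction function, whose details the abstract does not give. All names here (`draft_then_verify`, `greedy_decode`, `small_next`, `large_next`, `draft_len`) are hypothetical, and the two "models" are toy arithmetic functions standing in for real networks.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def greedy_decode(prompt: List[Token], next_token: NextToken, max_new_tokens: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(next_token(out))
    return out


def draft_then_verify(
    prompt: List[Token],
    small_next: NextToken,
    large_next: NextToken,
    max_new_tokens: int,
    draft_len: int = 4,  # assumed draft window; not a value from the paper
) -> List[Token]:
    """Small model drafts up to draft_len tokens; large model checks them."""
    out = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(out) < target_len:
        k = min(draft_len, target_len - len(out))

        # 1. Draft k candidate tokens with the small model.
        draft: List[Token] = []
        for _ in range(k):
            draft.append(small_next(out + draft))

        # 2. Validate each drafted position against the large model.
        #    A real system would score all k positions in one batched
        #    forward pass; this loop only mimics that result.
        accepted: List[Token] = []
        for i in range(k):
            expected = large_next(out + draft[:i])
            if draft[i] == expected:
                accepted.append(draft[i])
            else:
                # First mismatch: keep the large model's token, drop the rest.
                accepted.append(expected)
                break
        out.extend(accepted)
    return out


# Toy stand-ins for the two models (purely illustrative).
def large_next(ctx: List[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 7


def small_next(ctx: List[Token]) -> Token:
    # Agrees with the large model except after token 5.
    return 0 if ctx[-1] == 5 else (ctx[-1] * 3 + 1) % 7


if __name__ == "__main__":
    fast = draft_then_verify([1], small_next, large_next, max_new_tokens=10)
    slow = greedy_decode([1], large_next, max_new_tokens=10)
    print(fast == slow)
```

Because every emitted token is either a draft the large model agrees with or the large model's own correction, the sequence matches plain greedy decoding with the large model, which is the output-consistency property the abstract describes. Any speed-up in practice depends on how often the drafts are accepted and on validating them in a single parallel pass.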

Funder

National Natural Science Foundation of China

National SKA Program of China

Publisher

MDPI AG

Link

https://www.mdpi.com/2079-9292/13/7/1376/pdf

References (21 articles)

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Adv. Neural Inf. Process. Syst., 30.

2. Devlin, J., Chang, M.W., Lee, K., and Toutanova, K. (2019, June 2–7). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT, Minneapolis, MN, USA.

3. Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., and Azhar, F. (2023). Llama: Open and efficient foundation language models. arXiv.

4. Liu, Y., Pan, D., Zhang, H., and Zhong, K. (2023). Degradation-Trend-Aware Deep Neural Network with Attention Mechanism for Bearing Remaining Useful Life Prediction. IEEE Trans. Artif. Intell., 1–15.

5. Kitaev, N., Kaiser, Ł., and Levskaya, A. (2020). Reformer: The efficient transformer. arXiv.