Optimizing Large Language Models on Multi-Core CPUs: A Case Study of the BERT Model

Authors:

Zhao Lanxin 1, Gao Wanrong 2, Fang Jianbin 2 (ORCID)

Affiliations:

1. School of International Business, Hunan University of Information Technology, Changsha 410151, China

2. School of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China

Abstract

The BERT model is regarded as the cornerstone of the various pre-trained large language models that have achieved promising results in recent years. This article investigates how to optimize the BERT model in terms of fine-tuning speed and prediction accuracy, aiming to accelerate its execution on multi-core processors and improve its prediction accuracy on typical downstream natural language processing tasks. Our contributions are two-fold. First, we port the BERT model to a multi-core shared-memory processor and parallelize its fine-tuning, accelerating the training process for downstream tasks. Second, we improve the prediction performance on typical downstream natural language processing tasks by tuning the model's hyperparameters. We select five typical downstream tasks (CoLA, SST-2, MRPC, RTE, and WNLI) and optimize them on the multi-core platform, taking the batch size, learning rate, and number of training epochs into account. Our experimental results show that increasing the number of CPUs and threads significantly reduces model training time, and that the time savings are concentrated primarily in the self-attention mechanism. Further results show that choosing reasonable hyperparameters improves the accuracy of the BERT model on downstream tasks, and that, given sufficient computing resources, appropriately increasing the batch size can significantly reduce training time.
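The article does not include its training scripts, so the sketch below is illustrative rather than the authors' implementation. Assuming a PyTorch and Hugging Face Transformers stack (neither is named in the abstract), it shows how the two knobs studied here are typically exercised: the CPU thread count for parallel fine-tuning, and the batch size, learning rate, and epoch hyperparameters for a downstream GLUE task such as CoLA. The checkpoint name, thread count, and hyperparameter values are placeholder choices.

```python
# Illustrative sketch only -- assumes PyTorch + Hugging Face Transformers/Datasets;
# the paper does not specify its software stack or exact settings.
import os
import time

# Set the OpenMP thread count before torch initializes its thread pools.
os.environ.setdefault("OMP_NUM_THREADS", "16")  # placeholder core count

import torch
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

torch.set_num_threads(16)  # intra-op parallelism: drives the matmuls inside self-attention

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # CoLA is a binary acceptability task
)

# CoLA, one of the five GLUE tasks studied in the paper.
dataset = load_dataset("glue", "cola")
encoded = dataset.map(
    lambda batch: tokenizer(
        batch["sentence"], truncation=True, padding="max_length", max_length=128
    ),
    batched=True,
)

args = TrainingArguments(
    output_dir="bert-cola-cpu",
    per_device_train_batch_size=32,  # batch size: one of the tuned hyperparameters
    learning_rate=2e-5,              # learning rate: another tuned hyperparameter
    num_train_epochs=3,              # training epochs: the third tuned hyperparameter
    no_cuda=True,                    # force CPU execution (use_cpu=True on newer versions)
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
)

start = time.time()
trainer.train()
print(f"Fine-tuning wall time: {time.time() - start:.1f} s")
print(trainer.evaluate())
```

Setting OMP_NUM_THREADS before importing torch matters because the OpenMP runtime reads it once at startup; varying that value (together with the torch.set_num_threads call) is the natural way to reproduce the thread-scaling measurements the abstract describes.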

Funders

Social Science Fund of Hunan Province, China

National Natural Science Foundation of China

Publisher

MDPI AG
