Affiliation:
1. College of Computer Science and Technology (College of Data Science), Taiyuan University of Technology, Taiyuan, China
2. Computer Science, The University of Texas at Dallas, Richardson, United States
3. University of Texas at Dallas, Richardson, United States
Abstract
Large Language Models (LLMs) have received much recent attention due to their human-level accuracy. While existing work mostly focuses on improving accuracy or testing accuracy robustness, the computation efficiency of LLMs, which is of paramount importance due to vast generation demands and real-time requirements, has surprisingly received little attention. In this article, we make the first attempt to understand and test potential computation efficiency robustness in state-of-the-art LLMs. By analyzing the working mechanism and implementation of 20,543 publicly accessible LLMs, we observe a fundamental property of LLMs that can be manipulated adversarially to significantly reduce computation efficiency: the output length, rather than the input, determines the computation efficiency of an LLM. The output length in turn depends on two factors: a pre-configured threshold on the maximum number of decoding iterations, which is often set pessimistically large, and a runtime-generated end-of-sentence (EOS) token. Our key idea is to generate test inputs that sufficiently delay the generation of the EOS token, so that LLMs must run through enough iterations to reach the pre-configured threshold. We present LLMEffiChecker, which works in both white-box and black-box settings. In the white-box setting, LLMEffiChecker develops a gradient-guided technique that searches for minimal, unnoticeable perturbations at the character, token, and structure levels. In the black-box setting, LLMEffiChecker employs a causal inference-based approach to find critical tokens and applies the same three levels of imperceptible perturbation to them. Both techniques effectively delay the appearance of the EOS token, forcing generation to reach a threshold that is rarely hit by normal inputs. To demonstrate the effectiveness of LLMEffiChecker, we conduct a systematic evaluation on nine publicly available LLMs: Google T5, AllenAI WMT14, Helsinki-NLP translator, Facebook FairSeq, UNICAMP-DL translator, MarianMT, Google FLAN-T5, MBZUAI LaMini-GPT, and Salesforce CodeGen. Experimental results show that LLMEffiChecker can increase LLMs' response latency and energy consumption by, on average, 325% to 3,244% and 344% to 3,616%, respectively, by perturbing just one character or token in the input sentence. Our case study shows that inputs generated by LLMEffiChecker significantly affect battery power on real-world mobile devices (i.e., they drain more than 30 times the battery power of normal inputs).
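To make the decoding mechanism behind this observation concrete, the sketch below shows a minimal greedy autoregressive decoding loop in the style of Hugging Face Transformers. The t5-small checkpoint and the max_new_tokens value are illustrative assumptions, not the paper's exact setup; the point is that each iteration runs a full forward pass, and the loop ends either when the model emits the EOS token or when the pre-configured threshold is hit.

```python
# Minimal sketch of greedy autoregressive decoding, illustrating why output
# length (not input length) dominates LLM inference cost. The model choice
# and max_new_tokens value are illustrative assumptions.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
model.eval()

def greedy_decode(text: str, max_new_tokens: int = 200) -> str:
    """Decode greedily; cost grows linearly with the number of iterations."""
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    decoder_ids = torch.tensor([[model.config.decoder_start_token_id]])
    with torch.no_grad():
        for _ in range(max_new_tokens):  # pre-configured threshold
            logits = model(input_ids=input_ids,
                           decoder_input_ids=decoder_ids).logits
            next_id = logits[0, -1].argmax().view(1, 1)
            decoder_ids = torch.cat([decoder_ids, next_id], dim=-1)
            if next_id.item() == tokenizer.eos_token_id:
                break  # runtime-generated EOS ends decoding early
    return tokenizer.decode(decoder_ids[0], skip_special_tokens=True)

print(greedy_decode("translate English to German: Good morning."))
```

Under this loop structure, an input perturbation that delays the EOS token pushes decoding toward the max_new_tokens bound, multiplying the number of forward passes and hence latency and energy consumption.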
Publisher
Association for Computing Machinery (ACM)