Improving Computation and Memory Efficiency for Real-world Transformer Inference on GPUs

Published: 2023-10-26 Issue: 4 Volume: 20 Page: 1-22
ISSN: 1544-3566
Container-title: ACM Transactions on Architecture and Code Optimization
Language: en
Short-container-title: ACM Trans. Archit. Code Optim.

Author:

Du Jiangsu¹, Jiang Jiazhi¹, Zheng Jiang¹, Zhang Hongbin¹, Huang Dan¹, Lu Yutong¹

Affiliation:

1. School of Computer Science and Engineering, Sun Yat-sen University, China

Abstract

Transformer models have emerged as a leading approach in natural language processing (NLP) and are increasingly deployed in production environments. Graphics processing units (GPUs) have become a popular choice for transformer deployment and often rely on batch processing to achieve high hardware performance. Nonetheless, current practice for transformer inference incurs redundant computation and memory use because sequence lengths in NLP scenarios follow a heavy-tailed distribution, resulting in low practical performance. In this article, we propose a unified solution for improving both the computation and memory efficiency of real-world transformer inference on GPUs. The solution eliminates redundant computation and memory footprint across a transformer model. First, a GPU-oriented computation approach is proposed to process the self-attention module in a fine-grained manner, eliminating its redundant computation. Next, the multi-layer perceptron module continues to use the word-accumulation approach to eliminate its redundant computation. Then, to better unify the fine-grained approach and the word-accumulation approach, the solution organizes the data layout of the self-attention module at block granularity. Because these approaches greatly reduce the required memory size and cause it to fluctuate constantly, we propose a chunk-based approach that better balances memory footprint against allocation/free efficiency. Our experimental results show that, compared with prevailing frameworks, the unified solution reduces average latency by 28% on the entire transformer model and by 63.8% on the self-attention module, and reduces the memory footprint of intermediate results by 7.8×.
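The padding redundancy that the abstract attributes to heavy-tailed sequence lengths can be illustrated with a minimal sketch. This is not the authors' GPU implementation: the function names (`padding_waste`, `pack_tokens`, `unpack_tokens`) and the sample lengths are hypothetical, and the packing shown is only a CPU-side analogy for processing valid tokens without padding, which is assumed here to be the general idea behind the word-accumulation approach.

```python
# Illustrative only: names and sample data are assumptions, not from the paper.

def padding_waste(lengths):
    """Return (padded_tokens, valid_tokens, wasted_fraction) for one batch
    in which every sequence is padded to the longest sequence."""
    max_len = max(lengths)
    padded_tokens = max_len * len(lengths)
    valid_tokens = sum(lengths)
    wasted_fraction = 1 - valid_tokens / padded_tokens
    return padded_tokens, valid_tokens, wasted_fraction


def pack_tokens(batch):
    """Concatenate only the valid tokens of each sequence and record
    prefix-sum offsets so the original sequences can be recovered."""
    packed = []
    offsets = [0]
    for seq in batch:
        packed.extend(seq)
        offsets.append(len(packed))
    return packed, offsets


def unpack_tokens(packed, offsets):
    """Split a packed token list back into per-sequence lists."""
    return [packed[offsets[i]:offsets[i + 1]] for i in range(len(offsets) - 1)]


# Hypothetical heavy-tailed batch: four short sequences and one long one.
lengths = [5, 7, 9, 12, 128]
padded_tokens, valid_tokens, wasted_fraction = padding_waste(lengths)

# Token ids are placeholders; only the sequence lengths matter here.
batch = [list(range(n)) for n in lengths]
packed, offsets = pack_tokens(batch)
restored = unpack_tokens(packed, offsets)
```

In this hypothetical batch, a single long sequence forces every short sequence to be padded to 128 tokens, so most of the padded tensor holds no useful data. Packing keeps only the valid tokens, and the offsets let per-sequence operations such as self-attention locate each sequence's boundaries.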

Funder

National Key R&D Program of China

Major Program of Guangdong Basic and Applied Research

Natural Science Foundation of China

Guangdong Province Special Support Program for Cultivating High-Level Talents

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3617689

References: 32 articles.

1. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard et al. 2016. TensorFlow: A system for large-scale machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI’16). 265–283.

2. Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley et al. 2022. DeepSpeed-inference: Enabling efficient inference of transformer models at unprecedented scale. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC’22). IEEE, 1–15.

3. Michaël Benesty. 2021. Hugging Face Transformer Inference Under 1 Millisecond Latency. Retrieved August 7, 2023 from https://towardsdatascience.com/hugging-face-transformer-inference-under-1-millisecond-latency-e1be0057a51c

4. Shiyang Chen, Shaoyi Huang, Santosh Pandey, Bingbing Li, Guang R. Gao, Long Zheng, Caiwen Ding, and Hang Liu. 2021. E.T.: Re-thinking self-attention for transformer models on GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–18.

5. Tianqi Chen, Thierry Moreau, Ziheng Jiang, Lianmin Zheng, Eddie Yan, Haichen Shen, Meghan Cowan, Leyuan Wang, Yuwei Hu, Luis Ceze et al. 2018. TVM: An automated end-to-end optimizing compiler for deep learning. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI’18). 578–594.

Cited by 1 article.

1. DESTINE: Dynamic Goal Queries with Temporal Transductive Alignment for Trajectory Prediction. 2024 IEEE International Conference on Robotics and Automation (ICRA). 2024-05-13.