Affiliations:
1. Alibaba Group, Hangzhou, China
2. Renmin University of China & Alibaba Group, Beijing, China
3. Alibaba Group, Beijing, China
4. Alibaba Group, Shanghai, China
5. Renmin University of China, Beijing, China
6. Tsinghua University, Beijing, China
Abstract
Compiler optimization plays an increasingly important role in boosting the performance of machine learning models for data processing and management. As data grows more complex, dynamic tensor shapes have become common in ML models. However, existing ML compilers either handle only static-shape models or suffer a range of performance problems in both operator fusion and code generation under dynamic shapes. This paper tackles the two main challenges of dynamic shape optimization: fusion optimization without concrete shape values, and code generation that supports arbitrary shapes. To address the fundamental absence of shape values, it systematically abstracts and extracts shape information and designs a cross-level symbolic shape representation. Based on the insight that fusion optimization relies on the tensor shape relationships between adjacent operators rather than on exact shape values, it proposes a dynamic shape fusion approach driven by shape-information propagation. To generate code that adapts efficiently to arbitrary shapes, it proposes a combined compile-time and runtime code generation approach. Finally, it presents a complete optimization pipeline for dynamic shape models and implements an industrial-grade ML compiler named BladeDISC. Extensive evaluation demonstrates that BladeDISC outperforms PyTorch, TorchScript, TVM, ONNX Runtime, XLA, Torch Inductor (dynamic shape), and TensorRT by up to 6.95×, 6.25×, 4.08×, 2.04×, 2.06×, 7.92×, and 4.16× (3.54×, 3.12×, 1.95×, 1.47×, 1.24×, 2.93×, and 1.46× on average) in end-to-end inference speedup on the A10 and T4 GPUs, respectively. BladeDISC's source code is publicly available at https://github.com/alibaba/BladeDISC.
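The core insight above can be illustrated with a minimal sketch. This is not BladeDISC's actual API; the classes and the `can_fuse` helper below are hypothetical, written only to show how a fusion decision can be made from symbolic shape relationships between adjacent operators, without ever knowing the concrete shape values:

```python
# Hypothetical sketch (not BladeDISC's API): fusion decisions based on
# symbolic shape *relationships* between adjacent operators, not on
# concrete shape values.
from dataclasses import dataclass


@dataclass(frozen=True)
class SymDim:
    """A dimension known only symbolically, e.g. "batch" or "seq_len"."""
    name: str


@dataclass
class Op:
    kind: str                       # e.g. "elementwise", "reduce"
    out_shape: tuple                # tuple of SymDim

def can_fuse(producer: Op, consumer: Op) -> bool:
    """Two elementwise ops are fusible when their symbolic output shapes
    match dimension-by-dimension: equal symbols imply equal runtime
    values, so the exact values are never needed at compile time."""
    if producer.kind != "elementwise" or consumer.kind != "elementwise":
        return False
    return producer.out_shape == consumer.out_shape


batch, hidden = SymDim("batch"), SymDim("hidden")
add = Op("elementwise", (batch, hidden))    # e.g. bias add
relu = Op("elementwise", (batch, hidden))   # following activation
print(can_fuse(add, relu))  # True: same symbols, values unknown
```

In a real compiler the symbolic dimensions would be produced by propagating shape information across the graph (the paper's cross-level symbolic shape representation), and the fusibility test would cover broadcasting and reduction patterns beyond the identical-shape case shown here.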
Publisher
Association for Computing Machinery (ACM)