Affiliations:
1. Ohio State University, USA
2. IBM, USA
3. IBM, Canada
Abstract
Exploitation of parallel architectures has become critical to scalable machine learning (ML). Since a wide range of ML algorithms employ linear algebraic operators, GPUs with BLAS libraries are a natural choice for such exploitation. Two approaches are commonly pursued: (i) developing GPU-accelerated implementations of complete ML algorithms; and (ii) developing GPU kernels for primitive linear algebraic operators, such as matrix-vector multiplication, which are then used as building blocks for ML algorithms. This paper extends the latter approach by developing fused kernels for combinations of primitive operators that commonly occur in popular ML algorithms. We identify the generic pattern of computation alpha * X^T * (v .* (X * y)) + beta * z, where .* denotes element-wise multiplication, and its various instantiations. We develop a fused kernel to optimize this computation on GPUs, with specialized techniques to handle both sparse and dense matrices. This approach not only reduces the cost of data loads due to improved temporal locality but also enables other optimizations, such as coarsening and hierarchical aggregation of partial results. We also present an analytical model that considers input data characteristics and available GPU resources to estimate near-optimal settings for kernel launch parameters. The proposed approach provides speedups ranging from 2x to 67x for different instances of the generic pattern compared to launching multiple operator-level kernels using GPU-accelerated libraries. We conclude by demonstrating the effectiveness of the approach in improving end-to-end performance of an entire ML algorithm.
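To make the fused pattern concrete, the following is a minimal CUDA sketch for the dense case. The kernel name, the fixed block size, and the one-block-per-row mapping are illustrative assumptions, not the paper's implementation; the paper additionally handles sparse matrices and derives launch parameters from its analytical model.

#include <cuda_runtime.h>

#define THREADS 256  // block size; illustrative -- the paper tunes this analytically

// Sketch of the fused pattern w = alpha * X^T * (v .* (X * y)) + beta * z
// for a dense, row-major X of size m x n. One thread block per row i of X:
// the block first reduces the dot product t = X[i,:] . y in shared memory,
// then scatters alpha * v[i] * t * X[i,:] into w with atomics, reusing the
// row while it is still resident in cache (the temporal-locality benefit
// of fusion). The caller must pre-initialize w[j] = beta * z[j].
__global__ void fused_pattern_dense(const float *X, const float *y,
                                    const float *v, float alpha,
                                    float *w, int m, int n) {
    __shared__ float sdata[THREADS];
    int row = blockIdx.x;
    if (row >= m) return;
    const float *Xrow = X + (size_t)row * n;

    // Phase 1: block-wide reduction of the row dot product (X * y)[row].
    float partial = 0.0f;
    for (int j = threadIdx.x; j < n; j += THREADS)
        partial += Xrow[j] * y[j];
    sdata[threadIdx.x] = partial;
    __syncthreads();
    for (int s = THREADS / 2; s > 0; s >>= 1) {  // THREADS is a power of two
        if (threadIdx.x < s) sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    float scale = alpha * v[row] * sdata[0];  // alpha * v_i * (X*y)_i

    // Phase 2: fused X^T step -- accumulate the scaled row into w.
    for (int j = threadIdx.x; j < n; j += THREADS)
        atomicAdd(&w[j], scale * Xrow[j]);
}

A host would initialize w to beta * z and then launch one block per row, e.g. fused_pattern_dense<<<m, THREADS>>>(dX, dy, dv, alpha, dw, m, n). The key design point is that each row of X is read twice within a single block, so the second (X^T) pass hits cache, whereas launching separate operator-level kernels would stream X from device memory twice.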
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design; Software
Cited by
13 articles.
1. Optimizing Deep Learning Inference via Global Analysis and Tensor Expressions;Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1;2024-04-17
2. NIOT: A Novel Inference Optimization of Transformers on Modern CPUs;IEEE Transactions on Parallel and Distributed Systems;2023-06
3. Collage;Proceedings of the International Conference on Parallel Architectures and Compilation Techniques;2022-10-08
4. Mobile or FPGA? A Comprehensive Evaluation on Energy Efficiency and a Unified Optimization Framework;ACM Transactions on Embedded Computing Systems;2022-09-30
5. FuseME: Distributed Matrix Computation Engine based on Cuboid-based Fused Operator and Plan Generation;Proceedings of the 2022 International Conference on Management of Data;2022-06-10