LO-SpMM: Low-cost Search for High-performance SpMM Kernels on GPUs

Authors:

Lin Junqing¹, Sun Jingwei¹, Shi Xiaolong¹, Zhang Honghe¹, Yu Xianzhi², Wang Xinzhi², Yao Jun², Sun Guangzhong¹

Affiliations:

1. Computer Science and Technology, University of Science and Technology of China, Hefei, China

2. Huawei Noah's Ark Lab, Shenzhen, China

Abstract

As deep neural networks (DNNs) grow larger and more complicated, pruning techniques have been proposed to lower memory footprint and enable more efficient inference. The most critical kernel for executing pruned sparse DNNs on GPUs is sparse-dense matrix multiplication (SpMM). Although advanced tensor compilers can generate high-performance SpMM implementations, they often take a long time to iteratively search tuning configurations, which slows down the cycle of exploring better DNN architectures or pruning algorithms. In this paper, we propose LO-SpMM to efficiently generate high-performance SpMM implementations for sparse DNN inference. Based on an analysis of the nonzero elements' layout, a characterization of the GPU architecture, and a rank-based cost model, LO-SpMM effectively reduces the search space and eliminates likely low-performance candidates. Moreover, rather than generating complete SpMM implementations for evaluation, LO-SpMM constructs simplified proxies to quickly estimate performance, substantially reducing compilation and execution costs. Experimental results show that LO-SpMM reduces search time by up to 281×, while the performance of the generated SpMM implementations is comparable to or better than state-of-the-art sparse tensor compiling solutions.
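For orientation, SpMM computes C = A × B, where A is a sparse matrix (here, a pruned weight matrix stored in a compressed format such as CSR) and B is a dense matrix (e.g., activations). The CUDA sketch below is a deliberately naive CSR baseline meant only to illustrate the kernel being tuned; it is not LO-SpMM's generated code, and the kernel name, signature, and one-thread-per-output-element mapping are our assumptions.

```cuda
// Minimal CSR-based SpMM baseline: C = A * B, with A an M x K sparse matrix
// in CSR form and B a K x N dense matrix, both row-major. Illustrative only;
// not the paper's generated implementation.
#include <cuda_runtime.h>

__global__ void csr_spmm_naive(int M, int N,
                               const int   *row_ptr,  // M+1 CSR row offsets of A
                               const int   *col_idx,  // column index per nonzero
                               const float *vals,     // value per nonzero
                               const float *B,        // K x N dense, row-major
                               float       *C)        // M x N dense, row-major
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of A / C
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of B / C
    if (row >= M || col >= N) return;

    float acc = 0.0f;
    // Walk the nonzeros of A's row; each one selects a row of B.
    for (int p = row_ptr[row]; p < row_ptr[row + 1]; ++p)
        acc += vals[p] * B[col_idx[p] * N + col];
    C[row * N + col] = acc;
}
```

Tensor compilers explore a large space of alternatives to this simple mapping (tiling shapes, shared-memory staging, vectorized loads, thread coarsening), and it is precisely this configuration search whose cost LO-SpMM aims to cut.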

Publisher

Association for Computing Machinery (ACM)

