Register Tiling for Unstructured Sparsity in Neural Network Inference

Published:2023-06-06 Issue:PLDI Volume:7 Page:1995-2020
ISSN:2475-1421
Container-title:Proceedings of the ACM on Programming Languages
language:en
Short-container-title:Proc. ACM Program. Lang.

Author:

Wilkinson Lucas¹, Cheshmi Kazem², Dehnavi Maryam Mehri¹

Affiliation:

1. University of Toronto, Canada

2. McMaster University, Canada

Abstract

Unstructured sparse neural networks are an important class of machine learning (ML) models, as they compact model size and reduce floating point operations. The execution time of these models is frequently dominated by the sparse matrix multiplication (SpMM) kernel, C = A × B, where A is a sparse matrix, and B and C are dense matrices. The unstructured sparsity pattern of matrices in pruned machine learning models along with their sparsity ratio has rendered useless the large class of libraries and systems that optimize sparse matrix multiplications. Reusing registers is particularly difficult because accesses to memory locations should be known statically. This paper proposes Sparse Register Tiling, a new technique composed of an unroll-and-sparse-jam transformation followed by data compression that is specifically tailored to sparsity patterns in ML matrices. Unroll-and-sparse-jam uses sparsity information to jam the code while improving register reuse. Sparse register tiling is evaluated across 2396 weight matrices from transformer and convolutional models with a sparsity range of 60-95% and provides an average speedup of 1.72× and 2.65× over MKL SpMM and dense matrix multiplication, respectively, on a multicore CPU processor. It also provides an end-to-end speedup of 2.12× for MobileNetV1 with 70% sparsity on an ARM processor commonly used in edge devices.
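
For orientation, the SpMM kernel C = A × B named in the abstract can be sketched as a plain loop nest over a compressed sparse row (CSR) representation of A. This is only a baseline illustration, not the paper's Sparse Register Tiling or unroll-and-sparse-jam; the function name `spmm_csr`, the CSR array names, and the small example matrices are assumptions made for this sketch. It does show why register reuse is hard: the row of B that is read depends on `indices[p]`, which is known only at run time.

```python
def spmm_csr(indptr, indices, data, B, n_cols_b):
    """Baseline SpMM: C = A x B, with A in CSR form and B, C dense.

    indptr, indices, data are the usual CSR arrays for A (hypothetical
    names chosen for this sketch). B is a list of dense rows.
    """
    n_rows = len(indptr) - 1
    C = [[0.0] * n_cols_b for _ in range(n_rows)]
    for i in range(n_rows):
        row_c = C[i]
        # Visit only the stored nonzeros of row i of A.
        for p in range(indptr[i], indptr[i + 1]):
            k = indices[p]   # column of the nonzero, known only at run time
            a = data[p]      # value of A[i][k]
            row_b = B[k]     # indirect access into B
            for j in range(n_cols_b):
                row_c[j] += a * row_b[j]
    return C


# Example: A = [[1, 0, 2],
#               [0, 0, 3]]  stored in CSR form.
indptr = [0, 2, 3]
indices = [0, 2, 2]
data = [1.0, 2.0, 3.0]
B = [[1.0, 2.0],
     [3.0, 4.0],
     [5.0, 6.0]]
C = spmm_csr(indptr, indices, data, B, n_cols_b=2)
```

Here both rows of A have a nonzero in column 2, so row 2 of B is loaded twice; reuse of this kind is what the abstract says sparsity information is used to exploit.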

Funder

NSERC

NSERC Discovery

Publisher

Association for Computing Machinery (ACM)

Subject

Safety, Risk, Reliability and Quality, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3591302

References: 64 articles.

1. uops.info

2. Hasan Metin Aktulga, Aydin Buluç, Samuel Williams, and Chao Yang. 2014. Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations. In 2014 IEEE 28th International Parallel and Distributed Processing Symposium. 1213–1222. https://doi.org/10.1109/IPDPS.2014.125

3. MOSEK ApS. 2022. MOSEK Optimization Suite. https://docs.mosek.com/10.0/pythonapi.pdf

4. ARM. 2015. Cortex-A72 Software Optimization Guide Application Note UAN 0016A. https://developer.arm.com/documentation/uan0016/a/

5. ARM. 2022. ARM Compute Library. https://github.com/ARM-software/ComputeLibrary

Cited by 1 article.

1. Runtime Composition of Iterations for Fusing Loop-carried Sparse Dependence;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2023-11-11