Optimizing sparse general matrix–matrix multiplication for DCUs

Published:2024-05-30 Issue:14 Volume:80 Page:20176-20200
ISSN:0920-8542
Container-title:The Journal of Supercomputing
language:en
Short-container-title:J Supercomput

Author:

Guo Hengliang,Wang Haolei,Chen Wanting,Zhang Congxiang,Han Yubo,Zhu Shengguang,Zhang Dujuan,Guo Yang,Shang Jiandong,Wan Tao,Li Qingyang,Wu Gang

Abstract

Sparse general matrix–matrix multiplication (SpGEMM) is a crucial and complex computational task in many practical applications. Improving the performance of SpGEMM on SIMT processors such as modern GPUs is challenging because the sparsity patterns of the input matrices are unpredictable. Although existing GPU solutions have improved performance through advanced algorithm design, they overlook some optimizations tied to specific processor architectures, which can leave parts of their implementations inefficient. This paper focuses on optimizing four inefficient parts of the NSparse algorithm on the DCU (a GPU-like accelerator). The optimizations are: 1) setting parameters to improve the load balance of the second matrix by extracting maximum row information at runtime; 2) reducing the overhead of binning operations by making full use of registers and shared memory; 3) improving numerical SpGEMM performance by adjusting its calculation mode; and 4) enhancing global load balance through finer-grained grouping and kernel configurations. Experimental results demonstrate that, compared with five state-of-the-art SpGEMM algorithms (bhSparse, KokkosKernels, NSparse, rocSparse, and spECK), our optimized method achieves average speedups of 7.99x (up to 18.2x), 8.01x (up to 20.83x), 2.37x (up to 6.16x), 1.82x (up to 4.20x), and 1.63x (up to 5.01x), respectively, on 29 sparse matrices with different sparse structures.
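For readers unfamiliar with the terms in the abstract, the following is a minimal CPU-side sketch of row-wise SpGEMM on CSR inputs, together with the kind of per-row work estimate and binning step that hash-based GPU methods typically use for load balancing. It is not the paper's DCU implementation: the function names, the dictionary standing in for a per-row hash table, and the bin thresholds are all assumptions chosen for illustration.

```python
from bisect import bisect_right


def spgemm_csr(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val):
    """Row-wise SpGEMM, C = A * B, with all matrices in CSR form."""
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(len(a_ptr) - 1):
        acc = {}  # stands in for the per-row hash table a GPU kernel would use
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a = a_idx[p], a_val[p]
            # Scale row k of B by A[i, k] and accumulate into row i of C.
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0.0) + a * b_val[q]
        for j in sorted(acc):
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val


def intermediate_products(a_ptr, a_idx, b_ptr):
    """Per-row upper bound on nnz(C): total length of the B rows each A row touches."""
    return [
        sum(b_ptr[k + 1] - b_ptr[k] for k in a_idx[a_ptr[i]:a_ptr[i + 1]])
        for i in range(len(a_ptr) - 1)
    ]


def bin_rows(counts, thresholds=(2, 8, 32)):
    """Group row ids by estimated work. The thresholds are hypothetical;
    bin b holds rows with thresholds[b-1] <= count < thresholds[b]."""
    bins = [[] for _ in range(len(thresholds) + 1)]
    for row, count in enumerate(counts):
        bins[bisect_right(thresholds, count)].append(row)
    return bins


# A = [[1, 0, 2], [0, 0, 0], [0, 3, 0]],  B = [[0, 1, 0], [4, 0, 0], [5, 0, 6]]
A = ([0, 2, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
B = ([0, 1, 2, 4], [1, 0, 0, 2], [1.0, 4.0, 5.0, 6.0])

work = intermediate_products(A[0], A[1], B[0])
groups = bin_rows(work)
C = spgemm_csr(*A, *B)
```

In a GPU setting each bin would usually be launched with its own kernel configuration (for example, more threads and a larger hash table for heavier rows), which is the general idea behind the grouping and binning steps the abstract mentions.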

Funder

Major Science and Technology Special Projects in Henan Province

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s11227-024-06234-2.pdf
