Affiliation:
1. Extreme Computing Research Center, King Abdullah University of Science and Technology, Thuwal, Kingdom of Saudi Arabia
Abstract
Batched dense linear algebra kernels are becoming ubiquitous in scientific applications, ranging from tensor contractions in deep learning to data compression in hierarchical low-rank matrix approximation. Within a single API call, these kernels are capable of simultaneously launching up to thousands of similar matrix computations, removing the expensive overhead of multiple API calls while increasing the occupancy of the underlying hardware. A challenge is that for the existing hardware landscape (x86, GPUs, etc.), only a subset of the required batched operations is implemented by the vendors, with limited support for very small problem sizes. We describe the design and performance of a new class of batched triangular dense linear algebra kernels on very small data sizes (up to 256) using single and multiple GPUs. By deploying recursive formulations, stressing the register usage, maintaining data locality, reducing thread synchronization, and fusing successive kernel calls, the new batched kernels outperform existing state-of-the-art implementations.
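To illustrate the batched-API idea described in the abstract (one call launching many small, independent triangular operations instead of one call per matrix), here is a minimal sketch using the standard cuBLAS batched triangular solve. It is not the paper's KBLAS implementation; the function name batched_trsm, the array-of-pointers layout, and the problem sizes are illustrative assumptions.

```c
/* Minimal sketch (assumed example, not the paper's kernels): solve
 * A_i * X_i = B_i for a batch of small lower-triangular systems with a
 * single cuBLAS API call, rather than `batch` separate cublasStrsm calls. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void batched_trsm(cublasHandle_t handle,
                  const float *const *dA, /* device array of `batch` pointers to n-by-n lower-triangular A_i */
                  float *const *dB,       /* device array of `batch` pointers to n-by-nrhs right-hand sides B_i */
                  int n, int nrhs, int batch)
{
    const float alpha = 1.0f;
    /* One launch covers all `batch` problems, amortizing call overhead
     * and keeping the GPU occupied even when each matrix is tiny. */
    cublasStrsmBatched(handle,
                       CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_LOWER,
                       CUBLAS_OP_N, CUBLAS_DIAG_NON_UNIT,
                       n, nrhs, &alpha,
                       dA, n,   /* lda = n */
                       dB, n,   /* ldb = n */
                       batch);
}
```

The paper's kernels target the same use case (batches of matrices of size up to 256) but replace the vendor routine with recursive, register-resident formulations and fused kernel calls.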
Publisher
Association for Computing Machinery (ACM)
Subject
Applied Mathematics, Software
Cited by
8 articles.
1. Cache Optimization and Performance Modeling of Batched, Small, and Rectangular Matrix Multiplication on Intel, AMD, and Fujitsu Processors;ACM Transactions on Mathematical Software;2023-09-19
2. Using Additive Modifications in LU Factorization Instead of Pivoting;Proceedings of the 37th International Conference on Supercomputing;2023-06-21
3. Batched LU Factorization With Fast Row Interchanges for Small Matrices on GPUs;2022 IEEE 24th Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, Cloud & Big Data Systems & Application (HPCC/DSS/SmartCity/DependSys);2022-12
4. Parallel Solution of Small and Medium Sized Linear Equations Based on GPU;2022 2nd International Conference on Electronic Information Technology and Smart Agriculture (ICEITSA);2022-12
5. High performance sparse multifrontal solvers on modern GPUs;Parallel Computing;2022-05