Affiliation:
1. University of Colorado
2. Indiana University
3. University of Oregon
Abstract
Scientific programmers often turn to vendor-tuned Basic Linear Algebra Subprograms (BLAS) to obtain portable high performance. However, many numerical algorithms require several BLAS calls in sequence, and those successive calls do not achieve optimal performance. The entire sequence needs to be optimized in concert. Instead of vendor-tuned BLAS, a programmer could start with source code in Fortran or C (e.g., based on the Netlib BLAS) and use a state-of-the-art optimizing compiler. However, our experiments show that optimizing compilers often attain only one-quarter of the performance of hand-optimized code. In this article, we present a domain-specific compiler for matrix kernels, the Build to Order BLAS (BTO), that reliably achieves high performance using a scalable search algorithm for choosing the best combination of loop fusion, array contraction, and multithreading for data parallelism. The BTO compiler generates code that is between 16% slower and 39% faster than hand-optimized code.
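To make the abstract's central point concrete, the sketch below contrasts two back-to-back BLAS-style steps with a version that applies loop fusion and array contraction. It is only an illustration in plain Python. It is not code generated by BTO, and the function names `unfused` and `fused` are hypothetical. The computation, alpha = (Ax)·(Ax), is an assumed example chosen because it normally takes one matrix-vector product followed by one dot product.

```python
def unfused(A, x):
    """Two separate steps, as with two successive BLAS calls."""
    # Step 1 (GEMV-like): t = A x, written out to a full temporary vector.
    t = [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]
    # Step 2 (DOT-like): alpha = t . t, which reads the temporary back in.
    return sum(t_i * t_i for t_i in t)


def fused(A, x):
    """Same result with the two loops fused and the temporary contracted."""
    alpha = 0.0
    for row in A:
        # Array contraction: the vector t shrinks to a single scalar t_i.
        t_i = 0.0
        for a_ij, x_j in zip(row, x):
            t_i += a_ij * x_j
        # Loop fusion: the dot-product update happens in the same row loop.
        alpha += t_i * t_i
    return alpha


A = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
assert unfused(A, x) == fused(A, x)
```

In the fused version the intermediate vector is never stored, so each row's result is consumed while it is still in a register. Optimizing the two calls separately cannot remove that memory traffic.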
Publisher
Association for Computing Machinery (ACM)
Subject
Applied Mathematics, Software
Cited by
8 articles.
1. A Tensor Algebra Compiler for Sparse Differentiation;2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO);2024-03-02
2. Optimizing Tensor Programs on Flexible Storage;Proceedings of the ACM on Management of Data;2023-05-26
3. EGGS: Sparsity‐Specific Code Generation;Computer Graphics Forum;2020-08
4. SPIRAL: Extreme Performance Portability;Proceedings of the IEEE;2018-11
5. Format abstraction for sparse tensor algebra compilers;Proceedings of the ACM on Programming Languages;2018-10-24