Affiliation:
1. COMPUTER SCIENCE DIVISION, DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, UNIVERSITY OF CALIFORNIA AT BERKELEY, BERKELEY, CA 94720, USA
2. COMPUTER SCIENCE DIVISION, DEPARTMENT OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, AND DEPARTMENT OF MATHEMATICS, UNIVERSITY OF CALIFORNIA AT BERKELEY, BERKELEY, CA 94720, USA
3. DEPARTMENT OF ELECTRICAL ENGINEERING, UNIVERSITY OF WASHINGTON, SEATTLE, WA, USA
Abstract
Achieving peak performance from the computational kernels that dominate application performance often requires extensive machine-dependent tuning by hand. Automatic tuning systems have emerged in response, and they typically operate by (1) generating a large number of possible, reasonable implementations of a kernel, and (2) selecting the fastest implementation by a combination of heuristic modeling, heuristic pruning, and empirical search (i.e., actually running the code). This paper presents quantitative data that motivate the development of such a search-based system, using dense matrix multiply as a case study. The statistical distributions of performance within spaces of reasonable implementations, when observed on a variety of hardware platforms, lead us to pose and address two general problems which arise during the search process. First, we develop a heuristic for stopping an exhaustive compile-time search early if a near-optimal implementation is found. Second, we show how to construct run-time decision rules, based on run-time inputs, for selecting from among a subset of the best implementations when the space of inputs can be described by continuously varying features. We address both problems by using statistical modeling techniques that exploit the large amount of performance data collected during the search. We demonstrate these methods on actual performance data collected by the PHiPAC tuning system for dense matrix multiply. We close with a survey of recent projects that use or otherwise advocate an empirical search-based approach to code generation and algorithm selection, whether at the level of computational kernels, compiler and run-time systems, or problem-solving environments. Collectively, these efforts suggest a number of possible software architectures for constructing platform-adapted libraries and applications.
Subject
Hardware and Architecture, Theoretical Computer Science, Software
Cited by
52 articles.
1. Application Performance Modeling via Tensor Completion;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2023-11-11
2. Optimal Launch Bound Selection in CPU-GPU Hybrid Graph Applications with Deep Learning;2022 IEEE 13th International Green and Sustainable Computing Conference (IGSC);2022-10-24
3. Characterizing Input-sensitivity in Tightly-Coupled Collaborative Graph Algorithms;2021 IEEE/ACM 21st International Symposium on Cluster, Cloud and Internet Computing (CCGrid);2021-05
4. Noise-Resilient Empirical Performance Modeling with Deep Neural Networks;2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS);2021-05
5. Extracting clean performance models from tainted programs;Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming;2021-02-17