Abstract
When implementing a function mapping on a contemporary GPU, several conflicting performance factors that affect how the computation is distributed among GPU kernels have to be balanced. A decomposition-fusion scheme decomposes the computational problem into several simple functions implemented as standalone kernels and later fuses some of these functions into more complex kernels to improve memory locality. This paper presents a prototype source-to-source compiler that automates the fusion phase and experimentally evaluates both the impact of the fusions it generates and the efficiency of the compiler itself.
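The idea behind fusion can be illustrated with a minimal sketch. It is a CPU-side analogy in plain Python, not the compiler's output or the paper's code: the functions `scale` and `add` and the drivers `unfused` and `fused` are hypothetical names chosen for illustration, and each list comprehension stands in for one GPU kernel launch.

```python
# Hypothetical element-wise functions; under the decomposition-fusion scheme
# each would start out as its own standalone kernel.
def scale(x, a):
    return a * x


def add(x, y):
    return x + y


def unfused(xs, ys, a):
    # "Kernel" 1: writes an intermediate array
    # (on a GPU this would be a round trip through global memory).
    tmp = [scale(x, a) for x in xs]
    # "Kernel" 2: reads the intermediate array back.
    return [add(t, y) for t, y in zip(tmp, ys)]


def fused(xs, ys, a):
    # One fused "kernel": the intermediate value is consumed immediately
    # and is never stored to an array (on a GPU it could stay in a register).
    return [add(scale(x, a), y) for x, y in zip(xs, ys)]
```

Both versions compute the same result; the fused one avoids materializing the intermediate array, which is the memory-locality benefit the abstract refers to.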
Publisher
Association for Computing Machinery (ACM)
Cited by
15 articles.
1. T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives;Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2;2024-04-27
2. Fused GEMMs towards an efficient GPU implementation of the ADER‐DG method in SeisSol;Concurrency and Computation: Practice and Experience;2024-02-13
3. gSampler: General and Efficient GPU-based Graph Sampling for Graph Learning;Proceedings of the 29th Symposium on Operating Systems Principles;2023-10-23
4. Tale of Two Cs: Computation vs. Communication Scaling for Future Transformers on Future Hardware;2023 IEEE International Symposium on Workload Characterization (IISWC);2023-10-01
5. Demystifying BERT: System Design Implications;2022 IEEE International Symposium on Workload Characterization (IISWC);2022-11