Affiliation:
1. Imperial College London, London, United Kingdom
Abstract
Thread coarsening on GPUs combines the work of several threads into one. We show how thread coarsening can be implemented as a fully automated compile-time optimisation that estimates the optimal coarsening factor based on a low-cost, approximate static analysis of cache line re-use and an occupancy prediction model. We evaluate two coarsening strategies on three different NVIDIA GPU architectures. For NVIDIA reduction kernels we achieve a maximum speedup of 5.08x, and for the Rodinia benchmarks we achieve a mean speedup of 1.30x over 8 of 19 kernels that were determined safe to coarsen.
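As a rough illustration of what "combining the work of several threads into one" means, the sketch below simulates a one-dimensional GPU launch in plain Python and compares a baseline kernel with a coarsened one. It is not the paper's compiler pass: the SAXPY kernel, the `launch` helper, the strided index mapping, and all names are assumptions chosen for illustration, and the paper's static analysis for picking the coarsening factor is not modelled here.

```python
def launch(kernel, num_threads, *args):
    """Simulate a 1-D GPU launch by running the kernel once per thread id."""
    for tid in range(num_threads):
        kernel(tid, *args)


def saxpy(tid, a, x, y, out):
    """Baseline kernel: each simulated thread computes one output element."""
    if tid < len(x):
        out[tid] = a * x[tid] + y[tid]


def saxpy_coarsened(tid, factor, num_threads, a, x, y, out):
    """Coarsened kernel: each simulated thread computes `factor` elements.

    Elements are strided by the launch size, so neighbouring threads still
    touch neighbouring elements in each iteration (one possible mapping).
    """
    for k in range(factor):
        i = tid + k * num_threads
        if i < len(x):  # guard the tail when n is not a multiple of factor
            out[i] = a * x[i] + y[i]


n = 10
factor = 4  # hypothetical coarsening factor, fixed by hand here
a = 2.0
x = [float(i) for i in range(n)]
y = [1.0] * n

# Baseline: one thread per element.
baseline = [0.0] * n
launch(saxpy, n, a, x, y, baseline)

# Coarsened: ceil(n / factor) threads, each doing the work of up to `factor`.
coarse_threads = (n + factor - 1) // factor
coarsened = [0.0] * n
launch(saxpy_coarsened, coarse_threads, factor, coarse_threads, a, x, y, coarsened)
```

Both launches produce the same output, but the coarsened one uses fewer threads that each do more work. On real hardware this trades parallelism and occupancy against per-thread overhead and data re-use, which is why the choice of factor matters.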
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
6 articles.
1. GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29
2. Retargeting and Respecializing GPU Workloads for Performance Portability;2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO);2024-03-02
3. Future aware Dynamic Thermal Management in CPU-GPU Embedded Platforms;2022 IEEE Real-Time Systems Symposium (RTSS);2022-12
4. A Compiler Framework for Optimizing Dynamic Parallelism on GPUs;2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO);2022-04-02
5. Exploring Thread Coarsening on FPGA;2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC);2021-12