Affiliation:
1. Northeastern University, Boston, MA, USA
Abstract
Graphics Processing Units (GPUs) have become an attractive platform for accelerating challenging applications on a range of platforms, from High Performance Computing (HPC) to full-featured smartphones. They can overcome computational barriers in a wide range of data-parallel kernels. GPUs hide pipeline stalls and memory latency by utilizing efficient thread preemption. But given the demands on the memory hierarchy due to the growth in the number of computing cores on-chip, it has become increasingly difficult to hide all of these stalls.
In this article, we propose a novel Hint-Assisted Wavefront Scheduler (HAWS) to bypass long-latency stalls. HAWS starts by enhancing a compiler infrastructure to identify potential opportunities to bypass memory stalls. HAWS includes a wavefront scheduler that can continue to execute instructions in the shadow of a memory stall, executing instructions speculatively, guided by compiler-generated hints. HAWS increases utilization of GPU resources by aggressively fetching and executing instructions speculatively. Based on our simulation results on the AMD Southern Islands GPU architecture, at an estimated cost of 0.4% of total chip area, HAWS can improve application performance by 14.6% on average for memory-intensive applications.
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
9 articles.
1. GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29
2. Simple Out of Order Core for GPGPUs;Proceedings of the 15th Workshop on General Purpose Processing Using GPU;2023-02-25
3. SIMR: Single Instruction Multiple Request Processing for Energy-Efficient Data Center Microservices;2022 55th IEEE/ACM International Symposium on Microarchitecture (MICRO);2022-10
4. A Fine-grained Prefetching Scheme for DGEMM Kernels on GPU with Auto-tuning Compatibility;2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS);2022-05
5. Repurposing GPU Microarchitectures with Light-Weight Out-Of-Order Execution;IEEE Transactions on Parallel and Distributed Systems;2022-02-01