Affiliation:
1. University of California, Riverside, USA
Abstract
The massive parallelism present in GPUs comes at the cost of reduced L1 and L2 cache sizes per thread, leading to serious cache contention problems such as thrashing. Hence, the data access locality of an application should be considered during thread scheduling to improve execution time and energy consumption. Recent works have tried to use the locality behavior of regular and structured applications in thread scheduling, but the difficult case of irregular and unstructured parallel applications remains to be explored.
We present PAVER, a Priority-Aware Vertex schedulER, which takes a graph-theoretic approach toward thread scheduling. We analyze the cache locality behavior among thread blocks (TBs) through just-in-time compilation, and represent the problem as a graph whose vertices are the TBs and whose edges capture the locality among them. This graph is then partitioned into TB groups that display maximum data sharing, and the locality-aware TB scheduler assigns each group to the same streaming multiprocessor. Through exhaustive simulation on the Fermi, Pascal, and Volta architectures using a number of scheduling techniques, we show that PAVER reduces L2 accesses by 43.3%, 48.5%, and 40.21% and increases the average performance benefit by 29%, 49.1%, and 41.2%, respectively, for the benchmarks with high inter-TB locality.
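The pipeline described in the abstract (build a TB locality graph, partition it into groups with high data sharing, map each group to one streaming multiprocessor) can be illustrated with a minimal Python sketch. This is not PAVER's actual partitioning algorithm, which the abstract does not specify: the greedy grouping below is a simple stand-in, and the function names, the `tb_lines` input (a hypothetical map from TB id to the set of cache lines it touches), the fixed `group_size`, and the round-robin SM assignment are all assumptions made for illustration.

```python
from itertools import combinations


def build_locality_graph(tb_lines):
    """Return {(a, b): weight} where weight is the number of cache lines
    that thread blocks a and b both touch (a < b). Zero-weight edges are omitted."""
    graph = {}
    for a, b in combinations(sorted(tb_lines), 2):
        shared = len(tb_lines[a] & tb_lines[b])
        if shared:
            graph[(a, b)] = shared
    return graph


def edge_weight(graph, a, b):
    """Look up the sharing weight between two TBs (0 if they share nothing)."""
    return graph.get((min(a, b), max(a, b)), 0)


def greedy_groups(tb_lines, graph, group_size):
    """Greedy stand-in for graph partitioning: seed a group with the lowest
    unassigned TB id, then repeatedly add the unassigned TB that shares the
    most cache lines with the TBs already in the group."""
    unassigned = sorted(tb_lines)
    groups = []
    while unassigned:
        group = [unassigned.pop(0)]
        while len(group) < group_size and unassigned:
            best = max(
                unassigned,
                key=lambda tb: sum(edge_weight(graph, tb, m) for m in group),
            )
            unassigned.remove(best)
            group.append(best)
        groups.append(group)
    return groups


def assign_to_sms(groups, num_sms):
    """Map every TB of a group to the same SM, cycling groups over the SMs."""
    return {tb: i % num_sms for i, group in enumerate(groups) for tb in group}


# Hypothetical access sets: TBs 0 and 2 overlap, as do TBs 1 and 3.
tb_lines = {0: {1, 2, 3}, 1: {10, 11}, 2: {2, 3, 4}, 3: {11, 12}}
graph = build_locality_graph(tb_lines)
groups = greedy_groups(tb_lines, graph, group_size=2)
placement = assign_to_sms(groups, num_sms=2)
```

In this toy input, the TBs that share cache lines end up in the same group and therefore on the same SM, which is the effect the locality-aware scheduler aims for. A real implementation would derive the access sets from compiler analysis and use a proper graph partitioner rather than this greedy heuristic.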
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
18 articles.
1. Randomized Testing Framework for Dissecting NVIDIA GPGPU Thread Block-To-SM Scheduling Mechanisms;2023 IEEE 29th International Conference on Parallel and Distributed Systems (ICPADS);2023-12-17
2. Global Store Statement Aggregation;2023 IEEE 14th International Symposium on Parallel Architectures, Algorithms and Programming (PAAP);2023-11-24
3. Taming data locality for task scheduling under memory constraint in runtime systems;Future Generation Computer Systems;2023-06
4. L2 Cache Access Pattern Analysis using Static Profiling of an Application;2023 IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC);2023-06
5. OptiCPD: Optimization For The Canonical Polyadic Decomposition Algorithm on GPUs;2023 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW);2023-05