Affiliation:
1. Pacific Northwest National Lab, Richland, WA, USA
2. University of Copenhagen, Copenhagen, Denmark
3. College of William and Mary, Williamsburg, VA, USA
4. Technische Universität Dresden, Dresden, Germany
5. Eindhoven University of Technology, Eindhoven, Netherlands
Abstract
Caches are designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly shared L2, which has a long access latency, while the in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has long been ignored but has performance-boosting potential: inter-CTA locality. Exploiting such locality is challenging due to unclear hardware feasibility, an unknown and inaccessible underlying CTA scheduler, and small in-core cache capacity. To address these issues, we first conduct a thorough empirical exploration on various modern GPUs and demonstrate that inter-CTA locality can be harvested, both spatially and temporally, on the L1 or L1/Tex unified cache. Through further quantification, we show the significance and commonality of such locality among GPU applications, and discuss whether such reuse is exploitable. Leveraging these insights, we propose the concept of CTA-Clustering and its associated software-based techniques, which reshape the default CTA scheduling so that CTAs with potential reuse are grouped together on the same SM. Our techniques require no hardware modification and can be deployed directly on existing GPUs. In addition, we incorporate these techniques into an integrated framework for automatic inter-CTA locality optimization. We evaluate our techniques using a wide range of popular GPU applications on all modern generations of NVIDIA GPU architectures. The results show that our proposed techniques significantly improve cache performance by reducing L2 cache transactions by 55%, 65%, 29%, and 28% on average for Fermi, Kepler, Maxwell, and Pascal, respectively, leading to average performance speedups of 1.46x, 1.48x, 1.45x, and 1.41x (up to 3.8x, 3.6x, 3.1x, and 3.3x) for applications with algorithm-related inter-CTA reuse.
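The idea of grouping CTAs with shared data onto one SM can be illustrated with a toy model. The sketch below is not the paper's implementation; it assumes a hypothetical round-robin hardware scheduler (the abstract notes the real scheduler is unknown), an unbounded per-SM cache, and a workload in which every `ctas_per_tile` consecutive logical tasks read the same data tile. All function names are illustrative. It counts how many distinct tiles each SM has to fetch, with and without remapping hardware CTA ids to logical tasks.

```python
from collections import defaultdict


def round_robin_sm(cta_id, num_sms):
    """Assumed default scheduler: hardware CTA i is placed on SM i % num_sms."""
    return cta_id % num_sms


def identity_task(cta_id, num_sms, num_ctas):
    """Baseline: the hardware CTA id is used directly as the logical task id."""
    return cta_id


def clustered_task(cta_id, num_sms, num_ctas):
    """Remap so that each SM works on one contiguous chunk of logical tasks.

    Assumes num_ctas is divisible by num_sms.
    """
    sm = cta_id % num_sms        # SM this CTA lands on under round-robin
    slot = cta_id // num_sms     # how many CTAs that SM received before this one
    per_sm = num_ctas // num_sms
    return sm * per_sm + slot


def distinct_tile_fetches(num_ctas, num_sms, ctas_per_tile, remap):
    """Sum over SMs of the number of distinct data tiles touched on that SM."""
    tiles_on_sm = defaultdict(set)
    for cta_id in range(num_ctas):
        sm = round_robin_sm(cta_id, num_sms)
        task = remap(cta_id, num_sms, num_ctas)
        tiles_on_sm[sm].add(task // ctas_per_tile)
    return sum(len(tiles) for tiles in tiles_on_sm.values())


if __name__ == "__main__":
    baseline = distinct_tile_fetches(16, 4, 4, identity_task)
    clustered = distinct_tile_fetches(16, 4, 4, clustered_task)
    print(baseline)   # 16: every SM touches all four tiles
    print(clustered)  # 4: every SM touches exactly one tile
```

In this toy setting the remapping is a permutation of the task ids, so the same work is done; only the placement changes. Real GPUs have small L1 caches and an undocumented scheduler, so the reduction seen here is an upper bound for the model, not a prediction of measured behavior.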
Funder
HiPEAC Collaboration Grants
Department of Energy
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Graphics and Computer-Aided Design, Software
Cited by
7 articles.
1. LATOA: Load-Aware Task Offloading and Adoption in GPU;Proceedings of the 15th Workshop on General Purpose Processing Using GPU;2023-02-25
2. Analyzing Data Locality on GPU Caches Using Static Profiling of Workloads;IEEE Access;2023
3. Locality-Aware CTA Scheduling for Gaming Applications;ACM Transactions on Architecture and Code Optimization;2022-03-31
4. SV-sim;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2021-11-13
5. Quantifying the NUMA Behavior of Partitioned GPGPU Applications;Proceedings of the 12th Workshop on General Purpose Processing Using GPUs - GPGPU '19;2019