Affiliation:
1. Pennsylvania State University, State College, PA, USA
2. Advanced Micro Devices, Inc., Santa Clara, CA, USA
3. College of William and Mary, Williamsburg, VA, USA
Abstract
GPUs are becoming prevalent in various domains of computing and are widely used for streaming (regular) applications. However, they are highly inefficient when executing irregular applications with unstructured inputs due to load imbalance. Dynamic parallelism (DP) is a new feature of emerging GPUs that allows new kernels to be generated and scheduled from the device side (GPU), without host-side (CPU) intervention, to increase parallelism. To efficiently support DP, one of the major challenges is to saturate the GPU processing elements and provide them with the required data in a timely fashion. There have been considerable efforts focusing on exploiting data locality in GPUs. However, there is a lack of quantitative analysis of how irregular applications using dynamic parallelism behave in terms of data reuse. In this paper, we quantitatively analyze the data reuse of dynamic applications at three different granularities of schedulable units: kernel, work-group, and wavefront. We observe that, for DP applications, data reuse is highly irregular and heavily dependent on the application and its input; thus, existing techniques cannot exploit data reuse effectively for DP applications. To this end, we first conduct a limit study of the performance improvements achievable by hardware schedulers that are provided with accurate data reuse information. This limit study shows that, on average, performance improves by 19.4% over the baseline scheduler. Based on the key observations from the quantitative analysis of our DP applications, we then propose LASER, a Locality-Aware SchedulER, in which the hardware schedulers employ data reuse monitors to help make scheduling decisions that improve data locality at runtime. Our experimental results on 16 benchmarks show that LASER improves performance by 11.3% on average.
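To make the abstract's core idea concrete, the sketch below illustrates the kind of decision a locality-aware scheduler with per-compute-unit data reuse monitors could make. The paper does not publish LASER's implementation, so every name and structure here (ReuseMonitor, WorkGroup, dispatch, the monitor capacity, and the parent-footprint hint) is a hypothetical simulator-style illustration, not the authors' design.

```cpp
// Hypothetical sketch of locality-aware work-group dispatch; all names and
// parameters are assumptions for illustration, not LASER's actual design.
#include <cstdint>
#include <iostream>
#include <unordered_set>
#include <vector>

// Per-compute-unit monitor: remembers a bounded set of recently accessed
// cache-line tags, as an assumed proxy for the data resident on that unit.
struct ReuseMonitor {
    static constexpr size_t kCapacity = 256;  // assumed monitor size
    std::unordered_set<uint64_t> tags;

    void record(uint64_t addr) {
        if (tags.size() >= kCapacity) tags.erase(tags.begin());  // crude eviction
        tags.insert(addr >> 6);  // track at 64-byte cache-line granularity
    }
    // Fraction of a work-group's expected footprint already resident here.
    double overlap(const std::vector<uint64_t>& footprint) const {
        if (footprint.empty()) return 0.0;
        size_t hits = 0;
        for (uint64_t a : footprint) hits += tags.count(a >> 6);
        return static_cast<double>(hits) / footprint.size();
    }
};

// A device-launched child work-group, annotated with addresses its parent
// touched (a locality hint a DP runtime could plausibly forward).
struct WorkGroup {
    int id;
    std::vector<uint64_t> parent_footprint;
};

// Locality-aware dispatch: place the child on the compute unit whose monitor
// reports the highest reuse overlap; fall back to round-robin on zero overlap.
int dispatch(const WorkGroup& wg, const std::vector<ReuseMonitor>& cus,
             int rr_cursor) {
    int best = rr_cursor % static_cast<int>(cus.size());
    double best_score = 0.0;
    for (int cu = 0; cu < static_cast<int>(cus.size()); ++cu) {
        double s = cus[cu].overlap(wg.parent_footprint);
        if (s > best_score) { best_score = s; best = cu; }
    }
    return best;
}

int main() {
    std::vector<ReuseMonitor> cus(4);
    // Pretend compute unit 2 recently streamed addresses 0x1000..0x2000.
    for (uint64_t a = 0x1000; a < 0x2000; a += 64) cus[2].record(a);

    WorkGroup child{42, {}};
    for (uint64_t a = 0x1400; a < 0x1800; a += 64)
        child.parent_footprint.push_back(a);  // overlaps CU 2's resident data

    std::cout << "work-group " << child.id << " -> CU "
              << dispatch(child, cus, 0) << "\n";  // expected: CU 2
}
```

The design point this toy captures is the one the abstract argues for: because data reuse in DP applications is irregular and input-dependent, a static policy cannot exploit it, whereas a runtime monitor lets the scheduler steer a child work-group toward the unit already holding its parent's data.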
Funder
National Science Foundation
Publisher
Association for Computing Machinery (ACM)
Cited by
3 articles.
1. Analyzing Data Locality on GPU Caches Using Static Profiling of Workloads;IEEE Access;2023
2. A Compiler Framework for Optimizing Dynamic Parallelism on GPUs;2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO);2022-04-02
3. Computing with Near Data;Proceedings of the ACM on Measurement and Analysis of Computing Systems;2018-12-21