Quantifying Data Locality in Dynamic Parallelism in GPUs

Authors:

Xulong Tang¹, Ashutosh Pattnaik¹, Onur Kayiran², Adwait Jog³, Mahmut Taylan Kandemir¹, Chita R. Das¹

Affiliations:

1. Pennsylvania State University, State College, PA, USA

2. Advanced Micro Devices, Inc., Santa Clara, CA, USA

3. College of William and Mary, Williamsburg, VA, USA

Abstract

GPUs are becoming prevalent in various domains of computing and are widely used for streaming (regular) applications. However, they are highly inefficient when executing irregular applications with unstructured inputs, due to load imbalance. Dynamic parallelism (DP) is a new feature of emerging GPUs that allows new kernels to be generated and scheduled from the device side (GPU), without host-side (CPU) intervention, to increase parallelism. A major challenge in supporting DP efficiently is saturating the GPU processing elements and providing them with the required data in a timely fashion. There have been considerable efforts focused on exploiting data locality in GPUs, but there is a lack of quantitative analysis of how irregular applications using dynamic parallelism behave in terms of data reuse. In this paper, we quantitatively analyze the data reuse of dynamic applications at three granularities of schedulable units: kernel, work-group, and wavefront. We observe that, for DP applications, data reuse is highly irregular and heavily dependent on the application and its input. Thus, existing techniques cannot exploit data reuse effectively for DP applications. To this end, we first conduct a limit study on the performance improvements achievable by hardware schedulers that are provided with accurate data reuse information. This limit study shows that such schedulers improve performance by 19.4% on average over the baseline scheduler. Based on the key observations from the quantitative analysis of our DP applications, we then propose LASER, a Locality-Aware SchedulER, in which the hardware schedulers employ data reuse monitors to inform scheduling decisions and improve data locality at runtime. Our experimental results on 16 benchmarks show that LASER improves performance by 11.3% on average.
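
To make the device-side launch mechanism concrete, below is a minimal CUDA dynamic-parallelism sketch. It is not taken from the paper (which uses OpenCL/AMD terminology, i.e., work-groups and wavefronts); the kernel names, the ragged-row data layout, and the launch parameters are illustrative assumptions. Each thread of the parent kernel owns one row of an irregular data set and launches a child grid sized to that row directly from the GPU, with no host round trip:

#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: doubles the elements of one row.
__global__ void childKernel(int *data, int rowStart, int rowLen) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < rowLen)
        data[rowStart + i] *= 2;
}

// Parent kernel: each thread launches a child grid sized to its own row,
// so short rows no longer wait behind the longest row (the load-imbalance
// problem that dynamic parallelism targets).
__global__ void parentKernel(int *data, const int *rowStart,
                             const int *rowLen, int numRows) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r < numRows && rowLen[r] > 0) {
        int threads = 128;
        int blocks  = (rowLen[r] + threads - 1) / threads;
        childKernel<<<blocks, threads>>>(data, rowStart[r], rowLen[r]); // device-side launch
    }
}

int main() {
    // Two ragged rows (lengths 3 and 5) packed back to back.
    int h_data[8]  = {1, 2, 3, 4, 5, 6, 7, 8};
    int h_start[2] = {0, 3};
    int h_len[2]   = {3, 5};

    int *d_data, *d_start, *d_len;
    cudaMalloc(&d_data, sizeof(h_data));
    cudaMalloc(&d_start, sizeof(h_start));
    cudaMalloc(&d_len, sizeof(h_len));
    cudaMemcpy(d_data, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
    cudaMemcpy(d_start, h_start, sizeof(h_start), cudaMemcpyHostToDevice);
    cudaMemcpy(d_len, h_len, sizeof(h_len), cudaMemcpyHostToDevice);

    parentKernel<<<1, 32>>>(d_data, d_start, d_len, 2);
    cudaDeviceSynchronize();

    cudaMemcpy(h_data, d_data, sizeof(h_data), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 8; ++i)
        printf("%d ", h_data[i]);
    printf("\n");

    cudaFree(d_data); cudaFree(d_start); cudaFree(d_len);
    return 0;
}

Device-side launches require a GPU with compute capability 3.5 or higher and relocatable device code, e.g. nvcc -arch=sm_60 -rdc=true -lcudadevrt dp.cu. Note that child kernels launched by neighboring parent threads operate on adjacent rows; this is the kind of parent-child and sibling data-sharing pattern that a locality-aware scheduler such as LASER could exploit.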

Funder

National Science Foundation

Publisher

Association for Computing Machinery (ACM)



Cited by 3 articles:

1. Analyzing Data Locality on GPU Caches Using Static Profiling of Workloads. IEEE Access, 2023.

2. A Compiler Framework for Optimizing Dynamic Parallelism on GPUs. 2022 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2022-04-02.

3. Computing with Near Data. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2018-12-21.
