Cross-core Data Sharing for Energy-efficient GPUs

Authors:

Hajar Falahati (1), Mohammad Sadrosadati (2), Qiumin Xu (3), Juan Gómez-Luna (4), Banafsheh Saber Latibari (5), Hyeran Jeon (6), Shaahin Hesaabi (5), Hamid Sarbazi-Azad (7), Onur Mutlu (8), Murali Annavaram (3), Masoud Pedram (3)

Affiliations:

1. Sharif University of Technology, School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran

2. School of Computer Science, IPM, Tehran, Iran

3. University of Southern California, Los Angeles, USA

4. ETH Zürich, Zürich, Switzerland

5. Sharif University of Technology, Tehran, Iran

6. San José State University, San Jose, USA

7. Sharif University of Technology, School of Computer Science, IPM, Tehran, Iran

8. ETH Zürich, Carnegie Mellon University, Zürich, Switzerland

Abstract

Graphics Processing Units (GPUs) are the accelerator of choice in a variety of application domains, because they can accelerate massively parallel workloads and can be easily programmed using general-purpose programming frameworks such as CUDA and OpenCL. Each Streaming Multiprocessor (SM) contains an L1 data cache (L1D) to exploit the locality in data accesses. L1D misses are costly for GPUs for two reasons. First, L1D misses consume a lot of energy as they need to access the L2 cache (L2) via an on-chip network and the off-chip DRAM in case of L2 misses. Second, L1D misses impose performance overhead if the GPU does not have enough active warps to hide the long memory access latency. We observe that threads running on different SMs share 55% of the data they read from the memory. Unfortunately, as the L1Ds are in the non-coherent memory domain, each SM independently fetches data from the L2 or the off-chip memory into its L1D, even though the data may be currently available in the L1D of another SM. Our goal is to service L1D read misses via other SMs, as much as possible, to cut down costly accesses to the L2 or the off-chip DRAM. To this end, we propose a new data-sharing mechanism, called Cross-Core Data Sharing (CCDS). CCDS employs a predictor to estimate whether the required cache block exists in another SM. If the block is predicted to exist in another SM's L1D, then CCDS fetches the data from the L1D that contains the block. Our experiments on a suite of 26 workloads show that CCDS improves average energy and performance by 1.30× and 1.20×, respectively, compared to the baseline GPU. Compared to the state-of-the-art data-sharing mechanism, CCDS improves average energy and performance by 1.37× and 1.11×, respectively.
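The abstract's core idea can be sketched in a few lines: on an L1D read miss, a predictor guesses whether another SM's L1D already holds the block; if so, the request is serviced cross-core instead of going to L2 or DRAM. The predictor below (a table mapping block addresses to the set of SMs that previously filled them, with no eviction tracking, so it can mispredict on evicted blocks) is an illustrative assumption for this sketch, not the paper's actual predictor design.

```python
# Illustrative sketch of the CCDS routing decision described in the abstract.
# Assumption (not from the paper): the predictor is a simple table of
# block address -> set of SM ids believed to cache the block. Real hardware
# would need eviction tracking and misprediction recovery, omitted here.

class CCDSPredictor:
    def __init__(self):
        self.sharers = {}  # block address -> set of SM ids that filled it

    def record_fill(self, sm_id, block_addr):
        """Note that sm_id has brought block_addr into its L1D."""
        self.sharers.setdefault(block_addr, set()).add(sm_id)

    def predict_remote_holder(self, sm_id, block_addr):
        """On an L1D miss in sm_id, return some other SM predicted to hold
        the block, or None to fall back to L2/DRAM."""
        others = self.sharers.get(block_addr, set()) - {sm_id}
        return next(iter(others)) if others else None


def service_read_miss(pred, sm_id, block_addr):
    """Route an L1D read miss: cross-core if a remote holder is predicted,
    otherwise to the L2/DRAM path. Either way the block now fills sm_id's L1D."""
    holder = pred.predict_remote_holder(sm_id, block_addr)
    source = f"cross-core from SM{holder}" if holder is not None else "L2/DRAM"
    pred.record_fill(sm_id, block_addr)
    return source
```

For example, if SM0 misses on a block first, it fetches from L2/DRAM; when SM1 later misses on the same shared block, the predictor routes the request cross-core to SM0's L1D, which is the energy saving the paper targets.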

Publisher

Association for Computing Machinery (ACM)

References: 117 articles.
