Thread Batching for High-performance Energy-efficient GPU Memory Design

Published: 2019-10-31 Issue: 4 Volume: 15 Page: 1-21
ISSN: 1550-4832
Container-title: ACM Journal on Emerging Technologies in Computing Systems
Language: en
Short-container-title: J. Emerg. Technol. Comput. Syst.

Author:

Li Bing¹, Mao Mengjie², Liu Xiaoxiao³, Liu Tao⁴, Liu Zihao⁴, Wen Wujie⁴, Chen Yiran⁵, Li Hai (Helen)⁵

Affiliation:

1. Duke University, USA and Army Research Office, Research Triangle Park, USA

2. MathWorks Inc., USA

3. AMD, USA

4. Florida International University, Miami, FL, USA

5. Duke University, Durham, North Carolina, USA

Abstract

Massive multi-threading in GPUs imposes tremendous pressure on memory subsystems. Because thread-level parallelism in GPUs has grown rapidly while peak memory bandwidth has improved slowly, memory has become a bottleneck for GPU performance and energy efficiency. In this article, we propose an integrated architectural scheme that optimizes memory accesses and thereby boosts the performance and energy efficiency of GPUs. First, we propose thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bind each streaming multiprocessor (SM) to dedicated memory banks. TEMP then dispatches the thread batch to an SM to ensure highly parallel memory-access streams from the different thread blocks. Second, we introduce a thread batch-aware scheduling (TBAS) scheme to improve GPU memory access locality and to reduce contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS achieves up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of mixed CPU+GPU workloads when they run on a heterogeneous system that employs our proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications.
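
The grouping and binding steps described in the abstract can be illustrated with a small software sketch. This is not the authors' hardware design: the abstract does not specify how pages are identified, how banks are split among SMs, or how batches are dispatched. The function names, the even bank split, and the round-robin dispatch below are all assumptions made for illustration.

```python
from collections import defaultdict


def group_into_batches(block_pages):
    """Group thread blocks that touch the same set of pages into one batch.

    block_pages is a hypothetical input: a dict mapping a thread-block id
    to the page numbers that block accesses.
    Returns a list of batches, each a sorted list of thread-block ids.
    """
    batches = defaultdict(list)
    for block_id, pages in block_pages.items():
        # Blocks with an identical page set land in the same batch.
        batches[frozenset(pages)].append(block_id)
    return [sorted(ids) for ids in batches.values()]


def banks_for_sm(sm_id, num_sms, num_banks):
    """Return the bank "colors" dedicated to one SM.

    Assumes banks are split evenly and contiguously among SMs; the
    abstract only says each SM is bound to dedicated banks.
    """
    per_sm = num_banks // num_sms
    return list(range(sm_id * per_sm, (sm_id + 1) * per_sm))


def dispatch(batches, num_sms):
    """Map each batch index to an SM round-robin (an assumed policy)."""
    return {index: index % num_sms for index in range(len(batches))}


if __name__ == "__main__":
    block_pages = {0: [1, 2], 1: [2, 1], 2: [3], 3: [3]}
    batches = group_into_batches(block_pages)
    print(batches)
    print(banks_for_sm(1, num_sms=4, num_banks=16))
    print(dispatch(batches, num_sms=2))
```

In this toy input, blocks 0 and 1 share pages {1, 2} and blocks 2 and 3 share page {3}, so two batches are formed and each is sent to a different SM with its own bank range.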

Funder

NRC Associate Fellowship Award

U.S. National Science Foundation

U.S. Department of Energy

Publisher

Association for Computing Machinery (ACM)

Subject

Electrical and Electronic Engineering, Hardware and Architecture, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3330152

Cited by 1 article.

1. POTDP: Research GPU Performance Optimization Method based on Thread Dynamic Programming; 2022 IEEE 4th International Conference on Power, Intelligent Computing and Systems (ICPICS); 2022-07-29