A Hierarchical Thread Scheduler and Register File for Energy-Efficient Throughput Processors

Published:2012-04 Issue:2 Volume:30 Page:1-38
ISSN:0734-2071
Container-title:ACM Transactions on Computer Systems
language:en
Short-container-title:ACM Trans. Comput. Syst.

Author:

Gebhart Mark¹, Johnson Daniel R.², Tarjan David³, Keckler Stephen W.⁴, Dally William J.⁵, Lindholm Erik³, Skadron Kevin⁶

Affiliation:

1. The University of Texas at Austin

2. University of Illinois at Urbana-Champaign

3. NVIDIA

4. NVIDIA and The University of Texas at Austin

5. NVIDIA and Stanford University

6. University of Virginia

Abstract

Modern graphics processing units (GPUs) employ a large number of hardware threads to hide both function unit and memory access latency. Extreme multithreading requires a complex thread scheduler as well as a large register file, which is expensive to access in terms of both energy and latency. We present two complementary techniques for reducing energy on massively-threaded processors such as GPUs. First, we investigate a two-level thread scheduler that maintains a small set of active threads to hide ALU and local memory access latency and a larger set of pending threads to hide main memory latency. Reducing the number of threads that the scheduler must consider each cycle improves the scheduler’s energy efficiency. Second, we propose replacing the monolithic register file found on modern designs with a hierarchical register file. We explore various trade-offs for the hierarchy including the number of levels in the hierarchy and the number of entries at each level. We consider both a hardware-managed caching scheme and a software-managed scheme, where the compiler is responsible for orchestrating all data movement within the register file hierarchy. Combined with a hierarchical register file, our two-level thread scheduler provides a further reduction in energy by only allocating entries in the upper levels of the register file hierarchy for active threads. Averaging across a variety of real-world graphics and compute workloads, the active thread count can be reduced by a factor of 4 with minimal impact on performance and our most efficient three-level software-managed register file hierarchy reduces register file energy by 54%.
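
The two-level scheduling idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class name `TwoLevelScheduler`, the round-robin issue policy, the 400-cycle memory latency, and the 32-thread / 8-slot configuration are all assumptions chosen for illustration. The sketch only shows the structural point that the scheduler selects from the small active set each cycle, while threads stalled on a long-latency access leave that set and are replaced from the pending set.

```python
from collections import deque


class TwoLevelScheduler:
    """Toy model: a small active set plus a larger pending set (hypothetical names)."""

    def __init__(self, thread_ids, active_slots):
        ids = list(thread_ids)
        self.active_slots = active_slots
        self.active = deque(ids[:active_slots])   # threads examined every cycle
        self.pending = deque(ids[active_slots:])  # threads not examined for issue
        self.waiting = {}  # thread id -> cycle when its long-latency access completes

    def _refill(self, cycle):
        # Threads whose long-latency access has completed become pending again.
        for tid in sorted(t for t, ready in self.waiting.items() if ready <= cycle):
            del self.waiting[tid]
            self.pending.append(tid)
        # Promote pending threads into any free active slots.
        while len(self.active) < self.active_slots and self.pending:
            self.active.append(self.pending.popleft())

    def issue(self, cycle, long_latency=False, memory_latency=400):
        """Issue from the next active thread (round-robin, an assumed policy).

        If the issued instruction is a long-latency access, the thread is
        removed from the active set and a pending thread takes its slot.
        """
        self._refill(cycle)
        if not self.active:
            return None
        tid = self.active.popleft()
        if long_latency:
            self.waiting[tid] = cycle + memory_latency  # assumed fixed latency
            self._refill(cycle)
        else:
            self.active.append(tid)
        return tid


# Assumed configuration: 32 hardware threads, 8 active slots.
sched = TwoLevelScheduler(range(32), active_slots=8)
first = sched.issue(cycle=0)                      # short-latency instruction
second = sched.issue(cycle=1, long_latency=True)  # stalls on main memory
```

In this configuration the selection logic only ever looks at 8 of the 32 threads per cycle. Whether that ratio costs performance depends on the workload, which is what the paper evaluates.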

Funder

Division of Computing and Communication Foundations

Defense Advanced Research Projects Agency

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/2166879.2166882

Reference: 56 articles.

1. APRIL

2. AMD. 2010. ATI Stream Computing OpenCL Programming Guide. http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_OpenCL_Programming_Guide.pdf.

3. AMD. 2011. HD 6900 series instruction set architecture. http://developer.amd.com/gpu/amdappsdk/assets/AMD_HD_6900_Series_Instruction_Set_Architecture.pdf.

4. Power-Aware Compilation for Register File Energy Reduction

Cited by 23 articles.

1. PresCount: Effective Register Allocation for Bank Conflict Reduction;2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO);2024-03-02

2. Many-BSP: an analytical performance model for CUDA kernels;Computing;2024-02-26

3. HeteroCore GPU to Exploit TLP-Resource Diversity;IEEE Transactions on Parallel and Distributed Systems;2019-01-01

4. GPU NTC Process Variation Compensation With Voltage Stacking;IEEE Transactions on Very Large Scale Integration (VLSI) Systems;2018-09

5. Architectural Synthesis of Multi-SIMD Dataflow Accelerators for FPGA;IEEE Transactions on Parallel and Distributed Systems;2018-01-01