Affiliation:
1. Institute for Research in Fundamental Sciences (IPM)
2. University of Michigan
3. Sharif University of Technology
4. Sharif University of Technology and Institute for Research in Fundamental Sciences (IPM)
5. EPFL
6. CMU
7. ETH Zürich and CMU
Abstract
Graphics Processing Units (GPUs) employ large register files to accommodate all active threads and accelerate context switching. Unfortunately, register files are a scalability bottleneck for future GPUs due to long access latency, high power consumption, and large silicon area provisioning. Prior work proposes a hierarchical register file, which reduces register file power consumption by caching registers in a smaller register file cache. Unfortunately, this approach does not improve register access latency due to the low hit rate in the register file cache.
In this article, we propose the Latency-Tolerant Register File (LTRF) architecture to achieve low latency in a two-level hierarchical structure while keeping power consumption low. We observe that compile-time interval analysis enables us to divide GPU program execution into intervals with an accurate estimate of a warp’s aggregate register working-set within each interval. The key idea of LTRF is to prefetch the estimated register working-set from the main register file to the register file cache under software control, at the beginning of each interval, and overlap the prefetch latency with the execution of other warps. We observe that register bank conflicts while prefetching the registers could greatly reduce the effectiveness of LTRF. Therefore, we devise a compile-time register renumbering technique to reduce the likelihood of register bank conflicts. Our experimental results show that LTRF enables high-capacity yet long-latency main GPU register files, paving the way for various optimizations. As an example optimization, we implement the main register file with emerging high-density high-latency memory technologies, enabling 8× larger capacity and improving overall GPU performance by 34%.
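The abstract's compile-time register renumbering idea can be illustrated with a small sketch. The code below is a hypothetical simplification, not the paper's actual algorithm: it assumes a register's bank is `register_id % NUM_BANKS` and that registers in a working set are prefetched in order, then greedily renumbers them so consecutive prefetches target distinct banks round-robin. The function names and the bank-mapping rule are illustrative assumptions.

```python
# Hypothetical sketch of bank-conflict-aware register renumbering.
# Assumption (not from the paper): bank = register_id % NUM_BANKS,
# and registers in a working set are prefetched in list order.

NUM_BANKS = 4

def bank_conflicts(reg_ids, num_banks=NUM_BANKS):
    """Count adjacent prefetches that target the same bank."""
    return sum(1 for a, b in zip(reg_ids, reg_ids[1:])
               if a % num_banks == b % num_banks)

def renumber(reg_ids, num_banks=NUM_BANKS):
    """Greedily assign new ids so prefetch order cycles through banks.

    Assumes each register id appears once in the working-set list.
    """
    mapping = {}
    next_free = {b: b for b in range(num_banks)}  # next unused id per bank
    for i, r in enumerate(reg_ids):
        bank = i % num_banks                      # round-robin target bank
        mapping[r] = next_free[bank]
        next_free[bank] += num_banks              # next id in the same bank
    return [mapping[r] for r in reg_ids]

# Working set with three adjacent same-bank pairs before renumbering.
working_set = [0, 4, 8, 1, 5, 2]
print(bank_conflicts(working_set))            # prints 3
print(bank_conflicts(renumber(working_set)))  # prints 0
```

After renumbering, consecutive prefetches stripe across all four banks, so a prefetch of the whole working set proceeds without serialization on any single bank.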
Publisher
Association for Computing Machinery (ACM)
Cited by
6 articles.
1. Cross-core Data Sharing for Energy-efficient GPUs;ACM Transactions on Architecture and Code Optimization;2024-09-14
2. CV32RT: Enabling Fast Interrupt and Context Switching for RISC-V Microcontrollers;IEEE Transactions on Very Large Scale Integration (VLSI) Systems;2024-06
3. PresCount: Effective Register Allocation for Bank Conflict Reduction;2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO);2024-03-02
4. Snake: A Variable-length Chain-based Prefetching for GPUs;56th Annual IEEE/ACM International Symposium on Microarchitecture;2023-10-28
5. OSM: Off-Chip Shared Memory for GPUs;IEEE Transactions on Parallel and Distributed Systems;2022-12-01