Cooperative Caching for GPUs

Published:2016-12-28 Issue:4 Volume:13 Page:1-25
ISSN:1544-3566
Container-title:ACM Transactions on Architecture and Code Optimization
language:en
Short-container-title:ACM Trans. Archit. Code Optim.

Author:

Dublish Saumay¹, Nagarajan Vijay¹, Topham Nigel¹

Affiliation:

1. University of Edinburgh, UK

Abstract

The rise of general-purpose computing on GPUs has influenced architectural innovation on them. The introduction of an on-chip cache hierarchy is one such innovation. High L1 miss rates on GPUs, however, indicate inefficient cache usage due to myriad factors, such as cache thrashing and extensive multithreading. Such high L1 miss rates in turn place high demands on the shared L2 bandwidth. Extensive congestion in the L2 access path therefore results in high memory access latencies. In memory-intensive applications, these latencies get exposed due to a lack of active compute threads to mask such high latencies. In this article, we aim to reduce the pressure on the shared L2 bandwidth, thereby reducing the memory access latencies that lie in the critical path. We identify significant replication of data among private L1 caches, presenting an opportunity to reuse data among L1s. We further show how this reuse can be exploited via an L1 Cooperative Caching Network (CCN), thereby reducing the bandwidth demand on L2. In the proposed architecture, we connect the L1 caches with a lightweight ring network to facilitate intercore communication of shared data. We show that this technique reduces traffic to the L2 cache by an average of 29%, freeing up the bandwidth for other accesses. We also show that the CCN reduces the average memory latency by 24%, thereby reducing core stall cycles by 26% on average. This translates into an overall performance improvement of 14.7% on average (and up to 49%) for applications that exhibit reuse across L1 caches. In doing so, the CCN incurs a nominal area and energy overhead of 1.3% and 2.5%, respectively. Notably, the performance improvement with our proposed CCN compares favorably to the performance improvement achieved by simply doubling the number of L2 banks by up to 34%.
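The reuse opportunity described in the abstract can be illustrated with a toy model. The sketch below is not the authors' design: the abstract does not specify the lookup protocol, so it assumes a naive policy in which a core that misses in its own L1 probes every other L1 in ring order before falling back to L2. The names `L1Cache` and `count_l2_requests`, the fully associative LRU organization, and the access trace are all hypothetical.

```python
from collections import OrderedDict


class L1Cache:
    """Tiny fully associative LRU cache that tracks line addresses only."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def lookup(self, addr):
        """Local access: a hit refreshes the line's LRU position."""
        if addr in self.lines:
            self.lines.move_to_end(addr)
            return True
        return False

    def contains(self, addr):
        """Remote probe from a peer: does not disturb LRU order."""
        return addr in self.lines

    def fill(self, addr):
        """Insert a line, evicting the least recently used one if full."""
        self.lines[addr] = True
        self.lines.move_to_end(addr)
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)


def count_l2_requests(trace, num_cores, l1_capacity, cooperative):
    """Count requests that reach L2 for a trace of (core, addr) accesses."""
    caches = [L1Cache(l1_capacity) for _ in range(num_cores)]
    l2_requests = 0
    for core, addr in trace:
        if caches[core].lookup(addr):
            continue  # local L1 hit
        served_by_peer = False
        if cooperative:
            # Assumed policy: walk the ring one hop at a time.
            for hop in range(1, num_cores):
                peer = (core + hop) % num_cores
                if caches[peer].contains(addr):
                    served_by_peer = True
                    break
        if not served_by_peer:
            l2_requests += 1
        caches[core].fill(addr)
    return l2_requests


# Hypothetical trace: four cores each read the same eight lines once.
trace = [(core, addr) for core in range(4) for addr in range(8)]
baseline = count_l2_requests(trace, num_cores=4, l1_capacity=8, cooperative=False)
with_ring = count_l2_requests(trace, num_cores=4, l1_capacity=8, cooperative=True)
print(baseline, with_ring)  # prints "32 8"
```

In this contrived trace every line is replicated in all four L1s, so the ring removes three of every four L2 requests. The figure is a property of the trace, not an estimate of the 29% average reduction reported above, and the model ignores ring latency, bandwidth, and coherence.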

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3001589

References: 44 articles.

1. GPU Concurrency

2. Piranha

3. Managing Wire Delay in Large Chip-Multiprocessor Caches

Cited by 20 articles.

1. Cross-core Data Sharing for Energy-efficient GPUs;ACM Transactions on Architecture and Code Optimization;2024-09-14

2. A Survey of Caching Techniques for General Purpose Graphics Processing Units;2024 3rd International Conference for Innovation in Technology (INOCON);2024-03-01

3. Collaborative Coalescing of Redundant Memory Access for GPU System;2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC);2024-01-22

4. Boustrophedonic Frames: Quasi-Optimal L2 Caching for Textures in GPUs;2023 32nd International Conference on Parallel Architectures and Compilation Techniques (PACT);2023-10-21

5. COLAB;Proceedings of the 28th Asia and South Pacific Design Automation Conference;2023-01-16