Affiliation:
1. Georgia Institute of Technology
2. Advanced Micro Devices, Inc.
Abstract
To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques used in traditional GPU systems to hide memory latency and improve thread-level parallelism (TLP), namely memory interleaving and thread block scheduling, are at odds with efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when the data and compute are misaligned. Nondeterministic thread block scheduling to improve compute resource utilization impedes co-placement of compute and data. Our goal in this work is to enable co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach.
To this end, we propose a mechanism that identifies exclusively accessed data and places that data, along with the thread block that accesses it, in the same GPU. The key ideas are (1) the amount of data exclusively used by a thread block can be estimated, and that exclusive data (of any size) can be localized to one GPU with coarse-grained interleaved pages; (2) using an affinity-based thread block scheduling policy, we can co-place compute and data; and (3) by using a dual address mode with lightweight changes to virtual-to-physical page mappings, we can selectively choose a different memory interleaving granularity for each data structure. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces remote traffic by 38% over a baseline system.
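The intuition behind key ideas (1) and (2) can be illustrated with a small sketch. This is not the paper's implementation; the GPU count, page size, interleaving granularities, and round-robin home-node function below are all hypothetical parameters chosen for illustration. It shows why a thread block's exclusive region is scattered across GPUs under fine-grained interleaving but localized to a single GPU under page-granularity interleaving, which an affinity-based scheduler can then target.

```python
# Illustrative sketch (assumed parameters, not the paper's mechanism):
# compare which GPUs hold a thread block's exclusive data region under
# fine-grained vs. coarse-grained (page) round-robin interleaving.

NUM_GPUS = 4
PAGE_SIZE = 4096   # bytes; coarse (page) interleaving granularity
FINE_GRAIN = 256   # bytes; fine interleaving granularity

def home_gpu(addr: int, granularity: int) -> int:
    """GPU that owns the memory at `addr` under round-robin interleaving."""
    return (addr // granularity) % NUM_GPUS

def gpus_touched(base: int, size: int, granularity: int) -> set[int]:
    """Set of GPUs holding any part of the region [base, base + size)."""
    return {home_gpu(a, granularity)
            for a in range(base, base + size, granularity)}

# A thread block's exclusive region: one page-aligned page.
base, size = 7 * PAGE_SIZE, PAGE_SIZE

fine = gpus_touched(base, size, FINE_GRAIN)    # spread over all 4 GPUs
coarse = gpus_touched(base, size, PAGE_SIZE)   # entirely on one GPU

# Affinity-based scheduling: run the block on the GPU owning its data,
# so its exclusive accesses stay local.
affinity_gpu = home_gpu(base, PAGE_SIZE)
```

Under the fine granularity every 256-byte chunk rotates to a different GPU, so the region spans all four GPUs; under page granularity the whole region has a single home GPU, and scheduling the block there eliminates remote traffic for its exclusive accesses.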
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture,Information Systems,Software
Cited by
16 articles.
1. Barre Chord: Efficient Virtual Memory Translation for Multi-Chip-Module GPUs;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29
2. Salus: Efficient Security Support for CXL-Expanded GPU Memory;2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA);2024-03-02
3. Characterizing Multi-Chip GPU Data Sharing;ACM Transactions on Architecture and Code Optimization;2023-12-14
4. FILL: a heterogeneous resource scheduling system addressing the low throughput problem in GROMACS;CCF Transactions on High Performance Computing;2023-09-23
5. Spica: Exploring FPGA Optimizations to Enable an Efficient SpMV Implementation for Computations at Edge;2023 IEEE International Conference on Edge Computing and Communications (EDGE);2023-07