Affiliation:
1. Georgia Institute of Technology
2. Advanced Micro Devices, Inc.
Abstract
To exploit the parallelism and scalability of multiple GPUs in a system, it is critical to place compute and data together. However, two key techniques used in traditional GPU systems to hide memory latency and improve thread-level parallelism (TLP), namely memory interleaving and thread block scheduling, are at odds with efficient use of multiple GPUs. Distributing data across multiple GPUs to improve overall memory bandwidth utilization incurs high remote traffic when the data and compute are misaligned. Nondeterministic thread block scheduling to improve compute resource utilization impedes co-placement of compute and data. Our goal in this work is to enable co-placement of compute and data in the presence of fine-grained interleaved memory with a low-cost approach.
To this end, we propose a mechanism that identifies exclusively accessed data and places that data, along with the thread block that accesses it, in the same GPU. The key ideas are (1) the amount of data exclusively used by a thread block can be estimated, and that exclusive data (of any size) can be localized to one GPU with coarse-grained interleaved pages; (2) using an affinity-based thread block scheduling policy, we can co-place compute and data; and (3) by using a dual address mode with lightweight changes to virtual-to-physical page mappings, we can selectively choose a different memory interleaving granularity for each data structure. Our evaluations across a wide range of workloads show that the proposed mechanism improves performance by 31% and reduces remote traffic by 38% over a baseline system.
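The intuition behind key ideas (1) and (2) can be illustrated with a small sketch. This is not the paper's implementation; the GPU count, page size, interleaving granularities, and round-robin home-node function below are all hypothetical parameters chosen for illustration. It shows why a thread block's exclusive region is scattered across GPUs under fine-grained interleaving but localized to a single GPU under page-granularity interleaving, which an affinity-based scheduler can then target.

```python
# Illustrative sketch (assumed parameters, not the paper's mechanism):
# compare which GPUs hold a thread block's exclusive data region under
# fine-grained vs. coarse-grained (page) round-robin interleaving.

NUM_GPUS = 4
PAGE_SIZE = 4096   # bytes; coarse (page) interleaving granularity
FINE_GRAIN = 256   # bytes; fine interleaving granularity

def home_gpu(addr: int, granularity: int) -> int:
    """GPU that owns the memory at `addr` under round-robin interleaving."""
    return (addr // granularity) % NUM_GPUS

def gpus_touched(base: int, size: int, granularity: int) -> set[int]:
    """Set of GPUs holding any part of the region [base, base + size)."""
    return {home_gpu(a, granularity)
            for a in range(base, base + size, granularity)}

# A thread block's exclusive region: one page-aligned page.
base, size = 7 * PAGE_SIZE, PAGE_SIZE

fine = gpus_touched(base, size, FINE_GRAIN)    # spread over all 4 GPUs
coarse = gpus_touched(base, size, PAGE_SIZE)   # entirely on one GPU

# Affinity-based scheduling: run the block on the GPU owning its data,
# so its exclusive accesses stay local.
affinity_gpu = home_gpu(base, PAGE_SIZE)
```

Under the fine granularity every 256-byte chunk rotates to a different GPU, so the region spans all four GPUs; under page granularity the whole region has a single home GPU, and scheduling the block there eliminates remote traffic for its exclusive accesses.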
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture,Information Systems,Software
Cited by
16 articles.
1. Barre Chord: Efficient Virtual Memory Translation for Multi-Chip-Module GPUs;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29
2. Salus: Efficient Security Support for CXL-Expanded GPU Memory;2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA);2024-03-02
3. Characterizing Multi-Chip GPU Data Sharing;ACM Transactions on Architecture and Code Optimization;2023-12-14
4. FILL: a heterogeneous resource scheduling system addressing the low throughput problem in GROMACS;CCF Transactions on High Performance Computing;2023-09-23
5. Spica: Exploring FPGA Optimizations to Enable an Efficient SpMV Implementation for Computations at Edge;2023 IEEE International Conference on Edge Computing and Communications (EDGE);2023-07