Affiliation:
1. Indian Institute of Science, Karnataka, India
Abstract
Multi-GPU machines are being increasingly used in high-performance computing. Each GPU in such a machine has its own memory and does not share the address space either with the host CPU or other GPUs. Hence, applications utilizing multiple GPUs have to manually allocate and manage data on each GPU. Existing works that propose to automate data allocations for GPUs have limitations and inefficiencies in terms of allocation size, reuse exploitation, transfer cost, and scalability.
We propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines. We call it the Bounding-Box-based Memory Manager (BBMM). At runtime, BBMM can perform standard set operations like union, intersection, and difference, and find subset and superset relations, on hyperrectangular regions of array data (bounding boxes). It uses these operations, along with some compiler assistance, to identify, allocate, and manage the data required by applications in terms of disjoint bounding boxes. This allows it to (1) allocate exactly or nearly as much data as is required by computations running on each GPU, (2) efficiently track buffer allocations and hence maximize data reuse across tiles and minimize data transfer overhead, and (3) as a result, maximize utilization of the combined memory on multi-GPU machines. BBMM can work with any choice of parallelizing transformations, computation placement, and scheduling schemes, whether static or dynamic. Experiments run on a four-GPU machine with various scientific programs showed that BBMM reduces data allocations on each GPU by up to 75% compared to current allocation schemes, yields performance of at least 88% of manually written code, and allows excellent weak scaling.
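To make the bounding-box set operations concrete, here is a minimal C++ sketch (not the paper's implementation; all names such as BBox, intersect, and difference are illustrative assumptions) of intersection, a subset test, and a difference operation that splits one hyperrectangle against another into disjoint boxes, as the abstract describes:

#include <algorithm>
#include <cstdio>
#include <optional>
#include <vector>

constexpr int kDims = 2;  // dimensionality of the array region

struct BBox {
    long lo[kDims];  // inclusive lower corner, one entry per array dimension
    long hi[kDims];  // inclusive upper corner
};

// Intersection: component-wise max of lower corners, min of upper corners.
std::optional<BBox> intersect(const BBox& a, const BBox& b) {
    BBox r;
    for (int d = 0; d < kDims; ++d) {
        r.lo[d] = std::max(a.lo[d], b.lo[d]);
        r.hi[d] = std::min(a.hi[d], b.hi[d]);
        if (r.lo[d] > r.hi[d]) return std::nullopt;  // empty overlap
    }
    return r;
}

// Subset test: a is contained in b iff each interval of a lies inside b's.
bool subset(const BBox& a, const BBox& b) {
    for (int d = 0; d < kDims; ++d)
        if (a.lo[d] < b.lo[d] || a.hi[d] > b.hi[d]) return false;
    return true;
}

// Difference a \ b as a set of disjoint boxes: peel off the slabs of a that
// lie outside b, one dimension at a time, shrinking the remainder as we go.
std::vector<BBox> difference(BBox a, const BBox& b) {
    if (!intersect(a, b)) return {a};  // no overlap: a survives whole
    std::vector<BBox> out;
    for (int d = 0; d < kDims; ++d) {
        if (a.lo[d] < b.lo[d]) {  // slab of a below b in dimension d
            BBox s = a; s.hi[d] = b.lo[d] - 1; out.push_back(s);
            a.lo[d] = b.lo[d];
        }
        if (a.hi[d] > b.hi[d]) {  // slab of a above b in dimension d
            BBox s = a; s.lo[d] = b.hi[d] + 1; out.push_back(s);
            a.hi[d] = b.hi[d];
        }
    }
    return out;  // the remainder of a lies inside b and is dropped
}

int main() {
    // Two overlapping 100x100 tile footprints, as might arise across tiles.
    BBox tile1{{0, 0}, {99, 99}}, tile2{{50, 50}, {149, 149}};
    if (auto ov = intersect(tile1, tile2))
        std::printf("overlap: [%ld..%ld] x [%ld..%ld]\n",
                    ov->lo[0], ov->hi[0], ov->lo[1], ov->hi[1]);
    std::printf("tile2 \\ tile1 splits into %zu disjoint boxes\n",
                difference(tile2, tile1).size());
}

The difference operation hints at why the scheme maintains disjoint boxes: the set difference of two hyperrectangles is not itself a hyperrectangle, so it is represented as a small set of non-overlapping boxes, over which allocation and reuse can then be tracked exactly.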
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
10 articles.
1. Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview;Journal of Computer Science and Technology;2024-05
2. Efficient Job Offloading in Heterogeneous Systems Through Hardware-Assisted Packet-Based Dispatching and User-Level Runtime Infrastructure;IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems;2020-05
3. OmpMemOpt: Optimized Memory Movement for Heterogeneous Computing;Euro-Par 2020: Parallel Processing;2020
4. HiWayLib;Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems;2019-04-04
5. CODA;ACM Transactions on Architecture and Code Optimization;2018-09-30