Affiliation:
1. Indian Institute of Science, Karnataka, India
Abstract
Multi-GPU machines are being increasingly used in high-performance computing. Each GPU in such a machine has its own memory and does not share the address space either with the host CPU or other GPUs. Hence, applications utilizing multiple GPUs have to manually allocate and manage data on each GPU. Existing works that propose to automate data allocations for GPUs have limitations and inefficiencies in terms of allocation size, reuse exploitation, transfer cost, and scalability.
We propose a scalable and fully automatic data allocation and buffer management scheme for affine loop nests on multi-GPU machines. We call it the Bounding-Box-based Memory Manager (BBMM). At runtime, BBMM can perform standard set operations like union, intersection, and difference, and find subset and superset relations, on hyperrectangular regions of array data (bounding boxes). It uses these operations, along with some compiler assistance, to identify, allocate, and manage the data required by applications in terms of disjoint bounding boxes. This allows it to (1) allocate exactly or nearly as much data as is required by computations running on each GPU, (2) efficiently track buffer allocations and hence maximize data reuse across tiles and minimize data transfer overhead, and (3) as a result, maximize utilization of the combined memory on multi-GPU machines. BBMM can work with any choice of parallelizing transformations, computation placement, and scheduling schemes, whether static or dynamic. Experiments run on a four-GPU machine with various scientific programs showed that BBMM reduces data allocations on each GPU by up to 75% compared to current allocation schemes, yields performance of at least 88% of manually written code, and allows excellent weak scaling.
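To make the bounding-box set operations concrete, here is a minimal C++ sketch (not the paper's implementation; all names such as BBox, intersect, and difference are illustrative assumptions) of intersection, a subset test, and a difference operation that splits one hyperrectangle against another into disjoint boxes, as the abstract describes:

#include <algorithm>
#include <cstdio>
#include <optional>
#include <vector>

constexpr int kDims = 2;  // dimensionality of the array region

struct BBox {
    long lo[kDims];  // inclusive lower corner, one entry per array dimension
    long hi[kDims];  // inclusive upper corner
};

// Intersection: component-wise max of lower corners, min of upper corners.
std::optional<BBox> intersect(const BBox& a, const BBox& b) {
    BBox r;
    for (int d = 0; d < kDims; ++d) {
        r.lo[d] = std::max(a.lo[d], b.lo[d]);
        r.hi[d] = std::min(a.hi[d], b.hi[d]);
        if (r.lo[d] > r.hi[d]) return std::nullopt;  // empty overlap
    }
    return r;
}

// Subset test: a is contained in b iff each interval of a lies inside b's.
bool subset(const BBox& a, const BBox& b) {
    for (int d = 0; d < kDims; ++d)
        if (a.lo[d] < b.lo[d] || a.hi[d] > b.hi[d]) return false;
    return true;
}

// Difference a \ b as a set of disjoint boxes: peel off the slabs of a that
// lie outside b, one dimension at a time, shrinking the remainder as we go.
std::vector<BBox> difference(BBox a, const BBox& b) {
    if (!intersect(a, b)) return {a};  // no overlap: a survives whole
    std::vector<BBox> out;
    for (int d = 0; d < kDims; ++d) {
        if (a.lo[d] < b.lo[d]) {  // slab of a below b in dimension d
            BBox s = a; s.hi[d] = b.lo[d] - 1; out.push_back(s);
            a.lo[d] = b.lo[d];
        }
        if (a.hi[d] > b.hi[d]) {  // slab of a above b in dimension d
            BBox s = a; s.lo[d] = b.hi[d] + 1; out.push_back(s);
            a.hi[d] = b.hi[d];
        }
    }
    return out;  // the remainder of a lies inside b and is dropped
}

int main() {
    // Two overlapping 100x100 tile footprints, as might arise across tiles.
    BBox tile1{{0, 0}, {99, 99}}, tile2{{50, 50}, {149, 149}};
    if (auto ov = intersect(tile1, tile2))
        std::printf("overlap: [%ld..%ld] x [%ld..%ld]\n",
                    ov->lo[0], ov->hi[0], ov->lo[1], ov->hi[1]);
    std::printf("tile2 \\ tile1 splits into %zu disjoint boxes\n",
                difference(tile2, tile1).size());
}

The difference operation hints at why the scheme maintains disjoint boxes: the set difference of two hyperrectangles is not itself a hyperrectangle, so it is represented as a small set of non-overlapping boxes, over which allocation and reuse can then be tracked exactly.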
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
10 articles.
1. Advances of Pipeline Model Parallelism for Deep Learning Training: An Overview;Journal of Computer Science and Technology;2024-05
2. Efficient Job Offloading in Heterogeneous Systems Through Hardware-Assisted Packet-Based Dispatching and User-Level Runtime Infrastructure;IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems;2020-05
3. OmpMemOpt: Optimized Memory Movement for Heterogeneous Computing;Euro-Par 2020: Parallel Processing;2020
4. HiWayLib;Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems;2019-04-04
5. CODA;ACM Transactions on Architecture and Code Optimization;2018-09-30