A multithreaded CUDA and OpenMP based power‐aware programming framework for multi‐node GPU systems

Published: 2023-08-29 Issue: 25 Volume: 35
ISSN: 1532-0626
Container-title: Concurrency and Computation: Practice and Experience
Language: en
Short-container-title: Concurrency and Computation

Author:

Czarnul Paweł¹

Affiliation:

1. Department of Computer Architecture, Faculty of Electronics, Telecommunications and Informatics, Gdansk University of Technology, Gdansk, Poland

Abstract

Summary

In this article, we propose a framework for programming a parallel application for a multi‐node system, with one or more graphics processing units (GPUs) per node, using OpenMP together with an extended CUDA API. OpenMP is used to launch threads responsible for managing particular GPUs, and the extended CUDA calls transfer data and launch kernels on local and remote GPUs. The framework hides inter‐node MPI communication from the programmer. For optimization, the implementation uses the MPI_THREAD_MULTIPLE mode, which allows multiple threads to handle distinct GPUs and allows communication and computations to be overlapped transparently using multiple CUDA streams. The solution supports data parallelization across the available GPUs in order to minimize execution time, as well as a power‐aware mode in which GPUs are selected automatically for computations, using a greedy approach, so that an imposed power limit is not exceeded. We implemented and benchmarked three parallel applications: finding the largest divisors, verification of the Collatz conjecture, and finding patterns in vectors. These were tested on three different systems: a GPU cluster with 16 nodes, each with an NVIDIA GTX 1060 GPU; a powerful 2‐node system, with one node with 8 NVIDIA Quadro RTX 6000 GPUs and the second with 4 NVIDIA Quadro RTX 5000 GPUs; and a heterogeneous environment with one node with 2 NVIDIA RTX 2080 GPUs and 2 nodes with NVIDIA GTX 1060 GPUs. We demonstrate the effectiveness of the framework through execution times versus power caps within ranges of 100–1400 W, 250–3000 W, and 125–600 W for these systems, respectively, as well as gains from using two CUDA streams per GPU rather than one. Finally, we show that, for the testbed applications, the solution achieves speed‐ups between 89.3% and 97.4% of the theoretically assessed ideal ones for 16 nodes and 2 CUDA streams, demonstrating very good parallel efficiency.
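The abstract states that the power‐aware mode selects GPUs greedily under an imposed power limit, but it does not give the selection criterion. The following is a minimal sketch of one possible greedy rule, assuming each GPU has a known power draw and a relative performance score, and that GPUs are considered in order of decreasing performance per watt. The `GpuInfo` struct, its fields, and `select_gpus` are hypothetical names introduced for illustration; they are not part of the framework's API.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical description of one GPU; field names are illustrative only.
struct GpuInfo {
    int id;              // index of the GPU in the whole system
    double power_watts;  // assumed power draw under load, must be > 0
    double perf;         // assumed relative performance (higher is faster)
};

// Greedy selection (an assumed rule, not necessarily the paper's):
// visit GPUs in order of decreasing performance per watt and keep each
// one that still fits under the power cap. Returns the ids of kept GPUs.
std::vector<int> select_gpus(std::vector<GpuInfo> gpus, double power_cap_watts) {
    for (const GpuInfo& g : gpus) {
        assert(g.power_watts > 0.0);  // avoid division by zero below
    }
    std::sort(gpus.begin(), gpus.end(),
              [](const GpuInfo& a, const GpuInfo& b) {
                  return a.perf / a.power_watts > b.perf / b.power_watts;
              });

    std::vector<int> chosen;
    double used_watts = 0.0;
    for (const GpuInfo& g : gpus) {
        if (used_watts + g.power_watts <= power_cap_watts) {
            chosen.push_back(g.id);
            used_watts += g.power_watts;
        }
    }
    return chosen;
}
```

With two GPUs rated at 120 W and one faster GPU rated at 250 W, a 300 W cap would leave only the 250 W device selected under this rule, while a 500 W cap would admit all three. A greedy rule of this kind is simple and fast but is not guaranteed to find the best subset for every cap.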

Publisher

Wiley

Subject

Computational Theory and Mathematics, Computer Networks and Communications, Computer Science Applications, Theoretical Computer Science, Software

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1002/cpe.7897


Cited by 1 article.

1. Special Issue on the pervasive nature of HPC (PN‐HPC). Concurrency and Computation: Practice and Experience, 2024-01-14