Runtime support for CPU-GPU high-performance computing on distributed memory platforms

Published: 2024-07-19 Volume: 2
ISSN: 2813-7337
Container-title: Frontiers in High Performance Computing
Short-container-title: Front. High Perform. Comput.

Author:

Thomadakis Polykarpos, Chrisochoides Nikos

Abstract

Introduction: Hardware heterogeneity is here to stay for high-performance computing. Large-scale systems are currently equipped with multiple GPU accelerators per compute node and are expected to incorporate more specialized hardware. This shift in the computing ecosystem offers many opportunities for performance improvement; however, it also increases the complexity of programming for such architectures.

Methods: This work introduces a runtime framework that enables effortless programming for heterogeneous systems while efficiently utilizing hardware resources. The framework is integrated within a distributed and scalable runtime system to facilitate performance portability across heterogeneous nodes. Along with the design, this paper describes the implementation and optimizations performed, achieving up to 300% improvement on a single device and linear scalability on a node equipped with four GPUs.

Results: The framework in a distributed memory environment offers portable abstractions that enable efficient inter-node communication among devices with varying capabilities. It delivers superior performance compared to MPI+CUDA by up to 20% for large messages while keeping the overheads for small messages within 10%. Furthermore, the results of our performance evaluation in a distributed Jacobi proxy application demonstrate that our software imposes minimal overhead and achieves a performance improvement of up to 40%.

Discussion: This is accomplished by the optimizations at the library level and by creating opportunities to leverage application-specific optimizations like over-decomposition.

Publisher

Frontiers Media SA

References: 47 articles.

1. Position Papers for the ASCR Workshop on Reimagining Codesign

2. Data parallel C++: enhancing SYCL through extensions for productivity and performance; Ashbaugh; 2020

3. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures; Augonnet; Concurr. Comput.; 2011

4. A novel dynamic load balancing library for cluster computing; Balasubramaniam; Proceedings 3rd International Symposium on Parallel and Distributed Computing; 2004

Cited by 1 article.

1. Evaluating ARM and RISC-V Architectures for High-Performance Computing with Docker and Kubernetes; Electronics; 2024-09-03