Affiliation:
1. Computer Architecture Department, Universitat Politècnica de Catalunya-BarcelonaTECH, Barcelona, Spain
Abstract
Hybrid computer systems combine compute units (CUs) of different natures, such as CPUs, GPUs, and FPGAs. Exploiting the computing power of these CUs simultaneously requires a careful decomposition of the application into balanced parallel tasks, according to both the performance of each CU type and the communication costs among them. This paper describes the design and implementation of runtime support for hybrid GPU-CPU OpenMP applications mixed with GPU-oriented programming models (e.g., CUDA/HIP). It presents the case for a hybrid multi-level parallelization of the NPB-MZ benchmark suite, exploiting both coarse-grain and fine-grain parallelism mapped to compute units of different natures (GPUs and CPUs). The paper describes the implementation of runtime support to bridge OpenMP and HIP, introducing the abstractions of Computing Unit and Data Placement. We compare hybrid and non-hybrid executions under state-of-the-art OpenMP schedulers: static and dynamic task scheduling. We then extend the set of schedulers with two additional variants: memorizing-dynamic task scheduling and profile-based static task scheduling. On a computing node composed of one AMD EPYC 7742 @ 2.25 GHz (64 cores, 2 threads/core, totalling 128 threads per node) and 2 AMD Radeon Instinct MI50 GPUs with 32 GB each, hybrid executions achieve speedups from 1.10× up to 3.5× with respect to a non-hybrid GPU implementation, depending on the number of activated CUs.
Funder
Spanish Ministry of Science and Technology
Subject
Hardware and Architecture, Theoretical Computer Science, Software
Cited by: 2 articles.
1. Heterogeneous Intra-Pipeline Device-Parallel Aggregations;Proceedings of the 20th International Workshop on Data Management on New Hardware;2024-06-09
2. Comparative Study on Serial and Parallel Implementation of Face Detection;2024 International Conference on Intelligent and Innovative Technologies in Computing, Electrical and Electronics (IITCEE);2024-01-24