Device Hopping

Published:2021-12-31 Issue:4 Volume:18 Page:1-25
ISSN:1544-3566
Container-title:ACM Transactions on Architecture and Code Optimization
language:en
Short-container-title:ACM Trans. Archit. Code Optim.

Author:

Metzger Paul¹, Seeker Volker¹, Fensch Christian¹, Cole Murray¹

Affiliation:

1. School of Informatics, University of Edinburgh, Edinburgh, United Kingdom

Abstract

Existing OS techniques for homogeneous many-core systems make it simple for single-threaded and multithreaded applications to migrate between cores. Heterogeneous systems do not benefit as fully from this flexibility, and applications that cannot migrate mid-execution may lose potential performance. The situation is particularly challenging when a switch of language runtime would be desirable in conjunction with a migration. We present a case study in making heterogeneous CPU + GPU systems more flexible in this respect. Our technique for fine-grained application migration allows switches between OpenMP, OpenCL, and CUDA execution, in conjunction with migrations from GPU to CPU and from CPU to GPU. To achieve this, we subdivide iteration spaces into slices and consider migration on a slice-by-slice basis. We show that slice sizes can be learned offline by machine learning models. To further improve performance, memory transfers are made migration-aware. The complexity of the migration capability is hidden from programmers behind a high-level programming model. We present a detailed evaluation of our mid-kernel migration mechanism with the First Come, First Served scheduling policy. We compare our technique in a focused evaluation scenario against idealized kernel-by-kernel scheduling, which is typical for current systems and makes perfect kernel-to-device scheduling decisions but cannot migrate kernels mid-execution. Models show that a speedup of up to 1.33× can be achieved over these systems by adding fine-grained migration. Our experimental results with all nine applicable SHOC and Rodinia benchmarks show speedups of up to 1.30× (1.08× on average) over an implementation of a perfect but kernel-migration-incapable scheduler when kernels are migrated to a faster device. Our mechanism and slice size choices introduce an average slowdown of only 2.44% if kernels never migrate. Lastly, our programming model reduces the code size by at least 88% when compared to manual implementations of migratable kernels.
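
The slice-by-slice idea in the abstract can be illustrated with a small host-only sketch. This is not the authors' implementation: the names `run_sliced`, `cpu_kernel`, `gpu_kernel`, and `pick_device` are hypothetical, the two "kernels" are plain Python stand-ins for an OpenMP and a CUDA version of the same loop body, and the scheduling policy is an assumed toy rule. The sketch only shows how an iteration space can be cut into slices so that the executing device may change at each slice boundary.

```python
from typing import Callable, Dict, List, Tuple


def cpu_kernel(out: List[float], lo: int, hi: int) -> None:
    """Stand-in for a CPU (e.g. OpenMP) version of the loop body."""
    for i in range(lo, hi):
        out[i] = float(i * i)


def gpu_kernel(out: List[float], lo: int, hi: int) -> None:
    """Stand-in for a GPU (e.g. CUDA) version; must compute the same result."""
    out[lo:hi] = [float(i * i) for i in range(lo, hi)]


# Hypothetical device table: one implementation of the kernel per device.
KERNELS: Dict[str, Callable[[List[float], int, int], None]] = {
    "cpu": cpu_kernel,
    "gpu": gpu_kernel,
}


def run_sliced(
    n: int,
    slice_size: int,
    kernels: Dict[str, Callable[[List[float], int, int], None]],
    pick_device: Callable[[int, str], str],
    start_device: str,
) -> Tuple[List[float], List[Tuple[str, int, int]]]:
    """Run iterations [0, n) slice by slice.

    Before each slice, pick_device(lo, current) may return a different
    device, which models a migration at a slice boundary.
    """
    out = [0.0] * n
    trace: List[Tuple[str, int, int]] = []
    device = start_device
    for lo in range(0, n, slice_size):
        hi = min(lo + slice_size, n)      # last slice may be shorter
        device = pick_device(lo, device)  # migration decision point
        kernels[device](out, lo, hi)
        trace.append((device, lo, hi))
    return out, trace


# Assumed toy policy: the GPU becomes free once iteration 4 is reached.
result, trace = run_sliced(
    10, 4, KERNELS, lambda lo, cur: "gpu" if lo >= 4 else cur, "cpu"
)
for device, lo, hi in trace:
    print(f"iterations [{lo}, {hi}) ran on {device}")
```

In this sketch the slice size is a fixed argument; the abstract states that slice sizes are instead learned offline by machine learning models, and that memory transfers are made migration-aware, neither of which is modelled here.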

Funder

EPSRC Centre for Doctoral Training in Pervasive Parallelism

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3471909
