Pagoda

Published:2019-12-26 Issue:4 Volume:6 Page:1-23
ISSN:2329-4949
Container-title:ACM Transactions on Parallel Computing
language:en
Short-container-title:ACM Trans. Parallel Comput.

Author:

Yeh Tsung Tai¹, Sabne Amit², Sakdhnagool Putt³, Eigenmann Rudolf⁴, Rogers Timothy G.⁵

Affiliation:

1. Purdue University, West Lafayette, IN, USA

2. Microsoft

3. National Electronics and Computer Technology Center

4. University of Delaware

5. Purdue University

Abstract

Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize their hardware, contemporary workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain fewer than 512 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. This article presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. This level of control enables the GPU to keep scheduling and executing tasks as long as free warps are found, dramatically reducing underutilization. Experimental results on real hardware demonstrate that Pagoda achieves a geometric mean speedup of 5.52X over PThreads running on a 20-core CPU, 1.76X over CUDA-HyperQ, and 1.44X over GeMTC, the state-of-the-art runtime GPU task scheduling system.
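
The idea of scheduling narrow tasks at warp granularity can be illustrated with a small host-side simulation. This is only a toy sketch, not Pagoda's MasterKernel: the names `warps_needed` and `schedule`, the FIFO policy, the tick-based timing, and the task tuples are all hypothetical, and the warp size of 32 threads is assumed from NVIDIA GPUs rather than stated in the abstract.

```python
from collections import deque

WARP_SIZE = 32  # assumption: threads per warp on NVIDIA GPUs


def warps_needed(num_threads):
    """Number of whole warps a task with num_threads threads occupies."""
    return (num_threads + WARP_SIZE - 1) // WARP_SIZE


def schedule(tasks, total_warps):
    """Toy model: start tasks in FIFO order whenever enough warps are free.

    tasks: list of (name, num_threads, duration_in_ticks), in arrival order.
    Returns a dict mapping each task name to the tick at which it started.
    """
    for name, threads, duration in tasks:
        if warps_needed(threads) > total_warps or duration < 1:
            raise ValueError(f"task {name!r} can never be scheduled")

    pending = deque(tasks)
    running = []          # (finish_tick, warps_held) for each active task
    free = total_warps
    start = {}
    tick = 0
    while pending or running:
        # Reclaim the warps of every task that has finished by this tick.
        still_running = []
        for finish, warps in running:
            if finish <= tick:
                free += warps
            else:
                still_running.append((finish, warps))
        running = still_running

        # Keep launching tasks as long as free warps are found.
        while pending and warps_needed(pending[0][1]) <= free:
            name, threads, duration = pending.popleft()
            warps = warps_needed(threads)
            free -= warps
            start[name] = tick
            running.append((tick + duration, warps))
        tick += 1
    return start


if __name__ == "__main__":
    # Four narrow tasks (all under 512 threads) sharing a pool of 8 warps.
    demo = [("a", 128, 2), ("b", 64, 1), ("c", 96, 1), ("d", 32, 1)]
    print(schedule(demo, total_warps=8))
```

In this model, tasks `a` and `b` start together because they fit in the free warps, and `c` and `d` start as soon as `b` releases its warps, without waiting for `a` to finish. A kernel-granularity scheduler that gave each task the whole device would instead run them one after another.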

Publisher

Association for Computing Machinery (ACM)

Subject

Computational Theory and Mathematics, Computer Science Applications, Hardware and Architecture, Modelling and Simulation, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3365657

Reference 42 articles.

1. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures

2. A virtual memory based runtime to support multi-tenancy in clusters with GPUs

3. Barcelona OpenMP Tasks Suite: A Set of Benchmarks Targeting the Exploitation of Task Parallelism in OpenMP

Cited by 4 articles.

1. GhOST: a GPU Out-of-Order Scheduling Technique for Stall Reduction;2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA);2024-06-29

2. ROSGM: A Real-Time GPU Management Framework with Plug-In Policies for ROS 2;2023 IEEE 29th Real-Time and Embedded Technology and Applications Symposium (RTAS);2023-05

3. Dynamic GPU Scheduling with Multi-resource Awareness and Live Migration Support;IEEE Transactions on Cloud Computing;2023

4. A Comprehensive Survey on Training Acceleration for Large Machine Learning Models in IoT;IEEE Internet of Things Journal;2022-01-15