Hybrid CPU-GPU scheduling and execution of tree traversals

Published: 2016-11-09 Issue: 8 Volume: 51 Pages: 1-2
ISSN: 0362-1340
Container title: ACM SIGPLAN Notices
Language: en
Short container title: SIGPLAN Not.

Author:

Liu Jianqiao¹, Hegde Nikhil¹, Kulkarni Milind¹

Affiliation:

1. Purdue University

Abstract

GPUs offer the promise of massive, power-efficient parallelism. However, exploiting this parallelism requires code to be carefully structured to deal with the limitations of the SIMT execution model. In recent years, there has been much interest in mapping irregular applications, those with unpredictable, data-dependent behaviors, to GPUs. While most of the work in this space has focused on ad hoc implementations of specific algorithms, recent work has looked at generic techniques for mapping a large class of tree traversal algorithms to GPUs by carefully restructuring the traversals to make them behave more regularly. Unfortunately, even this general approach to GPU execution of tree traversal algorithms relies on ad hoc, handwritten, algorithm-specific scheduling (i.e., assignment of threads to warps) to achieve high performance. The key challenge of scheduling is that it is a highly irregular process that requires inspecting thread behavior and then carefully sorting the threads into warps. In this paper, we present a novel scheduling and execution technique for tree traversal algorithms that is both general and automatic. The key novelty is a hybrid approach: the GPU partially executes tasks to inspect thread behavior and transmits that information back to the CPU, which uses it to perform the scheduling itself; the remaining, carefully scheduled portion of the traversals then executes on the GPU. We applied this framework to five tree traversal algorithms, achieving significant speedups over optimized GPU code that does not perform application-specific scheduling. Further, we show that in many cases our hybrid approach delivers better performance than even GPU code that uses hand-tuned, application-specific scheduling.
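
The inspect, schedule, execute flow described in the abstract can be illustrated with a toy model. The Python sketch below is not the authors' implementation and runs entirely on the CPU: the tree is an implicit binary tree over the interval [0, 1), and `DEPTH`, `INSPECT`, `path`, `inspect`, `schedule`, and `warp_cost` are hypothetical names and parameters chosen for illustration. Each "thread" is a point query. A short partial traversal stands in for the GPU inspection phase, a sort by the resulting signature stands in for the CPU scheduling step, and the number of distinct nodes a warp must visit serves as a rough proxy for SIMT divergence.

```python
import random

WARP_SIZE = 32   # threads per warp on NVIDIA GPUs
DEPTH = 12       # levels in the toy implicit binary tree (assumption)
INSPECT = 6      # levels run during the inspection phase (assumption)

def path(q, levels):
    """Node ids visited by query q in [0, 1) over the first `levels` levels."""
    node, lo, hi, visited = 0, 0.0, 1.0, []
    for _ in range(levels):
        visited.append(node)
        mid = (lo + hi) / 2
        if q < mid:
            node, hi = 2 * node + 1, mid   # descend into the left child
        else:
            node, lo = 2 * node + 2, mid   # descend into the right child
    return visited

def inspect(queries):
    """Stand-in for the partial GPU run: signature = last node reached."""
    return [path(q, INSPECT)[-1] for q in queries]

def schedule(signatures):
    """Stand-in for the CPU step: sort thread ids by signature, cut into warps."""
    order = sorted(range(len(signatures)), key=lambda t: signatures[t])
    return [order[i:i + WARP_SIZE] for i in range(0, len(order), WARP_SIZE)]

def warp_cost(warps, queries):
    """Sum over warps of the distinct nodes the warp's threads visit."""
    return sum(
        len({n for t in warp for n in path(queries[t], DEPTH)})
        for warp in warps
    )

random.seed(0)
queries = [random.random() for _ in range(1024)]

# Baseline: threads are packed into warps in arrival order.
naive = [list(range(i, i + WARP_SIZE)) for i in range(0, len(queries), WARP_SIZE)]
# Hybrid-style flow: inspect first, then let the "CPU" regroup the threads.
scheduled = schedule(inspect(queries))

naive_cost = warp_cost(naive, queries)
scheduled_cost = warp_cost(scheduled, queries)
print("naive:", naive_cost, "scheduled:", scheduled_cost)
```

In this model, grouping threads whose traversals share a prefix lowers the number of distinct nodes each warp touches. The real technique must also account for the cost of the inspection run and the GPU-to-CPU transfer, which this sketch ignores.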

Funder

U.S. Department of Energy

National Science Foundation

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3016078.2851174

References: 6 articles.

1. An Efficient CUDA Implementation of the Tree-Based Barnes Hut n-Body Algorithm

2. General transformations for GPU execution of tree traversals

3. Realtime Ray Tracing on GPU with BVH-based Packet Traversal

4. Efficient stack-less BVH traversal for ray tracing

5. A GPU implementation of inclusion-based points-to analysis

Cited by 3 articles.

1. Dynamic SIMD Parallel Execution on GPU from High-Level Dataflow Synthesis; Journal of Low Power Electronics and Applications; 2022-07-17

2. FastZ; Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; 2021-11-13

3. Locality-Aware Task-Parallel Execution on GPUs; Languages and Compilers for Parallel Computing; 2017