GPUDet

Published: 2013-04-23 Issue: 4 Volume: 48 Page: 1-12
ISSN: 0362-1340
Container-title: ACM SIGPLAN Notices
Language: en
Short-container-title: SIGPLAN Not.

Author:

Jooybar Hadi¹, Fung Wilson W.L.¹, O'Connor Mike², Devietti Joseph³, Aamodt Tor M.¹

Affiliation:

1. University of British Columbia, Vancouver, BC, Canada

2. Advanced Micro Devices, Inc. (AMD), Austin, TX, USA (mike.oconnor@amd.com)

3. University of Washington, Seattle, WA, USA

Abstract

Nondeterminism is a key challenge in developing multithreaded applications. Even with the same input, each execution of a multithreaded program may produce a different output. This behavior complicates debugging and limits one's ability to test for correctness. This non-reproducibility situation is aggravated on massively parallel architectures like graphics processing units (GPUs) with thousands of concurrent threads. We believe providing a deterministic environment to ease debugging and testing of GPU applications is essential to enable a broader class of software to use GPUs. Many hardware and software techniques have been proposed for providing determinism on general-purpose multi-core processors. However, these techniques are designed for small numbers of threads. Scaling them to thousands of threads on a GPU is a major challenge. This paper proposes a scalable hardware mechanism, GPUDet, to provide determinism in GPU architectures. In this paper we characterize the existing deterministic and nondeterministic aspects of current GPU execution models, and we use these observations to inform GPUDet's design. For example, GPUDet leverages the inherent determinism of the SIMD hardware in GPUs to provide determinism within a wavefront at no cost. GPUDet also exploits the Z-Buffer Unit, an existing GPU hardware unit for graphics rendering, to allow parallel out-of-order memory writes to produce a deterministic output. Other optimizations in GPUDet include deterministic parallel execution of atomic operations and a workgroup-aware algorithm that eliminates unnecessary global synchronizations. Our simulation results indicate that GPUDet incurs only 2X slowdown on average over a baseline nondeterministic architecture, with runtime overheads as low as 4% for compute-bound applications, despite running GPU kernels with thousands of threads. We also characterize the sources of overhead for deterministic execution on GPUs to provide insights for further optimizations.
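The abstract's point that out-of-order memory writes can still yield a deterministic output can be illustrated with a small software sketch. This is not GPUDet's hardware design: the rule that the lowest thread ID wins (loosely analogous to a depth test) and the names `commit_writes` and `last_writer_wins` are assumptions made purely for illustration.

```python
import random


def commit_writes(writes):
    """Resolve conflicting writes with a depth-test-like rule (assumed).

    Each write is a (thread_id, address, value) tuple. For every address,
    the write with the lowest thread_id wins, regardless of arrival order.
    """
    memory = {}  # address -> committed value
    depth = {}   # address -> thread_id of the current winning write
    for thread_id, address, value in writes:
        if address not in depth or thread_id < depth[address]:
            depth[address] = thread_id
            memory[address] = value
    return memory


def last_writer_wins(writes):
    """Baseline: the final value depends on the arrival order of writes."""
    memory = {}
    for _, address, value in writes:
        memory[address] = value
    return memory


# Sixteen hypothetical threads, four addresses, so every address has conflicts.
writes = [(tid, tid % 4, tid * 10) for tid in range(16)]

# Simulate a different arrival order for the same set of writes.
shuffled = writes[:]
random.Random(1).shuffle(shuffled)

# The priority rule yields the same memory image for both arrival orders.
print(commit_writes(writes) == commit_writes(shuffled))
```

With the baseline `last_writer_wins`, the result is only guaranteed to match across runs if the arrival order happens to match, which is the kind of non-reproducibility the abstract describes.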

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Graphics and Computer-Aided Design, Software

Link

https://dl.acm.org/doi/pdf/10.1145/2499368.2451118

References (46 articles)

1. http://www.ece.ubc.ca/~aamodt/GPUDet

2. White Paper | AMD Graphics Cores Next (GCN) Architecture. AMD, June 2012.

3. Stack Trace Analysis for Large Scale Debugging

4. Analyzing CUDA workloads using a detailed GPU simulator