
Real-world design and evaluation of compiler-managed GPU redundant multithreading

Published:2014-10-16 Issue:3 Volume:42 Page:73-84
ISSN:0163-5964
Container-title:ACM SIGARCH Computer Architecture News
language:en
Short-container-title:SIGARCH Comput. Archit. News

Author:

Wadden Jack¹, Lyashevsky Alexander², Gurumurthi Sudhanva³, Sridharan Vilas⁴, Skadron Kevin¹

Affiliation:

1. University of Virginia, Charlottesville, Virginia, USA

2. AMD Research, Advanced Micro Devices, Inc., Sunnyvale, CA, USA

3. AMD Research, Advanced Micro Devices, Inc., Boxborough, MA, USA

4. RAS Architecture, Advanced Micro Devices, Inc., Boxborough, MA, USA

Abstract

Reliability for general-purpose processing on the GPU (GPGPU) is becoming a weak link in the construction of reliable supercomputer systems. Because hardware protection is expensive to develop, requires dedicated on-chip resources, and is not portable across different architectures, the efficiency of software solutions such as redundant multithreading (RMT) must be explored. This paper presents a real-world design and evaluation of automatic software RMT on GPU hardware. We first describe a compiler pass that automatically converts GPGPU kernels into redundantly threaded versions. We then perform detailed power and performance evaluations of three RMT algorithms, each of which provides fault coverage to a set of structures in the GPU. Using real hardware, we show that compiler-managed software RMT has highly variable costs. We further analyze the individual costs of redundant work scheduling, redundant computation, and inter-thread communication, showing that no single component in general is responsible for high overheads across all applications; instead, certain workload properties tend to cause RMT to perform well or poorly. Finally, we demonstrate the benefit of architectural support for RMT with a specific example of fast, register-level thread communication.
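
As a purely conceptual illustration of the redundant-multithreading idea summarized in the abstract, the sketch below emulates it on the CPU in plain Python: each logical work-item is executed by two redundant "threads", and a result is committed only when both copies agree. This is not the paper's compiler pass or any of its three RMT algorithms; the names `rmt_launch`, `saxpy_item`, and `fault_hook` are hypothetical and chosen only for this example.

```python
# Conceptual sketch only: emulates redundant multithreading (RMT) on the CPU.
# All names (rmt_launch, saxpy_item, fault_hook) are hypothetical.

def saxpy_item(i, a, x, y):
    """Work performed by one logical work-item."""
    return a * x[i] + y[i]


def rmt_launch(kernel, n, *args, fault_hook=None):
    """Run each logical work-item twice and compare before committing.

    fault_hook(physical_id, value) may perturb a value to mimic a
    transient fault in one of the two redundant threads.
    """
    out = [None] * n
    mismatches = []
    for logical_id in range(n):
        results = []
        for copy in range(2):                    # two redundant threads
            physical_id = 2 * logical_id + copy  # launch size is doubled
            value = kernel(logical_id, *args)
            if fault_hook is not None:
                value = fault_hook(physical_id, value)
            results.append(value)
        if results[0] == results[1]:
            out[logical_id] = results[0]         # copies agree: commit the store
        else:
            mismatches.append(logical_id)        # detected error: do not commit
    return out, mismatches
```

Calling `rmt_launch(saxpy_item, 3, 2.0, x, y)` with no `fault_hook` returns the ordinary results and an empty mismatch list. Passing a hook that alters the value of a single physical thread causes the corresponding logical work-item to be reported instead of stored. On a real GPU the comparison step requires communication between the redundant threads, which the abstract names as one of the cost components it analyzes.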

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.1145/2678373.2665686

References: 35 articles.

1. LLVM. [Online]. Available: http://llvm.org

2. S. Ahern, A. Shoshani, K.-L. Ma, A. Choudhary, T. Critchlow, S. Klasky, V. Pascucci, J. Ahrens, E. W. Bethel, H. Childs, J. Huang, K. Joy, Q. Koziol, G. Lofstead, J. S. Meredith, K. Moreland, G. Ostrouchov, M. Papka, V. Vishwanath, M. Wolf, N. Wright, and K. Wu. Scientific Discovery at the Exascale: a Report from the DOE ASCR 2011 Workshop on Exascale Data Management, Analysis, and Visualization. 2011.

3. AMD. AMD CodeXL. Available: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/codexl/

4. AMD. AMD Graphics Cores Next (GCN) Architecture. Available: http://www.amd.com/us/Documents/GCN_Architecture_whitepaper.pdf

5. AMD. OpenCL Accelerated Parallel Processing (APP) SDK. Available: http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/

Cited by 5 articles.

1. Survey on Redundancy Based-Fault tolerance methods for Processors and Hardware accelerators - Trends in Quantum Computing, Heterogeneous Systems and Reliability;ACM Computing Surveys;2024-06-28

2. Understanding and Improving GPUs' Reliability Combining Beam Experiments with Fault Simulation;2023 IEEE International Test Conference (ITC);2023-10-07

3. Software-controlled pipeline parity in GPU architectures for error detection;Microelectronics Reliability;2023-09

4. The Encountered Problems and Solutions in the Development of Coal Mine Rescue Robot;Journal of Robotics;2018-10-24

5. Compiler Techniques to Reduce the Synchronization Overhead of GPU Redundant Multithreading;Proceedings of the 54th Annual Design Automation Conference 2017;2017-06-18