Affiliation:
1. Northeastern University, Boston, MA
Abstract
The vulnerability of GPUs to soft errors has become a first-class design concern as GPUs are increasingly used in accuracy-sensitive and safety-critical domains. Existing solutions for enhancing GPU reliability come with significant overhead in area, power, and/or performance. In this article, we propose ArmorAll, a lightweight, adaptive, selective, and portable software solution to protect GPUs against soft errors. ArmorAll consists of a set of purely compiler-based redundancy schemes designed to optimize instruction duplication on GPUs, thereby enabling much more reliable execution. The choice of scheme determines the subset of instructions that must be duplicated in an application, allowing fault coverage to be adapted to different applications. ArmorAll can select, with an accuracy of 91.7%, the redundancy scheme that provides the best coverage for a given application. This high coverage comes with an average runtime improvement of 64.5% over the state of the art when using the selected redundancy scheme.
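To make the idea of instruction duplication concrete, the following is a minimal Python sketch of the general technique: a computation is executed twice and the two results are compared to detect a soft error. It is not ArmorAll's compiler pass or any of its redundancy schemes. The names `flip_bit`, `SoftErrorDetected`, and `saxpy_element`, and the `fault_bit` parameter used to simulate a bit flip, are all hypothetical and introduced only for illustration.

```python
class SoftErrorDetected(Exception):
    """Raised when the primary and duplicated results disagree."""


def flip_bit(value: int, bit: int) -> int:
    # Model a soft error as a single bit flip in an integer value.
    return value ^ (1 << bit)


def saxpy_element(a: int, x: int, y: int, duplicate: bool = True,
                  fault_bit=None) -> int:
    # Primary instruction stream: compute a * x + y.
    primary = a * x + y

    # Hypothetical fault injection: corrupt only the primary result.
    if fault_bit is not None:
        primary = flip_bit(primary, fault_bit)

    if duplicate:
        # Duplicated instruction stream: recompute the same value
        # and compare it against the primary result.
        shadow = a * x + y
        if primary != shadow:
            raise SoftErrorDetected(
                f"mismatch: primary={primary}, shadow={shadow}")

    return primary
```

With duplication enabled, an injected bit flip raises `SoftErrorDetected`; with duplication disabled, the corrupted value is returned silently. A selective scheme of the kind the abstract describes would duplicate only a chosen subset of instructions, trading coverage against runtime overhead.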
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
25 articles.