Affiliation:
1. Northeastern University, Boston, MA
Abstract
The vulnerability of GPUs to soft errors has become a first-class design concern as GPUs are increasingly used in accuracy-sensitive and safety-critical domains. Existing solutions for enhancing GPU reliability come with significant overhead in area, power, and/or performance. In this article, we propose ArmorAll, a lightweight, adaptive, selective, and portable software solution to protect GPUs against soft errors. ArmorAll consists of a set of purely compiler-based redundancy schemes designed to optimize instruction duplication on GPUs, thereby enabling much more reliable execution. The choice of scheme determines the subset of instructions that must be duplicated in an application, allowing adaptable fault coverage for different applications. ArmorAll can intelligently select, with an accuracy of 91.7%, the redundancy scheme that provides the best coverage for an application. This high coverage comes with an average runtime improvement of 64.5% over the state of the art when the selected redundancy scheme is used.
Publisher
Association for Computing Machinery (ACM)
Subject
Hardware and Architecture, Information Systems, Software
Cited by
15 articles.
1. Demystifying and Mitigating Cross-Layer Deficiencies of Soft Error Protection in Instruction Duplication;Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis;2023-11-11
2. Characterizing Runtime Performance Variation in Error Detection by Duplicating Instructions;2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE);2023-10-09
3. Software-controlled pipeline parity in GPU architectures for error detection;Microelectronics Reliability;2023-09
4. Evaluating an XOR-based Hybrid Fault Tolerance Technique to Detect Faults in GPU Pipelines;2023 IEEE Computer Society Annual Symposium on VLSI (ISVLSI);2023-06-20
5. Investigating the Impact of High-Level Software Design on Low-Level Hardware Fault Resilience;2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks - Supplemental Volume (DSN-S);2023-06