Assessing the Impact of Compiler Optimizations on GPUs Reliability

Published: 2024-01-12
ISSN:1544-3566
Container-title:ACM Transactions on Architecture and Code Optimization
language:en
Short-container-title:ACM Trans. Archit. Code Optim.

Author:

Fernando Fernandes dos Santos¹, Luigi Carro², Flavio Vella³, Paolo Rech³

Affiliation:

1. Univ Rennes, INRIA, Rennes, France

2. Institute of Informatics, Federal University of Rio Grande do Sul, Brazil

3. University of Trento, Italy

Abstract

Graphics Processing Unit (GPU) compilers have evolved to support general-purpose programming languages for multiple architectures. The NVIDIA CUDA Compiler (NVCC) passes through many compilation levels before generating machine code and applies complex optimizations to improve performance. These optimizations modify how the software is mapped onto the underlying hardware; thus, as we show in this paper, they can also affect GPU reliability. We evaluate how the optimization flags applied at the NVCC Parallel Thread Execution (PTX) compiling phase affect the GPU error rate by analyzing two NVIDIA GPU architectures (Kepler and Volta) and two compiler versions (NVCC 10.2 and 11.3). We compare and combine fault propagation analysis based on software fault injection, hardware utilization distribution obtained with application-level profiling, and the radiation-induced error rate of machine instructions measured with beam experiments. We consider eight different workloads and 144 combinations of compilation flags, and we show that optimizations can change the GPU error rate by up to an order of magnitude. Additionally, through accelerated neutron beam experiments on an NVIDIA Kepler GPU, we show that the error rate of the unoptimized GEMM (-O0 flag) is lower than that of the optimized GEMM (-O3 flag). When performance is evaluated together with the error rate, we show that the most optimized versions (-O1 and -O3) always produce a larger amount of correct data than the unoptimized code (-O0).
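The last point of the abstract, that a build with a higher error rate can still deliver more correct data, can be illustrated with a minimal sketch. It assumes the trade-off is measured as the mean number of executions completed between two failures; the abstract does not name the metric the paper uses. The function name and all numbers below are hypothetical and are not taken from the paper.

```python
def executions_between_failures(failures_per_hour: float, exec_time_s: float) -> float:
    """Mean number of executions completed between two failures.

    Assumed metric for illustration only: mean time between failures
    divided by the execution time of one run.
    """
    if failures_per_hour <= 0 or exec_time_s <= 0:
        raise ValueError("failure rate and execution time must be positive")
    mtbf_s = 3600.0 / failures_per_hour  # mean time between failures, in seconds
    return mtbf_s / exec_time_s


# Hypothetical values (NOT measurements from the paper):
# flag -> (failures per hour, execution time in seconds)
builds = {
    "-O0": (1.0, 10.0),  # lower error rate, slower code
    "-O3": (2.0, 2.0),   # twice the error rate, five times faster
}

results = {flag: executions_between_failures(rate, t) for flag, (rate, t) in builds.items()}

for flag, value in results.items():
    print(f"{flag}: {value:.0f} executions between failures")
```

With these made-up inputs the optimized build fails twice as often per hour but completes more runs between failures, because each run is much shorter. This is the kind of effect the abstract describes, shown here under the stated assumptions only.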

Publisher

Association for Computing Machinery (ACM)

Subject

Hardware and Architecture, Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/3638249
