Affiliation:
1. College of Computer Science and Technology, Jilin University, Changchun, China
2. Jilin University, Changchun, China
3. the Beacom College of Computer and Cyber Sciences, Dakota State University, Madison, United States
Abstract
To satisfy prohibitively massive computational requirements of current deep Convolutional Neural Networks (CNNs), CNN-specific accelerators are widely deployed in large-scale systems. Caused by high-energy neutrons and α-particle strikes, soft error may lead to catastrophic failures when CNN is deployed on high integration density accelerators. As CNNs become ubiquitous in mission-critical domains, ensuring the reliable execution of CNN accelerators in the presence of soft errors is increasingly essential.
In this article, we propose to
Re
cycle
I
dle
P
rocessing
E
lements (PEs) in the CNN accelerator for vulnerable filters soft error detection (ReIPE). Considering the error-sensitivity of filters, ReIPE first carries out a filter-level gradient analysis process to replace fault injection for fast filter-wise error resilience estimation. Then, to achieve maximal reliability benefits, combining the hardware-level systolic array idleness and software-level CNN filter-wise error resilience profile, ReIPE preferentially duplicated loads the most vulnerable filters onto systolic array to recycle idle-column PEs for opportunistically redundant execution (error detection). Exploiting the data reuse properties of accelerators, ReIPE incorporates the error detection process into the original computation flow of accelerators to perform real-time error detection. Once the error is detected, ReIPE will trigger a correction round to rectify the erroneous output. Experimental results performed on LeNet-5, Cifar-10-CNN, AlexNet, ResNet-20, VGG-16, and ResNet-50 exhibit that ReIPE can cover 96.40% of errors while reducing 75.06% performance degradation and 67.79% energy consumption of baseline dual modular redundancy on average. Moreover, to satisfy the reliability requirements of various application scenarios, ReIPE is also applicable for pruned, quantized, and Transformer-based models, as well as portable to other accelerator architectures.
Funder
National Natural Science Foundation of China
National Key Research and Development Program of China
Graduate Innovation Fund of Jilin University
Publisher
Association for Computing Machinery (ACM)
Reference56 articles.
1. Impact of Voltage Scaling on Soft Errors Susceptibility of Multicore Server CPUs
2. Alireza Amirshahi, Joshua Alexander Harrison Klein, Giovanni Ansaloni, and David Atienza. 2023. TiC-SAT: Tightly-coupled systolic accelerator for transformers. In Proceedings of the 28th Asia and South Pacific Design Automation Conference. 657–663.
3. Arash Azizimazreah, Yongbin Gu, Xiang Gu, and Lizhong Chen. 2018. Tolerating soft errors in deep learning accelerators with reliable on-chip memory designs. In Proceedings of the 2018 IEEE International Conference on Networking, Architecture, and Storage (NAS’18). 1–10.
4. Leonardo Bautista-Gomez, Ferad Zyulkyarov, Osman Unsal, and Simon McIntosh-Smith. 2016. Unprotected computing: A large-scale study of DRAM raw error rate on a supercomputer. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC’16). IEEE, 645–655.
5. Cristiana Bolchini, Luca Cassano, Antonio Miele, and Alessandro Nazzari. 2022. Selective hardening of CNNs based on layer vulnerability estimation. In Proceedings of the 2022 IEEE International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT’22). IEEE, 1–6.