ReIPE: Recycling Idle PEs in CNN Accelerator for Vulnerable Filters Soft-Error Detection

Author:

Wei Xiaohui (1), Wang Chenyang (1), Yue Hengshan (1), Tan Jingweijia (1), Guan Zeyu (2), Jiang Nan (1), Zheng Xinyang (1), Zhao Jianpeng (1), Qiu Meikang (3)

Affiliation:

1. College of Computer Science and Technology, Jilin University, Changchun, China

2. Jilin University, Changchun, China

3. The Beacom College of Computer and Cyber Sciences, Dakota State University, Madison, United States

Abstract

To meet the massive computational requirements of current deep Convolutional Neural Networks (CNNs), CNN-specific accelerators are widely deployed in large-scale systems. Soft errors, caused by high-energy neutron and α-particle strikes, may lead to catastrophic failures when CNNs are deployed on accelerators with high integration density. As CNNs become ubiquitous in mission-critical domains, ensuring the reliable execution of CNN accelerators in the presence of soft errors is increasingly essential. In this article, we propose ReIPE, which Recycles Idle Processing Elements (PEs) in the CNN accelerator for soft-error detection on vulnerable filters. Considering the error sensitivity of filters, ReIPE first performs a filter-level gradient analysis, in place of fault injection, for fast filter-wise error-resilience estimation. Then, to maximize reliability benefits, ReIPE combines the hardware-level systolic array idleness with the software-level filter-wise error-resilience profile: it preferentially loads duplicate copies of the most vulnerable filters onto the systolic array, recycling idle-column PEs for opportunistic redundant execution (error detection). Exploiting the data reuse properties of accelerators, ReIPE incorporates error detection into the accelerator's original computation flow to detect errors in real time. Once an error is detected, ReIPE triggers a correction round to rectify the erroneous output. Experimental results on LeNet-5, Cifar-10-CNN, AlexNet, ResNet-20, VGG-16, and ResNet-50 show that ReIPE covers 96.40% of errors while reducing the performance degradation and energy consumption of baseline dual modular redundancy by 75.06% and 67.79% on average, respectively. Moreover, to satisfy the reliability requirements of various application scenarios, ReIPE is also applicable to pruned, quantized, and Transformer-based models, and is portable to other accelerator architectures.
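
The idle-column recycling idea in the abstract can be illustrated with a small software model. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`plan_columns`, `run_columns`, `detect_and_correct`), the vulnerability scores, the array width, and the fault model (an additive error on one column's accumulator) are all hypothetical. The scores stand in for the output of the gradient analysis, which is not reproduced here.

```python
# Hypothetical sketch: all names, sizes, and scores are illustrative
# assumptions, not taken from the paper.

def plan_columns(vulnerability, num_columns):
    """Map filters to array columns, then fill idle columns with
    duplicates of the most vulnerable filters."""
    num_filters = len(vulnerability)
    assert num_filters <= num_columns
    columns = list(range(num_filters))  # column i holds filter i
    idle = num_columns - num_filters
    ranked = sorted(range(num_filters),
                    key=lambda f: vulnerability[f], reverse=True)
    return columns + ranked[:idle]      # duplicates occupy idle columns


def dot(weights, window):
    return sum(w * x for w, x in zip(weights, window))


def run_columns(columns, filters, window, fault=None):
    """One dot product per column; fault=(column, delta) injects an error."""
    outputs = []
    for col, f in enumerate(columns):
        acc = dot(filters[f], window)
        if fault is not None and fault[0] == col:
            acc += fault[1]
        outputs.append(acc)
    return outputs


def detect_and_correct(columns, outputs, filters, window):
    """Compare each duplicate column with its primary column; on a
    mismatch, recompute that filter (the correction round)."""
    num_filters = len(filters)
    result = outputs[:num_filters]
    detected = []
    for col in range(num_filters, len(columns)):
        f = columns[col]
        if outputs[col] != result[f]:
            detected.append(f)
            result[f] = dot(filters[f], window)
    return result, detected


filters = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
vulnerability = [0.2, 0.9, 0.5]  # assumed scores, higher = more vulnerable
window = [1, 0, 2]

columns = plan_columns(vulnerability, num_columns=5)
faulty = run_columns(columns, filters, window, fault=(1, 64))
result, detected = detect_and_correct(columns, faulty, filters, window)

# A fault on an unduplicated filter goes unnoticed in this model.
missed = run_columns(columns, filters, window, fault=(0, 64))
missed_result, missed_detected = detect_and_correct(columns, missed, filters, window)
```

With three filters on a five-column array, only the two highest-scoring filters get a redundant copy, so a fault on filter 1 is caught and corrected while a fault on filter 0 passes through. This is consistent with the partial (rather than full) error coverage the abstract reports, and it shows why the choice of which filters to duplicate matters.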

Funder

National Natural Science Foundation of China

National Key Research and Development Program of China

Graduate Innovation Fund of Jilin University

Publisher

Association for Computing Machinery (ACM)

