Transient Fault Detection in Tensor Cores for Modern GPUs

Author:

Hafezan Mohammad Hassan¹, Atoofian Ehsan²

Affiliation:

1. Electrical and Computer Engineering, Lakehead University, Thunder Bay, Canada

2. Electrical and Computer Engineering, Lakehead University, Thunder Bay, Canada

Abstract

Deep neural networks (DNNs) have emerged as an effective solution for many machine learning applications. However, this success comes at the cost of excessive computation. The Volta graphics processing unit (GPU) from NVIDIA introduced a specialized hardware unit called the tensor core (TC), aimed at meeting the growing computational demand of DNNs. Most previous studies on TCs have focused on improving performance by exploiting the TC's high degree of parallelism. However, as DNNs are deployed in security-sensitive applications such as autonomous driving, the reliability of TCs is as important as their performance. In this work, we exploit the unique architectural characteristics of TCs and propose a simple and implementation-efficient hardware technique called fault detection in tensor core (FDTC) to detect transient faults in TCs. In particular, FDTC exploits the zero-valued weights that stem from network pruning, as well as the sparse activations produced by the common ReLU operator, to verify tensor operations. A high level of sparsity in tensors allows FDTC to run original and verifying products simultaneously, leading to zero performance penalty. For applications with a low sparsity rate, FDTC relies on temporal redundancy to re-execute effectual products, and it schedules the execution of verifying products only when multipliers are idle. Our experimental results reveal that FDTC offers 100% fault coverage with no performance penalty and a small energy overhead in TCs.
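
The idea in the abstract, reusing multipliers left idle by zero-valued operands to recompute and compare the effectual products, can be illustrated with a small software model. This is only a sketch under stated assumptions and not the FDTC hardware design: the function names (`exact_mul`, `checked_dot`), the lane-per-element layout, the fault-free redundant copy, and the way extra passes are counted are all hypothetical choices made for illustration.

```python
from typing import Callable, List, Tuple


def exact_mul(x: float, y: float) -> float:
    """Fault-free multiplier model, used here for the verifying copy (assumption)."""
    return x * y


def checked_dot(
    a: List[float],
    b: List[float],
    primary_mul: Callable[[int, float, float], float],
) -> Tuple[float, bool, int]:
    """Return (dot_product, fault_detected, extra_passes).

    primary_mul(lane, x, y) models the multiplier in a given lane and may
    inject a fault. Each effectual product (both operands nonzero) is
    recomputed by a redundant copy and the two results are compared.
    """
    if len(a) != len(b):
        raise ValueError("operand vectors must have the same length")
    lanes = len(a)
    if lanes == 0:
        return 0.0, False, 0

    # A lane is idle when its product is ineffectual (one operand is zero).
    effectual = [i for i in range(lanes) if a[i] != 0 and b[i] != 0]
    idle = lanes - len(effectual)

    total = 0.0
    fault = False
    for i in effectual:
        product = primary_mul(i, a[i], b[i])
        if product != exact_mul(a[i], b[i]):  # redundant copy disagrees
            fault = True
        total += product

    # Verifying products that fit on idle lanes cost nothing extra; the rest
    # are modelled as re-executed in later passes (temporal redundancy).
    leftover = max(0, len(effectual) - idle)
    extra_passes = -(-leftover // lanes)  # ceiling division
    return total, fault, extra_passes


def healthy(lane: int, x: float, y: float) -> float:
    return x * y


def faulty_lane0(lane: int, x: float, y: float) -> float:
    # Hypothetical transient fault: lane 0 returns a corrupted product.
    return x * y + 1.0 if lane == 0 else x * y
```

With a sparse input such as `a = [1.0, 0.0, 2.0, 0.0]` and `b = [3.0, 5.0, 0.0, 4.0]`, only one product is effectual, so in this model its verifying copy fits on an idle lane and no extra pass is needed. A fully dense input leaves no idle lanes, and the model falls back to one extra pass.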

Publisher

Association for Computing Machinery (ACM)
