Abstract
Deep learning technology has enabled the development of increasingly complex safety-related autonomous systems using high-performance computers such as graphics processing units (GPUs), which provide the high computing performance required to execute parallel computing algorithms such as matrix–matrix multiplication (a central computing element of deep learning software libraries). However, the safety certification of parallel computing software algorithms and of GPU-based safety-related systems remains a challenge, for example, achieving the required fault tolerance and diagnostic coverage for random hardware errors. This paper contributes a safe matrix–matrix multiplication software implementation for GPUs with random hardware error-detection capabilities (permanent, transient) that can be used with different architectural patterns for fault tolerance and that serves as a foundation for the implementation of safe deep learning libraries for GPUs. The proposed contribution is complementary to, and can be combined with, other techniques such as algorithm-based fault tolerance. In particular, (i) we extend the high-performance CUTLASS matrix multiplication library with a catalog of diagnostic mechanisms that detect random hardware errors down to the arithmetic operation level; and (ii) we measure the performance impact incurred by the adoption of these mechanisms and the diagnostic coverage they achieve for a set of representative matrix dimensions. To that end, we implement these algebraic operations targeting CUDA cores with single-instruction, multiple-thread (SIMT) math instructions on an NVIDIA Xavier NX GPU.
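To make the idea of arithmetic-level error detection concrete, the following is a minimal illustrative sketch of one possible diagnostic mechanism: a naive GEMM kernel that duplicates each multiply-accumulate and compares the two accumulators, raising a flag on mismatch. It is not the paper's CUTLASS-based implementation; the kernel name, the global error flag, and the duplicate-accumulator scheme are assumptions made purely for illustration, and a real implementation would have to force genuinely independent instructions (a compiler may otherwise fold the duplicated arithmetic into a single operation).

```cuda
// Illustrative sketch only: duplication-and-comparison GEMM (row-major, float).
// NOT the paper's CUTLASS-based code; names and error-reporting scheme are
// assumptions for illustration of arithmetic-level random-error detection.
#include <cuda_runtime.h>

__global__ void gemm_dup_check(const float* A, const float* B, float* C,
                               int M, int N, int K, int* errorFlag)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= M || col >= N) return;

    float acc1 = 0.0f;  // primary accumulator
    float acc2 = 0.0f;  // redundant accumulator (duplicated arithmetic)
    for (int k = 0; k < K; ++k) {
        float a = A[row * K + k];
        float b = B[k * N + col];
        acc1 += a * b;  // primary multiply-accumulate
        acc2 += a * b;  // duplicated multiply-accumulate for comparison
    }
    // A mismatch between the two independently accumulated results indicates
    // a random hardware error (e.g., a transient fault) in the arithmetic.
    if (acc1 != acc2) {
        atomicExch(errorFlag, 1);  // signal detection to the host
    }
    C[row * N + col] = acc1;
}
```

After the kernel completes, the host would copy `errorFlag` back and, if it is set, trigger the reaction foreseen by the chosen fault-tolerance architectural pattern (e.g., recomputation or switching to a redundant channel); this host-side handling is likewise only an assumed usage pattern, not described in the abstract.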
Subject
Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science
Cited by
2 articles.