1. Measuring and Understanding Extreme-Scale Application Resilience: A Field Study of 5,000,000 HPC Application Runs;Di Martino;45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks,2015
2. Understanding GPU errors on large-scale HPC systems and the implications for system design and operation;D Tiwari;IEEE 21st International Symposium on High Performance Computer Architecture (HPCA),2015
3. Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation;L Bautista-Gomez;IEEE International Conference on Cluster Computing,2015
4. Fault Tolerance in Distributed Neural Computing;A Kulakov;ArXiv,2015
5. Investigating the Fault Tolerance of Neural Networks;E B Tchernev;Neural Computation,2005