Building and utilizing fault tolerance support tools for the GASPI applications

Published:2016-11-28 Issue:5 Volume:32 Page:613-626
ISSN:1094-3420
Container-title:The International Journal of High Performance Computing Applications
language:en

Author:

Shahzad Faisal¹, Kreutzer Moritz¹, Zeiser Thomas¹, Machado Rui², Pieper Andreas¹³, Hager Georg¹, Wellein Gerhard¹

Affiliation:

1. Erlangen Regional Computing Center, University of Erlangen–Nuremberg, Erlangen, Germany

2. Fraunhofer Institute for Industrial Mathematics (ITWM), Fraunhofer Platz 1, Kaiserslautern, Germany

3. Institute of Physics, University of Greifswald, Greifswald, Germany

Abstract

Today’s high performance computing systems are made possible by multiple increases in hardware parallelism. As a result, the mean time to failure of these systems decreases with each new generation, which is an alarming trend. It is therefore not surprising that a great deal of research is under way in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover at minimal cost. We have used the Global Address Space Programming Interface (GASPI), a relatively new communication library based on the PGAS model. It fulfills the basic requirement of a fault tolerant communication library, i.e. the failure of a process does not cause the remaining processes to fail. This work focuses on extending the fault tolerance features of GASPI in the form of a supporting health-check library that applications can benefit from. These features include failure detection, propagation of failure information, recovery management, and communication recovery. To reinforce its utility, we have also developed a fault tolerant neighbor node-level checkpoint/restart library. Instead of introducing algorithm-based fault tolerance in its true sense, we demonstrate how, using these supplementary fault tolerance functions, one can build applications that integrate a low-cost fault detection/recovery mechanism and, if necessary, recover on the fly. We showcase the usage of these tools by integrating them into three different applications. Two of the applications fall into the category of sparse linear solvers, whereas the third application is based on a fluid flow solver. We also analyze the overheads involved in failure-free cases as well as in various failure cases. Our fault detection mechanism causes no overhead in failure-free cases, whereas in case of failure(s), the failure detection and recovery cost is of an acceptable order and shows good scalability.

Publisher

SAGE Publications

Subject

Hardware and Architecture,Theoretical Computer Science,Software

Link

http://journals.sagepub.com/doi/pdf/10.1177/1094342016677085

References: 26 articles.

1. Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver

2. Application-Specific Fault Tolerance via Data Access Characterization

3. FTI

4. Fault tolerance for remote memory access programming models

5. A Model for Collision Processes in Gases. I. Small Amplitude Processes in Charged and Neutral One-Component Systems

Cited by 2 articles.

1. DiPOSH: A portable OpenSHMEM implementation for short API‐to‐network path;Concurrency and Computation: Practice and Experience;2021-02-04

2. Checkpointing OpenSHMEM Programs Using Compiler Analysis;2020 IEEE/ACM 10th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS);2020-11