Process Fault Tolerance: Semantics, Design and Applications for High Performance Computing

Published: 2005-11, Volume: 19, Issue: 4, Pages: 465-477
ISSN:1094-3420
Container-title:The International Journal of High Performance Computing Applications
language:en
Short-container-title:The International Journal of High Performance Computing Applications

Author:

Fagg Graham E.¹, Gabriel Edgar², Chen Zizhong, Angskun Thara, Bosilca George, Pjesivac-Grbovic Jelena, Dongarra Jack J.¹

Affiliation:

1. Innovative Computing Laboratory, Computer Science Department, University of Tennessee, Knoxville, TN 37996-3450, USA

2. High Performance Computing Center Stuttgart, University of Stuttgart, D-70550 Stuttgart, Germany, and Innovative Computing Laboratory, Computer Science Department, University of Tennessee, Knoxville, TN 37996-3450, USA

Abstract

With increasing numbers of processors on current machines, the probability of node or link failures is also increasing. Application-level fault tolerance is therefore becoming an increasingly important issue for both end-users and the institutions running the machines. In this paper we present the semantics of a fault-tolerant version of the Message Passing Interface (MPI), the de facto standard for communication in scientific applications, which gives applications the possibility to recover from a node or link error and continue execution in a well-defined way. We present the architecture of fault-tolerant MPI, an implementation of MPI using the semantics presented above, as well as benchmark results with various applications. We also detail an example of a fault-tolerant parallel equation solver, its performance results, and the time needed to recover from a process failure.

Publisher

SAGE Publications

Subject

Hardware and Architecture, Theoretical Computer Science, Software

Link

http://journals.sagepub.com/doi/pdf/10.1177/1094342005056137

References: 20 articles.

1. HARNESS: a next generation distributed virtual machine

Cited by 21 articles.

1. Rollback-Free Recovery for a High Performance Dense Linear Solver With Reduced Memory Footprint;IEEE Transactions on Parallel and Distributed Systems;2024-07

2. TeaMPI—Replication-Based Resilience Without the (Performance) Pain;Lecture Notes in Computer Science;2020

3. Fault Tolerance Techniques for Distributed, Parallel Applications;Innovative Research and Applications in Next-Generation High Performance Computing;2016

4. Fail-Stop Failure Algorithm-Based Fault Tolerance for Cholesky Decomposition;IEEE Transactions on Parallel and Distributed Systems;2015-05-01

5. Automating fault tolerance in high-performance computational biological jobs using multi-agent approaches;Computers in Biology and Medicine;2014-05