Epidemic failure detection and consensus for extreme parallelism

Published:2017-02-01 Issue:5 Volume:32 Page:729-743
ISSN:1094-3420
Container-title:The International Journal of High Performance Computing Applications
language:en
Short-container-title:The International Journal of High Performance Computing Applications

Author:

Katti Amogh¹,Di Fatta Giuseppe¹,Naughton Thomas²,Engelmann Christian²

Affiliation:

1. Department of Computer Science, University of Reading, UK

2. Computer Science and Mathematics Division, Oak Ridge National Laboratory, USA

Abstract

Future extreme-scale high-performance computing systems will be required to work under frequent component failures. The MPI Forum’s User Level Failure Mitigation proposal has introduced an operation, MPI_Comm_shrink, to synchronize the alive processes on the list of failed processes, so that applications can continue to execute even in the presence of failures by adopting algorithm-based fault tolerance techniques. This MPI_Comm_shrink operation requires a failure detection and consensus algorithm. This paper presents three novel failure detection and consensus algorithms using Gossiping. Stochastic pinging is used to quickly detect failures during the execution of the algorithm; failures are then disseminated to all the fault-free processes in the system, and consensus on the failures is detected using the three consensus techniques. The proposed algorithms were implemented and tested using the Extreme-scale Simulator. The results show that the stochastic pinging detects all the failures in the system. In all the algorithms, the number of Gossip cycles to achieve global consensus scales logarithmically with system size. The second algorithm also shows better scalability in terms of memory and network bandwidth usage and a perfect synchronization in achieving global consensus. The third approach is a three-phase distributed failure detection and consensus algorithm and provides consistency guarantees even in very large and extreme-scale systems while at the same time being memory and bandwidth efficient.
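The following Python sketch illustrates the general idea the abstract describes: random pinging to detect failures, followed by gossip dissemination of the failed-process list. It is not one of the paper's three algorithms. The function name `gossip_cycles`, the push-only gossip rule, and the global check for agreement (made by an outside observer rather than by the processes themselves) are all assumptions made for illustration.

```python
import random

def gossip_cycles(n, failed, seed=0, max_cycles=1000):
    """Toy simulation (hypothetical, not the paper's algorithm).

    Each cycle, every alive process pings one uniformly random peer.
    A ping to a failed peer is treated as a detected failure; a ping to
    an alive peer pushes the sender's known-failure list to that peer.
    Returns (cycles, known) once every alive process knows all failures.
    """
    rng = random.Random(seed)
    failed = set(failed)
    alive = [p for p in range(n) if p not in failed]
    known = {p: set() for p in alive}  # local failed-process list per alive process

    for cycle in range(1, max_cycles + 1):
        for p in alive:
            # Pick a peer other than p, uniformly at random.
            q = rng.randrange(n - 1)
            if q >= p:
                q += 1
            if q in failed:
                known[p].add(q)        # ping unanswered: record the failure
            else:
                known[q] |= known[p]   # push gossip: merge lists at the receiver
        # Assumption: an omniscient observer checks agreement here. The paper's
        # algorithms instead detect consensus in a distributed way.
        if all(known[p] == failed for p in alive):
            return cycle, known
    raise RuntimeError("no agreement within max_cycles")

if __name__ == "__main__":
    for size in (64, 256, 1024):
        cycles, _ = gossip_cycles(size, failed={1, 2, 3}, seed=42)
        print(size, cycles)
```

Running the script prints the cycle count for a few system sizes, which gives a rough feel for how dissemination time grows with the number of processes in this simplified model.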

Publisher

SAGE Publications

Subject

Hardware and Architecture,Theoretical Computer Science,Software

Link

http://journals.sagepub.com/doi/pdf/10.1177/1094342017690910

Reference: 29 articles.

1. Resilient gossip algorithms for collecting online management information in exascale clusters

2. Radiation-induced soft errors in advanced semiconductor technologies

3. xSim: The extreme-scale simulator

4. Failure Detection and Propagation in HPC systems

Cited by 5 articles.

1. To improve scalability with Boolean matrix using efficient gossip failure detection and consensus algorithm for PeerSim simulator in IoT environment;International Journal of Information Technology;2022-05-24

2. MATCH: An MPI Fault Tolerance Benchmark Suite;2020 IEEE International Symposium on Workload Characterization (IISWC);2020-10

3. MARBLE: A Multi-GPU Aware Job Scheduler for Deep Learning on HPC Systems;2020 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID);2020-05

4. Reinit++: Evaluating the Performance of Global-Restart Recovery Methods for MPI Fault Tolerance;Lecture Notes in Computer Science;2020

5. Robust Epidemic Aggregation Under Churn;Internet and Distributed Computing Systems;2018