Author:
Zhang Guozhen, Liu Yi, Yang Hailong, Xu Jun, Qian Depei
Publisher
Springer Science and Business Media LLC
Subject
General Computer Science, Theoretical Computer Science
Reference
40 articles.
1. Egwutuoha I P, Levy D, Selic B, Chen S. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. The Journal of Supercomputing, 2013, 65: 1302–1326
2. Lu C D. Failure data analysis of HPC systems. 2013, arXiv preprint arXiv:1302.4779
3. Cappello F, Geist A, Gropp W D, Kale L V, Kramer W T, Snir M. Toward exascale resilience. International Journal of High Performance Computing Applications, 2009, 23: 374–385
4. Bertier M, Marin O, Sens P. Performance analysis of a hierarchical failure detector. In: Proceedings of the 2003 International Conference on Dependable Systems and Networks. 2003, 635–644
5. Luecke G R, Zou Y, Coyle J, Hoekstra J, Kraeva M. Deadlock detection in MPI programs. Concurrency and Computation: Practice and Experience, 2002, 14: 911–932