1. [1] S. Amarasinghe, D. Campbell, W. Carlson, A. Chien, W. Dally, E. Elnozahy, M. Hall, R. Harrison, W. Harrod, and K. Hill, “ExaScale Software Study: Software Challenges in Extreme Scale Systems,” DARPA IPTO, Air Force Research Labs, Tech. Rep., pp.1-153, 2009.
2. [2] B. Schroeder and G.A. Gibson, “Understanding Failures in Petascale Computers,” Journal of Physics: Conference Series, vol.78, no.1, pp.12-22, 2007. 10.1088/1742-6596/78/1/012022
3. [3] E.N.M. Elnozahy, L. Alvisi, Y.M. Wang, and D.B. Johnson, “A Survey of Rollback-recovery Protocols in Message-passing Systems,” ACM Comput. Surv., vol.34, no.3, pp.375-408, 2002. 10.1145/568522.568525
4. [4] J. Hursey, “Coordinated Checkpoint/Restart Process Fault Tolerance for MPI Applications on HPC Systems,” PhD thesis, Indiana University, 2010.
5. [5] J.C. Sancho, F. Petrini, G. Johnson, J. Fernandez, and E. Frachtenberg, “On the Feasibility of Incremental Checkpointing for Scientific Computing,” Proceedings of IPDPS 2004, pp.58-67, 2004. 10.1109/ipdps.2004.1302982