1. Gropp, W., Lusk, E., Skjellum, A.: Using MPI, 2nd edn. MIT Press, Cambridge (1999)
2. Gropp, W., Lusk, E.: Fault tolerance in message passing interface programs. International Journal of High Performance Computing Applications 18(3), 363–372 (2004)
3. Huang, C.: System support for checkpoint and restart of Charm++ and AMPI applications. Master’s thesis, Dep. of Computer Science, University of Illinois, Urbana, IL (2004), Available at: http://charm.cs.uiuc.edu/papers/CheckpointThesis.html
4. Zheng, G., Shi, L., Kalé, L.V.: FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI. In: 2004 IEEE International Conference on Cluster Computing, San Diego, CA (2004)
5. Chakravorty, S., Kalé, L.V.: A fault tolerant protocol for massively parallel machines. In: FTPDS Workshop at IPDPS 2004, Santa Fe, NM. IEEE Press, Los Alamitos (2004)