1. Fault-tolerance Techniques for High-performance Computing;Herault,2015
2. DMTCP: Transparent checkpointing for cluster computations and the desktop;Ansel,2009
3. Berkeley Lab checkpoint/restart (BLCR) for Linux clusters;Hargrove;J. Phys. Conf. Ser.,2006
4. A survey of checkpoint/restart techniques on distributed memory systems;Shahzad;Parallel Process. Lett.,2013
5. Toward exascale resilience: 2014 update;Cappello;Supercomput. Front. Innov.,2014