Task-Level Resilience: Checkpointing vs. Supervision
Published: 2022
Issue: 1
Volume: 12
Page: 47-72
ISSN: 2185-2839
Container-title: International Journal of Networking and Computing
Language: en
Short-container-title: IJNC
Authors: Jonas Posner, Lukas Reitz, Claudia Fohry
Publisher: IJNC Editorial Committee
References: 78 articles.
[1] Jonas Posner, Lukas Reitz, and Claudia Fohry. Checkpointing vs. supervision resilience approaches for dynamic independent tasks. In Proc. Int. Parallel and Distributed Processing Symp. (IPDPS) Workshops (APDCM). IEEE, 2021.
[2] Marc Snir, Robert W Wisniewski, Jacob A Abraham, Sarita V Adve, Saurabh Bagchi, Pavan Balaji, Jim Belak, Pradip Bose, Franck Cappello, Bill Carlson, Andrew A Chien, Paul Coteus, Nathan A DeBardeleben, Pedro C Diniz, Christian Engelmann, Mattan Erez, Saverio Fazzari, Al Geist, Rinku Gupta, Fred Johnson, Sriram Krishnamoorthy, Sven Leyffer, Dean Liberty, Subhasish Mitra, Todd Munson, Rob Schreiber, Jon Stearley, and Eric Van Hensbergen. Addressing failures in exascale computing. The Int. Journal of High Performance Computing Applications (IJHPCA), 28(2):129–173, 2014.
[3] Thomas Herault and Yves Robert, editors. Fault-Tolerance Techniques for High-Performance Computing. Springer, 2015.
[4] Al Geist. How to kill a supercomputer: Dirty power, cosmic rays, and bad solder. IEEE Spectrum, 10:2–3, 2016. URL: https://spectrum.ieee.org/computing/hardware/how-to-kill-a-supercomputer-dirty-power-cosmic-rays-and-bad-solder.
[5] Faisal Shahzad, Markus Wittmann, Moritz Kreutzer, Thomas Zeise, Georg Hager, and Gerhard Wellein. A survey of checkpoint/restart techniques on distributed memory systems. Parallel Processing Letters (PPL), 23(4):1340011–1340030, 2013.
Cited by: 2 articles.