Task-Level Checkpointing and Localized Recovery to Tolerate Permanent Node Failures for Nested Fork–Join Programs in Clusters

Published:2024-03-13 Issue:3 Volume:5
ISSN:2661-8907
Container-title:SN Computer Science
language:en
Short-container-title:SN COMPUT. SCI.

Author:

Reitz Lukas, Fohry Claudia

Abstract

Exascale supercomputers consist of millions of processing units, and this number is still growing. Therefore, hardware failures, such as permanent node failures, become increasingly frequent. They can be tolerated with system-level Checkpoint/Restart, which saves the whole application state transparently and, if needed, restarts the application from the saved state; or with application-level checkpointing, which saves only relevant data via explicit calls in the program. The former approach requires no additional programming effort, whereas the latter is more efficient and allows program execution to continue on the intact resources after failures (localized shrinking recovery). An increasingly popular programming paradigm is asynchronous many-task (AMT) programming. Here, programmers identify parallel tasks, and a runtime system assigns the tasks to worker threads. Since tasks have clearly defined interfaces, the runtime system can automatically extract and save their interface data. This approach, called task-level checkpointing (TC), combines the respective strengths of system-level and application-level checkpointing. AMTs come in many variants, and so far, TC has been applied to only a few, rather simple variants. This paper considers TC for a different AMT variant: nested fork–join (NFJ) programs that run on clusters of multicore nodes under work stealing. We present the first TC scheme for this setting. It performs a localized shrinking recovery and can handle multiple node failures. In experiments with four benchmarks, we observed execution time overheads of around 44% at 1536 workers, and negligible recovery costs. Additionally, we developed and experimentally validated a prediction model for the running times of the scheme.

Funder

Deutsche Forschungsgemeinschaft

Universität Kassel

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s42979-024-02624-8.pdf

References (53 articles)

1. Ansel J, Arya K, Cooperman G. DMTCP: transparent checkpointing for cluster computations and the desktop. In: Proceedings international parallel and distributed processing symposium (IPDPS). IEEE. 2009. pp. 1–12. https://doi.org/10.1109/ipdps.2009.5161063.

2. Augonnet C, Thibault S, Namyst R, et al. StarPU: a unified platform for task scheduling on heterogeneous multicore architectures. Concurr Comput Pract Exp. 2011;23:187–98. https://doi.org/10.1002/cpe.1631.

3. Bautista-Gomez L, Tsuboi S, Komatitsch D, et al. FTI: High performance fault tolerance interface for hybrid systems. In: Proceedings international conference for high performance computing, networking, storage and analysis (SC). ACM. 2011. pp. 1–32. https://doi.org/10.1145/2063384.2063427.

4. Benoit A, Herault T, Fèvre VL, et al. Replication is more efficient than you think. In: Proceedings international conference for high performance computing, networking, storage and analysis (SC). ACM. 2019. pp. 1–14. https://doi.org/10.1145/3295500.3356171.

5. Blumofe RD, Leiserson CE. Scheduling multithreaded computations by work stealing. J ACM. 1999;46(5):720–48. https://doi.org/10.1145/324133.324234.