Assessing General-Purpose Algorithms to Cope with Fail-Stop and Silent Errors

Authors:

Benoit Anne (1), Cavelan Aurélien (1), Robert Yves (2), Sun Hongyang (1)

Affiliations:

1. École Normale Supérieure de Lyon, CNRS & INRIA, France

2. École Normale Supérieure de Lyon, CNRS & INRIA, France, and University of Tennessee Knoxville

Abstract

In this article, we combine the traditional checkpointing and rollback recovery strategies with verification mechanisms to cope with both fail-stop and silent errors. The objective is to minimize makespan and/or energy consumption. For divisible load applications, we use first-order approximations to find the optimal checkpointing period to minimize execution time, with an additional verification mechanism to detect silent errors before each checkpoint, hence extending the classical formula by Young and Daly for fail-stop errors only. We further extend the approach to include intermediate verifications, and to consider a bicriteria problem involving both time and energy (linear combination of execution time and energy consumption). Then, we focus on application workflows whose dependence graph is a linear chain of tasks. Here, we determine the optimal checkpointing and verification locations, with or without intermediate verifications, for the bicriteria problem. Rather than using a single speed during the whole execution, we further introduce a new execution scenario, which allows for changing the execution speed via Dynamic Voltage and Frequency Scaling (DVFS). In this latter scenario, we determine the optimal checkpointing and verification locations, as well as the optimal speed pairs for each task segment between any two consecutive checkpoints. Finally, we conduct an extensive set of simulations to support the theoretical study, and to assess the performance of each algorithm, showing that the best overall performance is achieved under the most flexible scenario using intermediate verifications and different speeds.
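As background for the "classical formula by Young and Daly" mentioned above, the following is a minimal sketch of that first-order result for fail-stop errors only: the work between two checkpoints is approximately sqrt(2 × C × MTBF), where C is the checkpoint cost. The function name, parameter names, and numeric values are illustrative assumptions, not taken from the article, and the article's extended formula (with verification and silent errors) is not reproduced here.

```python
import math


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order Young/Daly checkpointing period (fail-stop errors only).

    checkpoint_cost: time to take one checkpoint, in seconds (hypothetical name).
    mtbf: platform mean time between failures, in seconds (hypothetical name).
    Returns the approximate optimal amount of work between checkpoints.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    # Classical first-order approximation: W = sqrt(2 * C * MTBF).
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


# Illustrative values only: a 60 s checkpoint and a one-day MTBF.
period = young_daly_period(checkpoint_cost=60.0, mtbf=86400.0)
print(f"checkpoint roughly every {period:.0f} s of work")
```

The approximation is only meaningful when the checkpoint cost is small compared with the MTBF, which is the regime first-order analyses assume.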

Funder

Agence Nationale de la Recherche

Publisher

Association for Computing Machinery (ACM)

Subject

Computational Theory and Mathematics, Computer Science Applications, Hardware and Architecture, Modeling and Simulation, Software

Cited by 6 articles.

1. Checkpointing strategies to tolerate non-memoryless failures on HPC platforms;ACM Transactions on Parallel Computing;2023-09-22

2. Checkpointing Workflows à la Young/Daly Is Not Good Enough;ACM Transactions on Parallel Computing;2022-12-16

3. Checkpointing à la Young/Daly: An Overview;Proceedings of the 2022 Fourteenth International Conference on Contemporary Computing;2022-08-04

4. A generic approach to scheduling and checkpointing workflows;The International Journal of High Performance Computing Applications;2019-08-12

5. Multi-level checkpointing and silent error detection for linear workflows;Journal of Computational Science;2018-09
