Combining Checkpointing and Replication for Reliable Execution of Linear Workflows with Fail-Stop and Silent Errors

Author:

Benoit Anne (1), Cavelan Aurélien (2), Ciorba Florina M. (2), Fèvre Valentin Le (1), Robert Yves (1, 3)

Affiliation:

1. ENS Lyon

2. University of Basel

3. University of Tennessee

Publisher

IJNC Editorial Committee

References: 16 articles.

1. [9] E. S. Buneci. Qualitative Performance Analysis for Large-Scale Scientific Workflows. PhD thesis, Duke University, 2008.

2. [11] Franck Cappello, Al Geist, William Gropp, Sanjay Kale, Bill Kramer, and Marc Snir. Toward Exascale Resilience: 2014 update. Supercomputing Frontiers and Innovations, 1(1), 2014.

3. [17] Sheng Di, Yves Robert, Frederic Vivien, and Franck Cappello. Toward an optimal online checkpoint solution under a two-level HPC checkpoint model. IEEE Trans. Parallel & Distributed Systems, 2016.

4. [18] James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt Ferreira, and Christian Engelmann. Combining partial redundancy and checkpointing for HPC. In ICDCS. IEEE, 2012.

5. [21] C. Engelmann, H. H. Ong, and S. L. Scott. The case for modular redundancy in large-scale high performance computing systems. In PDCN. IASTED, 2009.

Cited by 7 articles.

1. A survey on checkpointing strategies: Should we always checkpoint à la Young/Daly?;Future Generation Computer Systems;2024-12

2. HGR: A Hybrid Global Graph-Based Recovery Approach for Cloud Storage Systems with Failure and Straggler Nodes;2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS);2024-07-23

3. Allocation and Scheduling of Linear Workflows Incorporating Security Constraints Across Fog and Cloud Infrastructures;2024 International Conference on Computer, Information and Telecommunication Systems (CITS);2024-07-17

4. Scheduling Different Types of Linear Workflows with Partial Computations in a Distributed System;2023 International Conference on Communications, Computing, Cybersecurity, and Informatics (CCCI);2023-10-18

5. Security-Aware Orchestration of Linear Workflows on Distributed Resources;2022 International Conference on Computer, Information and Telecommunication Systems (CITS);2022-07-13
