Resiliency in numerical algorithm design for extreme scale simulations

Author:

Emmanuel Agullo (1), Mirco Altenbernd (2), Hartwig Anzt (3), Leonardo Bautista-Gomez (4), Tommaso Benacchio (5), Luca Bonaventura (5), Hans-Joachim Bungartz (6), Sanjay Chatterjee (7), Florina M. Ciorba (8), Nathan DeBardeleben (9), Daniel Drzisga (6), Sebastian Eibl (10), Christian Engelmann (11), Wilfried N. Gansterer (12), Luc Giraud (1), Dominik Göddeke (2), Marco Heisig (10), Fabienne Jézéquel (13), Nils Kohl (10), Xiaoye Sherry Li (14), Romain Lion (15), Miriam Mehl (2), Paul Mycek (16), Michael Obersteiner (6), Enrique S. Quintana-Ortí (17), Francesco Rizzi (18), Ulrich Rüde (10, 16), Martin Schulz (6), Fred Fung (19), Robert Speck (20), Linda Stals (19), Keita Teranishi (21), Samuel Thibault (15), Dominik Thönnes (10), Andreas Wagner (6), Barbara Wohlmuth (6)

Affiliation:

1. Inria, France

2. Universität Stuttgart, Germany

3. KIT – Karlsruher Institut für Technologie, Germany

4. Barcelona Supercomputing Center, Spain

5. Politecnico di Milano, Italy

6. TU München, Germany

7. NVIDIA Corporation, USA

8. Universität Basel, Switzerland

9. Los Alamos National Laboratory, USA

10. Universität Erlangen-Nürnberg, Germany

11. Oak Ridge National Laboratory, USA

12. University of Vienna, Austria

13. Université Paris 2, Paris, France

14. Lawrence Berkeley National Laboratory, USA

15. University of Bordeaux, France

16. Cerfacs, France

17. Universitat Politècnica de València, Spain

18. NexGen Analytics, USA

19. Australian National University, Australia

20. Forschungszentrum Jülich GmbH, Germany

21. Sandia National Laboratories, California, USA

Abstract

This work is based on the seminar titled ‘Resiliency in Numerical Algorithm Design for Extreme Scale Simulations’ held March 1–6, 2020, at Schloss Dagstuhl, which was attended by all the authors. Advanced supercomputing is characterized by very high computation speeds, achieved at the cost of enormous resource and energy demands. A typical large-scale computation running for 48 h on a system consuming 20 MW, as predicted for exascale systems, would consume a million kWh, corresponding to about 100k Euro in energy cost for executing 10²³ floating-point operations. It is clearly unacceptable to lose the whole computation if any of the several million parallel processes fails during the execution. Moreover, if a single operation suffers from a bit-flip error, should the whole computation be declared invalid? What about the notion of reproducibility itself: should this core paradigm of science be revised and refined for results that are obtained by large-scale simulation? Naive versions of conventional resilience techniques will not scale to the exascale regime: with a main memory footprint of tens of petabytes, synchronously writing checkpoint data all the way to background storage at frequent intervals will create intolerable overheads in runtime and energy consumption. Forecasts show that the mean time between failures could be lower than the time to recover from such a checkpoint, so that large calculations at scale might not make any progress if robust alternatives are not investigated. More advanced resilience techniques must be devised. The key may lie in exploiting both advanced system features and specific application knowledge. Research will face two essential questions: (1) what are the reliability requirements for a particular computation and (2) how do we best design the algorithms and software to meet these requirements? While the analysis of use cases can help understand the particular reliability requirements, the construction of remedies is currently wide open. One avenue would be to refine and improve on system- or application-level checkpointing and rollback strategies in case an error is detected. Developers might use fault notification interfaces and flexible runtime systems to respond to node failures in an application-dependent fashion. Novel numerical algorithms or more stochastic computational approaches may be required to meet accuracy requirements in the face of undetectable soft errors. These ideas constituted an essential topic of the seminar. The goal of this Dagstuhl Seminar was to bring together a diverse group of scientists with expertise in exascale computing to discuss novel ways to make applications resilient against detected and undetected faults. In particular, participants explored the role that algorithms and applications play in the holistic approach needed to tackle this challenge. This article gathers a broad range of perspectives on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations. The ultimate goal is to spark novel ideas and encourage the development of concrete solutions for achieving such resilience holistically.
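
As a rough sanity check of the figures quoted above, the following minimal Python sketch reproduces the back-of-envelope arithmetic; the electricity price of 0.10 Euro/kWh and the sustained rate of 1 exaflop/s are illustrative assumptions, not values taken from the seminar report.

# Back-of-envelope check of the abstract's energy and operation-count figures.
# Assumed (not from the paper): electricity at 0.10 Euro/kWh, sustained 1 exaflop/s.
power_mw = 20                         # exascale-class system power draw, MW
hours = 48                            # run time, h
energy_kwh = power_mw * 1000 * hours  # 960,000 kWh, i.e. about a million kWh
cost_eur = energy_kwh * 0.10          # roughly 100k Euro at the assumed price
flops = 1e18 * hours * 3600           # about 1.7e23 floating-point operations
print(f"{energy_kwh:,.0f} kWh, ~{cost_eur:,.0f} EUR, ~{flops:.1e} flops")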

Publisher

SAGE Publications

Subject

Hardware and Architecture, Theoretical Computer Science, Software

Cited by 2 articles.

1. Fault-Tolerant Parallel Multigrid Method on Unstructured Adaptive Mesh. SIAM Journal on Scientific Computing, 2024-06-06.

2. FT-GCR: A fault-tolerant generalized conjugate residual elliptic solver. Journal of Computational Physics, 2022-04.
