Complex scientific applications made fault-tolerant with the sparse grid combination technique

Published: 2016-07-27 Issue: 3 Volume: 30 Page: 335-359
ISSN: 1094-3420
Container-title: The International Journal of High Performance Computing Applications
Language: en
Short-container-title: The International Journal of High Performance Computing Applications

Author:

Ali Md Mohsin¹, Strazdins Peter E¹, Harding Brendan², Hegland Markus²

Affiliation:

1. Research School of Computer Science, The Australian National University, Canberra, Australia

2. Mathematical Sciences Institute, The Australian National University, Canberra, Australia

Abstract

Ultra-large-scale simulations that solve partial differential equations (PDEs) require very large computational systems for their timely solution. Studies have shown that the rate of failure grows with system size, and these trends are likely to worsen in future machines. Thus, as systems and the problems solved on them continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT), which is a cost-effective method for solving higher-dimensional PDEs, can be easily modified to provide algorithm-based fault tolerance. In this article, we describe how the SGCT can produce fault-tolerant versions of the Gyrokinetic Electromagnetic Numerical Experiment plasma application, the Taxila Lattice Boltzmann Method application, and the Solid Fuel Ignition application. We use an alternate component grid combination formula, adding some redundancy to the SGCT, to recover data from lost processes. User-level failure mitigation (ULFM) message passing interface (MPI) is used to recover the processes, and our implementation is robust over multiple failures and recoveries (of processes and nodes). An acceptable degree of modification of the applications is required. Results using the 2-D SGCT show competitive execution times with acceptable error (within 0.1% to 1.0%) compared to the same simulation with a single full-resolution grid. The benefits improve when the 3-D SGCT is used. Experiments show the applications' ability to successfully recover from multiple failures, and applying the SGCT multiple times reduces the error of the computed solution. Process recovery via ULFM MPI increases from approximately 1.5 sec at 64 cores to approximately 5 sec at 2048 cores for a one-off failure. By comparison, the applications' built-in checkpointing with job restart, used in conjunction with the classical SGCT on failure, has overheads four times as large for a single failure, excluding the recomputation overhead. An analysis for a long-running application that takes recomputation times into account indicates a reduction in overhead of over an order of magnitude.

Publisher

SAGE Publications

Subject

Hardware and Architecture, Theoretical Computer Science, Software

Link

http://journals.sagepub.com/doi/pdf/10.1177/1094342015628056

References: 43 articles.

1. Tofu: A 6D Mesh/Torus Interconnect for Exascale Computers

2. A fault-tolerant gyrokinetic plasma application using the sparse grid combination technique

3. Application Level Fault Recovery: Using Fault-Tolerant Open MPI in a PDE Solver

4. Mathematical Problems from Combustion Theory

Cited by 15 articles.

1. Taking the MPI standard and the Open MPI library to exascale; The International Journal of High Performance Computing Applications; 2024-07-23

2. A Dimension-Oblivious Domain Decomposition Method Based on Space-Filling Curves; SIAM Journal on Scientific Computing; 2023-04-07

3. Response of HPC hardware to neutron radiation at the dawn of exascale; The Journal of Supercomputing; 2023-03-30

4. ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms; 2022 IEEE/ACM 12th Workshop on Fault Tolerance for HPC at eXtreme Scale (FTXS); 2022-11

5. Resiliency in numerical algorithm design for extreme scale simulations; The International Journal of High Performance Computing Applications; 2021-12-10