Affiliation:
1. Laboratoire LIP, ENS Lyon, Lyon, France
2. Department of Mathematics, Tongji University, Shanghai, China
3. Project TADAAM, Inria Bordeaux, Bordeaux, France
4. Innovative Computing Laboratory, University of Tennessee, Knoxville, Tennessee, USA
Abstract
After a machine failure, batch schedulers typically re-schedule the failed job with a high priority. This is fair to the failed job, but the job must still re-enter the submission queue and wait for enough resources to become available. The wait can be very long when the job is large and the platform is highly loaded, as is typical of HPC platforms. We propose another strategy: when a job fails and no platform node is available, we steal one node from another job and use it to continue the execution of the failed job despite the failure. In this work, we give a detailed assessment of this node stealing strategy using traces from the Mira supercomputer at Argonne National Laboratory. The main conclusion is that node stealing improves the utilization of the platform and dramatically reduces the flow of large jobs, at the price of slightly increasing the flow of small jobs.
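The node stealing decision described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `Job` class, the `handle_node_failure` function, and in particular the victim-selection rule (take a node from the running job that holds the fewest nodes) are assumptions made here for illustration only, since the abstract does not specify how the victim job is chosen or how it is handled afterwards.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """Hypothetical job record: a name and the set of node ids it holds."""
    name: str
    nodes: set


def handle_node_failure(failed_job, dead_node, free_nodes, running_jobs):
    """Find a replacement for `dead_node` so that `failed_job` can continue.

    Returns (replacement_node, victim_job). `victim_job` is None when a free
    node was used; both values are None when no replacement could be found.
    """
    failed_job.nodes.discard(dead_node)

    # Case 1: a free platform node exists, so no stealing is needed.
    if free_nodes:
        replacement = min(free_nodes)  # deterministic choice for the sketch
        free_nodes.remove(replacement)
        failed_job.nodes.add(replacement)
        return replacement, None

    # Case 2: no free node, so steal one node from another running job.
    # ASSUMPTION: the victim is the job holding the fewest nodes.
    candidates = [j for j in running_jobs if j is not failed_job and j.nodes]
    if not candidates:
        return None, None
    victim = min(candidates, key=lambda j: len(j.nodes))
    replacement = min(victim.nodes)
    victim.nodes.remove(replacement)
    failed_job.nodes.add(replacement)
    # What happens to the victim (for example, re-queueing it) is left to the
    # caller; the abstract does not describe that step.
    return replacement, victim


big = Job("big", {0, 1, 2, 3})
small = Job("small", {4})
mid = Job("mid", {5, 6})
node, victim = handle_node_failure(big, 2, set(), [big, small, mid])
print(node, victim.name, sorted(big.nodes))
```

In this example the platform has no free node, so the large job continues on a node taken from the smallest running job instead of returning to the submission queue.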