Queue congestion prediction for large-scale high performance computing systems using a hidden Markov model

Author:

Park Ju-Won, Kwon Min-Woo, Hong Taeyoung

Abstract

To share limited, large-capacity resources, the high-performance computing field relies on batch job schedulers that allocate available resources to jobs. When resources are insufficient, a job therefore waits in the queue until resources become available. Predicting this queue waiting time is very useful for improving overall resource utilization. However, queue waiting time is difficult to predict because it is significantly affected by many factors, such as the scheduling algorithm applied and the characteristics of the executed jobs. This study examines a method of predicting queue waiting time using only the historical log data created by the batch job scheduler. Specifically, a prediction method based on a hidden Markov model is proposed. It has three stages. First, outliers are removed by an outlier detection algorithm that uses a statistics-based parametric method. Second, the parameters of the hidden states are estimated from the observed queue waiting time sequence in the historical job log. Third, the queue waiting interval at time $t+1$ is predicted using the parameters estimated at time $t$. Experimental results show that, compared with other prediction methods, the proposed algorithm improves prediction accuracy by up to 60%.
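The first and third stages can be illustrated with a minimal sketch. It is not the authors' implementation: the z-score filter stands in for the unspecified "statistics-based parametric method", the 300-second interval edge is arbitrary, and the two-state parameters (`start`, `trans`, `emit`) are hand-picked, hypothetical values rather than parameters estimated from a job log (the second stage, for example Baum-Welch estimation, is omitted here). The prediction step is the standard normalized forward algorithm followed by one transition.

```python
import numpy as np

def remove_outliers(waits, z_max=3.0):
    """Drop values whose z-score exceeds z_max (assumed outlier rule)."""
    waits = np.asarray(waits, dtype=float)
    std = waits.std()
    if std == 0:
        return waits
    z = np.abs(waits - waits.mean()) / std
    return waits[z <= z_max]

def to_intervals(waits, edges):
    """Map each waiting time to the index of the interval it falls in."""
    return np.digitize(waits, edges)

def predict_next_interval(obs, start, trans, emit):
    """Return P(interval at t+1 | observations up to t) for a discrete HMM."""
    start, trans, emit = (np.asarray(a, dtype=float) for a in (start, trans, emit))
    # Forward pass, normalized at each step to avoid underflow.
    alpha = start * emit[:, obs[0]]
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        alpha /= alpha.sum()
    next_state = alpha @ trans   # hidden-state distribution at t+1
    return next_state @ emit     # interval distribution at t+1

# Hypothetical waiting times in seconds; 86400 is an obvious outlier.
waits_sec = [30, 45, 20, 60, 50, 40, 35, 55, 25, 30, 45, 50,
             86400, 700, 900, 800, 750]
clean = remove_outliers(waits_sec)
obs = to_intervals(clean, edges=[300])   # 0 = under 300 s, 1 = 300 s or more

# Hypothetical parameters: hidden state 0 = "idle", 1 = "congested".
start = [0.6, 0.4]
trans = [[0.9, 0.1], [0.2, 0.8]]
emit = [[0.95, 0.05], [0.1, 0.9]]

probs = predict_next_interval(obs, start, trans, emit)
print("P(next wait interval):", probs)
```

Because the most recent observations in this toy sequence all fall in the long interval, the predicted distribution puts most of its mass on the long interval as well. In practice the parameters would be re-estimated from the log as new jobs complete.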

Publisher

Springer Science and Business Media LLC

Subject

Hardware and Architecture, Information Systems, Theoretical Computer Science, Software

Cited by 6 articles.

1. Quantifying Uncertainty in HPC Job Queue Time Predictions;Practice and Experience in Advanced Research Computing 2024: Human Powered Computing;2024-07-17

2. Tandem Predictions for HPC jobs;Practice and Experience in Advanced Research Computing 2024: Human Powered Computing;2024-07-17

3. Predicting accurate batch queue wait times on production supercomputers by combining machine learning techniques;Concurrency and Computation: Practice and Experience;2024-04-11

4. An Empirical Design and Implementation of Job Scheduling Enhancement for Kubernetes Clusters;2024 International Conference on Information Networking (ICOIN);2024-01-17

5. Approbation of Methods for Supercomputer Job Queue Wait Time Estimation;Lobachevskii Journal of Mathematics;2023-08
