Affiliation:
1. College of Computer, National University of Defense Technology, Changsha, China
2. Computational Aerodynamics Institute, China Aerodynamics Research and Development Center, Mianyang, China
3. State Key Laboratory of Aerodynamics, China Aerodynamics Research and Development Center, Mianyang, China
Abstract
Supercomputers are advanced computing systems composed of independent computational nodes interconnected through high‐speed communication networks. In the big data era, their computational power plays a pivotal role in scientific computing. Although supercomputers run numerous advanced computational science and engineering tasks, many submitted jobs fail for various reasons. These failures waste users' time, consume system resources, and reduce the overall efficiency of the system. Previous research often couples job performance features with a single machine learning method to predict job failure. However, these features are costly to gather, which limits their real‐world applicability. To address this challenge, our study establishes correlations among job applications through extensive job log analysis. Leveraging these correlations, we propose a predictive framework based on job application sequence correlation (called FP‐JSC). The framework employs multiple machine learning models to offer holistic predictions and selects the most suitable model based on its learning effectiveness. Moreover, it reduces the cost of feature collection without adversely affecting job execution. We identify job applications using both job paths and job names; the job path is a novel feature derived from supplementary monitoring data. Empirical results demonstrate FP‐JSC's effectiveness: it accurately identifies over 89% of jobs, with 95% specificity and 89% sensitivity, outperforming the single prediction methods employed in related works.
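As a rough illustration of the two ideas the abstract combines (reporting sensitivity and specificity, and choosing among several candidate models), the following Python sketch may help. It is not the FP‐JSC implementation. The toy predictors, the single "previous job of the same application failed" feature, the sample data, and the use of balanced accuracy as the selection criterion are all assumptions made for illustration; the abstract only states that the model is chosen "based on its learning effectiveness".

```python
from typing import Callable, Dict, Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity); label 1 = failed job, 0 = successful job."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


def select_best_model(
    candidates: Dict[str, Callable[[Tuple[int, ...]], int]],
    x_val: Sequence[Tuple[int, ...]],
    y_val: Sequence[int],
) -> Tuple[str, float]:
    """Pick the candidate with the highest balanced accuracy on validation data.

    Balanced accuracy is an assumed stand-in for "learning effectiveness".
    """
    best_name, best_score = "", -1.0
    for name, predict in candidates.items():
        y_pred = [predict(x) for x in x_val]
        sens, spec = sensitivity_specificity(y_val, y_pred)
        score = (sens + spec) / 2.0
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Hypothetical validation set: each sample holds one feature, namely whether
# the previous job of the same application failed (1) or succeeded (0).
x_val = [(1,), (0,), (1,), (0,), (0,), (1,)]
y_val = [1, 0, 1, 0, 1, 0]

# Hypothetical candidate predictors standing in for trained models.
candidates = {
    "always_success": lambda x: 0,
    "repeat_previous": lambda x: x[0],
}

best_name, best_score = select_best_model(candidates, x_val, y_val)
print(best_name, round(best_score, 3))
```

In a real setting, the candidates would be trained classifiers and the features would come from job logs (for example, job path and job name), but the selection loop would follow the same shape.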
Funder
National Natural Science Foundation of China