Abstract
Resource demands of HPC applications vary significantly. However, HPC systems commonly assign resources on a per-node basis to prevent interference from co-located workloads. This gap between coarse-grained resource allocation and varying resource demands can leave HPC resources underutilized. In this study, we analyze the resource usage and application behavior of NERSC's Perlmutter, a state-of-the-art open-science HPC system with both CPU-only and GPU-accelerated nodes. Our one-month usage analysis reveals that CPUs are often not fully utilized, especially for GPU-enabled jobs. Around 64% of both CPU-only and GPU-enabled jobs used 50% or less of the available host memory capacity, and about 50% of GPU-enabled jobs used at most 25% of the GPU memory; across all job types, memory capacity was underutilized in some respect. While our study comes early in Perlmutter's lifetime, so policies and application workloads may change, it provides valuable insights into performance characterization and application behavior, and motivates systems with more fine-grained resource allocation.
Publisher
Springer Nature Switzerland
Cited by 8 articles.
1. Tandem Predictions for HPC jobs. In: Practice and Experience in Advanced Research Computing 2024: Human Powered Computing (2024-07-17)
2. Scheduling and Allocation of Disaggregated Memory Resources in HPC Systems. In: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (2024-05-27)
3. DUST: Resource-Aware Telemetry Offloading with a Distributed Hardware-Agnostic Approach. In: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) (2024-05-27)
4. Agile-DRAM: Agile Trade-Offs in Memory Capacity, Latency, and Energy for Data Centers. In: 2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA) (2024-03-02)
5. A Data-driven Analysis of a Cloud Data Center: Statistical Characterization of Workload, Energy and Temperature. In: Proceedings of the IEEE/ACM 16th International Conference on Utility and Cloud Computing (2023-12-04)