Analysis and prediction of performance variability in large-scale computing systems

Published:2024-03-28 Issue:10 Volume:80 Page:14978-15005
ISSN:0920-8542
Container-title:The Journal of Supercomputing
language:en
Short-container-title:J Supercomput

Author:

Salimi Beni Majid, Hunold Sascha, Cosenza Biagio

Abstract

The development of new exascale supercomputers has dramatically increased the need for fast, high-performance networking technology. Efficient network topologies, such as Dragonfly+, have been introduced to meet the demands of data-intensive applications and to match the massive computing power of GPUs and accelerators. However, these supercomputers still face performance variability mainly caused by the network that affects system and application performance. This study comprehensively analyzes performance variability on a large-scale HPC system with Dragonfly+ network topology, focusing on factors such as communication patterns, message size, job placement locality, MPI collective algorithms, and overall system workload. The study also proposes an easy-to-measure metric for estimating network background traffic generated by other users, which can be used to estimate the performance of our job accurately. The insights gained from this study contribute to improving performance predictability, enhancing job placement policies and MPI algorithm selection, and optimizing resource management strategies in supercomputers.

Funder

Università degli Studi di Salerno

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s11227-024-06040-w.pdf

References (88 articles)

1. Thoman P, Salzmann P, Cosenza B, Fahringer T (2019) Celerity: high-level C++ for accelerator clusters. In: Euro-Par 2019: Parallel Processing: 25th International Conference on Parallel and Distributed Computing, Göttingen, Germany, August 26–30, 2019, Proceedings 25. Springer, pp 291–303

2. Sojoodi AH, Salimi Beni M, Khunjush F (2021) Ignite-gpu: a gpu-enabled in-memory computing architecture on clusters. J Supercomput 77:3165–3192

3. Bhattacharjee A, Wells J (2021) Preface to special topic: building the bridge to the exascale-applications and opportunities for plasma physics. Phys Plasmas 28(9):090401

4. Träff JL, Lübbe FD, Rougier A, Hunold S (2015) Isomorphic, sparse MPI-like collective communication operations for parallel stencil computations. In: Proceedings of the 22nd European MPI Users’ Group Meeting, pp 1–10

5. Salzmann P, Knorr F, Thoman P, Cosenza B (2022) Celerity: how (well) does the sycl api translate to distributed clusters? In: International workshop on OpenCL, pp 1–2