Affiliation:
1. ETH Zürich, Zurich, Switzerland
Abstract
Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network transport and routing protocols, which may introduce network noise that can ultimately limit application scaling. This work analyzes network performance, scalability, and cost of running HPC workloads on cloud systems. First, we consider latency, bandwidth, and collective communication patterns in detailed small-scale measurements, and then we simulate network performance at a larger scale. We validate our approach on four popular cloud providers and three on-premise HPC systems, showing that network (and also OS) noise can significantly impact performance and cost at both small and large scale.
Funder
European Research Council
HORIZON EUROPE Framework Programme
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications; Hardware and Architecture; Safety, Risk, Reliability and Quality; Computer Science (miscellaneous)
Reference
77 articles.
1. Top 500. 2022. Top 500 List. https://www.top500.org/. Accessed: 31-Mar-2022.
2. LogGP: Incorporating Long Messages into the LogP Model for Parallel Computation
3. CONGA
4. Benchmarking Microsoft Azure Virtual Machines for the use of HPC applications
5. Bob Alverson, Edwin Froese, Larry Kaplan, and Duncan Roweth. 2012. Cray XC series network. Cray Inc., White Paper WP-Aries01-1112 (2012).
Cited by
8 articles.
1. Software Resource Disaggregation for HPC with Serverless Computing; 2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS); 2024-05-27
2. ExDe: Design space exploration of scheduler architectures and mechanisms for serverless data-processing; Future Generation Computer Systems; 2024-04
3. Canary: Congestion-aware in-network allreduce using dynamic trees; Future Generation Computer Systems; 2024-03
4. HEAR: Homomorphically Encrypted Allreduce; Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; 2023-11-11
5. Analytical Approaches to QoS Analysis and Performance Modelling in Fog Computing; Multi-Disciplinary Applications of Fog Computing; 2023-08-03