A step towards the final frontier: Lessons learned from acceptance testing of the first HPE/Cray EX 3000 system at ORNL
Published: 2023-10-21
Issue: 3
Volume: 36
Page:
ISSN: 1532-0626
Container-title: Concurrency and Computation: Practice and Experience
Language: en
Short-container-title: Concurrency and Computation
Author:
Melesse Vergara, Verónica G. (1),
Budiardja, Reuben (1),
Peltz, Paul (1),
Niles, Jeffery (1),
Zimmer, Christopher (1),
Dietz, Daniel (1),
Fuson, Christopher (1),
Liu, Hong (1),
Newman, Paul (1),
Simmons, James (1),
Muzyn, Christopher (1)
Affiliation:
1. National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
Abstract
In this article, we summarize the deployment of the Air Force Weather (AFW) HPC11 system at Oak Ridge National Laboratory (ORNL), including the process followed to complete acceptance testing of the system. HPC11 is the first HPE/Cray EX 3000 system to be successfully released to its user community in a federal facility. It consists of two identical 800-node supercomputers, Fawbush and Miller, with access to two independent and identical Lustre parallel file systems. HPC11 is equipped with Slingshot 10 interconnect technology and relies on the HPE Performance Cluster Manager software for system configuration. ORNL has a clearly defined acceptance testing process used to ensure that every new system deployed can provide the capabilities needed to support user workloads. We worked closely with HPE and AFW to develop a set of tests that used the United Kingdom Meteorological Office's Unified Model and 4-dimensional variational data assimilation. We also included benchmarks and applications from the Oak Ridge Leadership Computing Facility portfolio to fully exercise the HPE/Cray programming environment and to evaluate the functionality and performance of the system. Acceptance testing of HPC11 required parallel execution of each element on Fawbush and Miller. In addition, careful coordination was needed to ensure successful acceptance of the newly deployed Lustre file systems alongside the compute resources. In this work, we present test results from specific system components and provide an overview of the issues identified, the challenges encountered, and the lessons learned along the way.
Funder
Oak Ridge National Laboratory
Subject
Computational Theory and Mathematics, Computer Networks and Communications, Computer Science Applications, Theoretical Computer Science, Software