A step towards the final frontier: Lessons learned from acceptance testing of the first HPE/Cray EX 3000 system at ORNL

Author:

Melesse Vergara, Verónica G. (1); Budiardja, Reuben (1); Peltz, Paul (1); Niles, Jeffery (1); Zimmer, Christopher (1); Dietz, Daniel (1); Fuson, Christopher (1); Liu, Hong (1); Newman, Paul (1); Simmons, James (1); Muzyn, Christopher (1)

Affiliation:

1. National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA

Abstract

In this article, we summarize the deployment of the Air Force Weather (AFW) HPC11 system at Oak Ridge National Laboratory (ORNL), including the process followed to complete acceptance testing of the system. HPC11 is the first HPE/Cray EX 3000 system to be successfully released to its user community in a federal facility. HPC11 consists of two identical 800‐node supercomputers, Fawbush and Miller, with access to two independent and identical Lustre parallel file systems. HPC11 is equipped with Slingshot 10 interconnect technology and relies on the HPE Performance Cluster Manager software for system configuration. ORNL has a clearly defined acceptance testing process used to ensure that every newly deployed system can provide the capabilities needed to support user workloads. We worked closely with HPE and AFW to develop a set of tests that used the United Kingdom Meteorological Office's Unified Model and 4‐dimensional variational data assimilation. We also included benchmarks and applications from the Oak Ridge Leadership Computing Facility portfolio to fully exercise the HPE/Cray programming environment and evaluate the functionality and performance of the system. Acceptance testing of HPC11 required parallel execution of each element on Fawbush and Miller. In addition, careful coordination was needed to ensure successful acceptance of the newly deployed Lustre file systems alongside the compute resources. In this work, we present test results from specific system components and provide an overview of the issues identified, the challenges encountered, and the lessons learned along the way.

Funder

Oak Ridge National Laboratory

Publisher

Wiley

Subject

Computational Theory and Mathematics, Computer Networks and Communications, Computer Science Applications, Theoretical Computer Science, Software

