Physics Data Production on HPC: Experience to be efficiently running at scale

Published: 2020 Volume: 245 Page: 09003
ISSN: 2100-014X
Container-title: EPJ Web of Conferences
Short-container-title: EPJ Web Conf.

Author:

Poat M D,Lauret J,Porter J,Balewski J

Abstract

The Solenoidal Tracker at RHIC (STAR) is a multinationally supported experiment located at Brookhaven National Laboratory and is currently the only experiment still running at RHIC. The raw physics data captured from the detector is on the order of tens of PBytes per data acquisition campaign, which places STAR well within the definition of a big data science experiment. Data production has typically used a High Throughput Computing (HTC) approach, run either on a local farm or on Grid computing resources. In particular, all embedding simulations (a complex workflow mixing real and simulated events) have been run on standard Linux resources at NERSC’s Parallel Distributed Systems Facility (PDSF). However, as of April 2019 PDSF has been retired, and High Performance Computing (HPC) resources such as the Cray XC-40 supercomputer known as “Cori” have become available for STAR’s data production as well as embedding. STAR has been the very first experiment to show the feasibility of running a sustainable data production campaign on this computing resource. In this contribution, we share with the community best practices for using such a resource efficiently. The use of Docker containers with Shifter is the standard approach to running on HPC at NERSC; it encapsulates the environment in which a standard STAR workflow runs. From the deployment of a tailored Scientific Linux environment (with the set of libraries and special configurations required for STAR to run) to the deployment of third-party software and the STAR-specific software stack, we have learned that it has become impractical to rely on a set of containers, each comprising a specific software release. To this end, a solution based on the CernVM File System (CVMFS) for the deployment of software and services has been deployed, but the work does not stop there.
Using a resource like Cori requires careful scalability considerations, such as avoiding metadata lookups, the scalability of distributed filesystems, and the real limitations of containerized environments on HPC. Additionally, CVMFS clients are not compatible with Cori nodes, and one needs to rely on an indirect NFS mount scheme using custom services known as DVS servers, which are designed to forward data to worker nodes. In our contribution, we discuss our past strategies and our current solution based on CVMFS. The second focus of our presentation is strategies for making the most efficient use of the database Shifter containers serving our data production (a near “database as a service” approach) and the best methods to test and scale a workflow efficiently.

Publisher

EDP Sciences

Link

https://www.epj-conferences.org/10.1051/epjconf/202024509003/pdf

References: 8 articles.

1. Poat M. D., Lauret J., Porter J., Balewski J. (ACAT2019 Proceedings, to be published) https://indico.cern.ch/event/708041/contributions/3276344/

2. Top 500 – November 2019 https://www.top500.org/lists/2019/11/

3. CernVM File System (CernVM-FS) https://cernvm.cern.ch/portal/filesystem

4. The FUSE Module https://www.kernel.org/doc/Documentation/filesystems/fuse.txt

5. Cray Data Virtualization Service (DVS) https://pubs.cray.com/content/S-0005/CLE%206.0.UP05/xctm-series-dvs-administration-guide/introduction-to-dvs

Cited by 2 articles.

1. Data transfer for STAR grid jobs;Journal of Physics: Conference Series;2023-02-01

2. Visual Analysis Application for the Error Messages Clustering Framework;Procedia Computer Science;2021