Authors:
Merzky Andre, Svirin Pavlo, Turilli Matteo
Abstract
PanDA executes millions of ATLAS jobs a month on Grid systems with more than 300,000 cores. Currently, PanDA is compatible with only a few high-performance computing (HPC) resources because of differing edge services and operational policies; it does not implement the pilot paradigm on HPC; and it does not dynamically optimize resource allocation among queues. We integrated the PanDA Harvester service and the RADICAL-Pilot (RP) system to overcome these limitations and enable the execution of ATLAS, Molecular Dynamics and other workloads on HPC resources. This paper offers two main contributions: (1) introducing PanDA Harvester and RADICAL-Pilot, two independently developed systems that support high-throughput computing (HTC) on HPC infrastructures; (2) describing the integration between these two systems to produce a middleware component with unique functionalities, including the concurrent execution of heterogeneous workloads on the Titan OLCF machine. We integrated Harvester and RP by prototyping a Next Generation Executor (NGE) to expose RP capabilities and manage the execution of PanDA workloads. In this way, we minimized the reengineering of the two systems, allowing their integration while both remained in production.