Challenges for Implementing FAIR Digital Objects with High Performance Workflows

Author:

Line Pouchard, Tanzima Islam, Bogdan Nicolae

Abstract

New types of workflows are being used in science that couple traditional distributed and high-performance computing (HPC) with data-intensive approaches, and orchestrate ensembles of numerical simulations and artificial intelligence (AI) models. Such workflows may use AI models to supplement computation where numerical simulations would be too computationally expensive, to automate trivial yet time-consuming operations, or to perform preliminary selections among intractable numbers of combinations in domains as diverse as protein binding, fine-grid climate simulations, and drug discovery. They offer renewed opportunities for scientific research but exhibit high computational, storage, and communication requirements [Goble et al. 2020, Al-Saadi et al. 2021, da Silva et al. 2021]. These workflows can be orchestrated by workflow management systems (WMS) and built upon composable blocks that facilitate task placement and resource allocation for parallel executions on high-performance systems [Lee et al. 2021, Merzky et al. 2021]. The scientific computing communities running these kinds of workflows have been slow to adopt the Findable, Accessible, Interoperable, and Reusable (FAIR) principles, in part due to the complexity of workflow life cycles, the number of WMS in use, and the specificity of HPC systems, with their rapidly evolving architectures and software stacks and execution modes that require resource managers and batch schedulers [Plale et al. 2021]. FAIR Digital Objects (FDO) that encapsulate bit sequences of data, metadata, types, and persistent identifiers (PID) can help promote the adoption of FAIR, enable knowledge extraction and dissemination, and contribute to re-use [De Smedt et al. 2020]. As workflows typically use data and software during planning and execution, FDOs are particularly well suited to enabling re-use [Wittenburg et al. 2020]. But the benefits of FDOs, such as automated data processing and actionable DO collections, cannot be realized unless the main components of FAIR, rich metadata and clear identifiers, are universally adopted in the community. These components are still elusive for HPC digital objects. Some metadata are added after results have been produced, are not described by controlled vocabularies, and are typically left unconstrained, resulting in inefficient processes and loss of knowledge. Persistent identifiers are attributed at publication time, and only to the data supporting conclusions, so only a very small amount of data is shared outside a small community of researchers “in the know”. In this conceptual work, one can distinguish several kinds of FDOs for HPC workflows that present both common and specific challenges to the development of a canonical DO infrastructure and the implementation of FDO workflows, which we discuss below: result FDOs, which represent computational results obtained when program execution completes; performance FDOs, which contain performance measures and results from code optimization on parallel, heterogeneous architectures; and intermediate FDOs, which capture intermediate states of workflow execution obtained from HPC checkpointing.
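To make the encapsulation concrete, the following is a minimal sketch, in Python, of how the three kinds of FDOs named above could be represented as records bundling a PID, a type, metadata, and a reference to the encapsulated bit sequence. The class and field names (FairDigitalObject, bit_sequence_uri, and so on) and the example values are illustrative assumptions, not part of any published FDO specification.

from dataclasses import dataclass, field
from enum import Enum

class FDOKind(Enum):
    RESULT = "result"              # computational results obtained when execution completes
    PERFORMANCE = "performance"    # performance measures from parallel, heterogeneous runs
    INTERMEDIATE = "intermediate"  # intermediate workflow states, e.g. from HPC checkpointing

@dataclass
class FairDigitalObject:
    pid: str                       # persistent identifier registered with a resolver
    kind: FDOKind                  # one of the three kinds distinguished above
    bit_sequence_uri: str          # where the encapsulated bits live (file, object store, ...)
    metadata: dict = field(default_factory=dict)     # rich, ideally controlled-vocabulary metadata
    environment: dict = field(default_factory=dict)  # computing environment and system specifications

# Hypothetical example of a result FDO; the handle and URIs are placeholders.
result_fdo = FairDigitalObject(
    pid="hdl:20.5000/placeholder-handle",
    kind=FDOKind.RESULT,
    bit_sequence_uri="https://example.org/runs/ensemble-042/output.h5",
    metadata={"workflow": "climate-ensemble", "created": "2022-05-01"},
    environment={"system": "example HPC cluster", "scheduler": "SLURM"},
)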
All these FDOs for HPC workflows should include the computing environment and the specifications of the system on which the code was executed, so that the metadata are rich enough to enable re-usability [Pouchard et al. 2019]. Containers are often used to capture the dependencies between underlying libraries and their versions in the execution environment for the installation and re-use of software code [Lofstead et al. 2015, Olaya et al. 2020]. But containers published in code repositories are made available without identifiers registered with resolvers. For instance, to attribute a Digital Object Identifier to software shared on GitHub, one must perform the additional step of registering the code with Zenodo. FDOs extracted and built in the context of a canonical workflow framework that includes collections will help with the attribution of persistent identifiers and the linking of the execution environment with data and workflow. Computational results may include machine learning predictions resulting from the stochastic training of non-deterministic models. Neural networks and deep learning models present specific challenges to result FDOs, related to provenance and to the selection of the quantities that need to be included in an FDO for the re-use of results. What information needs to be included in a FAIR Digital Object encapsulating deep learning results to make it persistent and re-usable? The description of method, data, and experiment recommended in [Gundersen and Kjensmo 2018] can be instantiated in an FDO collection. To make it re-usable, it should include the model architecture, the machine learning platform and its version, and a submission script that contains the hyperparameters, the loss function, the batch size, and the number of epochs [Pouchard et al. 2020]. Challenges specific to digital objects containing performance measures for HPC workflows are related to size, selection, and reduction. Performance data at scale tend to be very large, so a principled approach to selection is needed to determine which execution counters must be included in FDOs for the performance reproducibility of an application [Patki et al. 2019]. Performance FDOs should include the variables selected to show their impact on performance and the methods used for selection: do such variables represent outliers in performance metrics? What methods and thresholds are used to qualify them as outliers, and what impact do these outliers have on the overall performance of an execution? A key contributor to the failure to capture important information in HPC workflows is that metadata and provenance capture is often “bolted on” after the fact in a piecemeal, cumbersome, and inefficient manner that impedes further analysis. An FDO approach that includes DO collections at the appropriate level of abstraction and rich metadata is needed. Capturing metadata automatically must take into account the appropriate granularity level for re-use across system layers and abstraction levels. Intermediate FDOs capture and fuse metadata across multiple sources during the planning and execution stages [Nicolae 2022]. Some tools already exist: Darshan is a scalable tool that summarizes input/output file characteristics [Dai et al. 2019], and RADICAL-Cybertools [Merzky et al. 2021] can produce the provenance task graph of an execution. Such tools could be included in a canonical workflow framework, as they present a path forward for composable services for HPC and would guarantee a level of encapsulation into DOs favorable to re-use.
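As an illustration of the elements listed above for a deep learning result FDO (model architecture, machine learning platform and version, submission script, loss function, batch size, number of epochs), the following is a minimal sketch of the metadata such an FDO might carry. The keys and values are assumptions chosen for readability, not a schema proposed by the authors; in practice these entries would ideally be drawn from controlled vocabularies and populated automatically at execution time.

# Hypothetical metadata record for a result FDO encapsulating deep learning output.
# Every key and value below is illustrative, not taken from the paper.
deep_learning_result_metadata = {
    "model_architecture": "ResNet-50",
    "ml_platform": {"name": "PyTorch", "version": "1.12.0"},
    "submission_script": "train_job.sbatch",    # batch script holding the hyperparameters
    "hyperparameters": {
        "loss_function": "cross_entropy",
        "batch_size": 128,
        "epochs": 90,
        "random_seed": 42,                      # stochastic training: record the seed for re-use
    },
    "environment": {
        "container_image": "registry.example.org/dl-train:2022.05",  # captures library dependencies
        "system": "example HPC cluster",
        "resource_manager": "SLURM",
    },
}

# Such a record could populate the metadata field of the FairDigitalObject sketch above.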

Publisher

Pensoft Publishers

Subject

General Medicine

Cited by 2 articles.

1. Building the I (Interoperability) of FAIR for Performance Reproducibility of Large-Scale Composable Workflows in RECUP. 2023 IEEE 19th International Conference on e-Science (e-Science), 2023-10-09.

2. FAIR Enabling Re-Use of Data-Intensive Workflows and Scientific Reproducibility. Companion of the 2023 ACM/SPEC International Conference on Performance Engineering, 2023-04-15.
