DataCurator.jl: efficient, portable and reproducible validation, curation and transformation of large heterogeneous datasets using human-readable recipes compiled into machine-verifiable templates

Published: 2023-01-01 Issue: 1 Volume: 3
ISSN: 2635-0041
Container-title: Bioinformatics Advances
Language: en

Author:

Cardoen Ben¹, Ben Yedder Hanene¹, Lee Sieun²,³, Nabi Ivan Robert⁴,⁵, Hamarneh Ghassan¹

Affiliation:

1. Department of Computing Science, Simon Fraser University, 8888 University Dr W, Burnaby, British Columbia V5A 1S6, Canada

2. Precision Imaging Beacon, University of Nottingham, Nottingham NG7 2RD, UK

3. Department of Mental Health and Clinical Neuroscience, University of Nottingham, Nottingham NG7 2UH, UK

4. Life Sciences Institute, University of British Columbia, Vancouver, British Columbia V6T 1Z3, Canada

5. School of Biomedical Engineering, University of British Columbia, Vancouver, British Columbia V6T 1Z3, Canada

Abstract

Summary: Large-scale processing of heterogeneous datasets in interdisciplinary research often requires time-consuming manual data curation. Ambiguity in the data layout and preprocessing conventions can easily compromise reproducibility and scientific discovery, and even when detected, it takes domain experts time and effort to correct. Poor data curation can also interrupt processing jobs on large computing clusters, causing frustration and delays. We introduce DataCurator, a portable software package that verifies arbitrarily complex datasets of mixed formats and works as well on clusters as on local systems. Human-readable TOML recipes are converted into executable, machine-verifiable templates, enabling users to easily verify datasets using custom rules without writing code. Recipes can be used to transform and validate data, for pre- or post-processing, selection of data subsets, sampling and aggregation, such as summary statistics. Processing pipelines no longer need to be burdened by laborious data validation, with data curation and validation replaced by human- and machine-verifiable recipes specifying rules and actions. Multithreaded execution ensures scalability on clusters, and existing Julia, R and Python libraries can be reused. DataCurator enables efficient remote workflows, offering integration with Slack and the ability to transfer curated data to clusters using OwnCloud and SCP. Code available at: https://github.com/bencardoen/DataCurator.jl.

Publisher

Oxford University Press (OUP)

Subject

Computer Science Applications, Genetics, Molecular Biology, Structural Biology

Link

https://academic.oup.com/bioinformaticsadvances/advance-article-pdf/doi/10.1093/bioadv/vbad068/50503110/vbad068.pdf

References: 10 articles.

1. Julia: a fresh approach to numerical computing; Bezanson; SIAM Rev, 2017

2. Data preprocessing and intelligent data analysis; Famili; IDA, 1997

Cited by 1 article.

1. AI analysis of super-resolution microscopy: Biological discovery in the absence of ground truth; Journal of Cell Biology; 2024-06-12