DataPackageR: Reproducible data preprocessing, standardization and sharing using R/Bioconductor for collaborative data analysis

Published:2018-07-10 Volume:2 Page:31
ISSN:2572-4754
Container-title:Gates Open Research
language:en
Short-container-title:Gates Open Res

Author:

Finak Greg, Mayer Bryan, Fulp William, Obrecht Paul, Sato Alicia, Chung Eva, Holman Drienna, Gottardo Raphael

Abstract

A central tenet of reproducible research is that scientific results are published along with the underlying data and software code necessary to reproduce and verify the findings. A host of tools and software have been released that facilitate such work-flows, and scientific journals have increasingly demanded that code and primary data be made available with publications. There has been little practical advice on implementing reproducible research work-flows for large ’omics’ or systems biology data sets used by teams of analysts working in collaboration. In such instances, it is important to ensure all analysts use the same version of a data set for their analyses. Yet, instantiating relational databases and standard operating procedures can be unwieldy, with high "startup" costs and poor adherence to procedures when they deviate substantially from an analyst’s usual work-flow. Ideally, a reproducible research work-flow should fit naturally into an individual’s existing work-flow, with minimal disruption. Here, we provide an overview of how we have leveraged popular open source tools, including Bioconductor, Rmarkdown, git version control, R, and specifically R’s package system combined with a new tool, DataPackageR, to implement a lightweight reproducible research work-flow for preprocessing large data sets, suitable for sharing among small-to-medium-sized teams of computational scientists. Our primary contribution is the DataPackageR tool, which decouples time-consuming data processing from data analysis while leaving a traceable record of how raw data is processed into analysis-ready data sets. The software ensures packaged data objects are properly documented, performs checksum verification of these objects along with basic package version management and, importantly, leaves a record of data processing code in the form of package vignettes. Our group has implemented this work-flow to manage, analyze and report on pre-clinical immunological trial data from multi-center, multi-assay studies for the past three years.

Funder

National Institute of General Medical Sciences

Bill and Melinda Gates Foundation

Publisher

F1000 Research Ltd

Subject

Public Health, Environmental and Occupational Health; Health Policy; Immunology and Microbiology (miscellaneous); Biochemistry, Genetics and Molecular Biology (miscellaneous); Medicine (miscellaneous)

Reference 39 articles.

1. What information should be required to support clinical "omics" publications?; K Baggerly; Clin Chem., 2011

2. Statistical analyses and reproducible research.; R Gentleman, 2004

3. Packaging data analytical work reproducibly using R (and friends); B Marwick; PeerJ Preprints, 2018

4. Enabling reproducible research: Open licensing for scientific innovation.; V Stodden; International Journal of Communications Law and Policy., 2009

5. Publishing standards for computational science: "Setting the default to reproducible"; V Stodden, 2013

Cited by 5 articles.

1. Artificial Intelligence inspired method for cross-lingual cyberhate detection from low resource languages; ACM Transactions on Asian and Low-Resource Language Information Processing; 2024-08-16

2. Big data-based identification of methylated genes associated with drug resistance and prognosis in ovarian cancer; Medicine; 2020-07-02

3. Essential guidelines for computational method benchmarking; Genome Biology; 2019-06-20

4. Datastorr: a workflow and package for delivering successive versions of 'evolving data' directly into R; GigaScience; 2019-05-01

5. The Computational article format: Software as a research output; Cytometry Part A; 2018-12