Umpire 2.0: Simulating realistic, mixed-type, clinical data for machine learning

Published: 2021-03-05 Volume: 9 Page: 1186
ISSN: 2046-1402
Container-title: F1000Research
Language: en
Short-container-title: F1000Res

Author:

Coombes Caitlin E., Abrams Zachary B., Nakayiza Samantha, Brock Guy, Coombes Kevin R.

Abstract

The Umpire 2.0 R-package offers a streamlined, user-friendly workflow to simulate complex, heterogeneous, mixed-type data with known subgroup identities, dichotomous outcomes, and time-to-event data, while providing ample opportunities for fine-tuning and flexibility. Here, we describe how we have expanded the core Umpire 1.0 R-package, developed to simulate gene expression data, to generate clinically realistic, mixed-type data for use in evaluating unsupervised and supervised machine learning (ML) methods. As the availability of large-scale clinical data for ML has increased, clinical data has posed unique challenges, including widely variable size, individual biological heterogeneity, data collection and measurement noise, and mixed data types. Developing and validating ML methods for clinical data requires data sets with known ground truth, generated from simulation. Umpire 2.0 addresses challenges to simulating realistic clinical data by providing the user a series of modules to generate survival parameters and subgroups, apply meaningful additive noise, and discretize to single or mixed data types. Umpire 2.0 provides broad functionality across sample sizes, feature spaces, and data types, allowing the user to simulate correlated, heterogeneous, binary, continuous, categorical, or mixed type data from the scale of a small clinical trial to data on thousands of patients drawn from electronic health records. The user may generate elaborate simulations by varying parameters in order to compare algorithms or interrogate operating characteristics of an algorithm in both supervised and unsupervised ML.
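
The modular workflow described in the abstract (subgroups with known identities, additive noise, discretization to mixed data types) can be illustrated with a minimal sketch. This is not the Umpire API, and it is written in Python rather than R. The function `simulate_mixed`, its parameters, the Gaussian noise model, and the round-robin assignment of feature types are all hypothetical choices made for illustration only.

```python
import random
import statistics


def simulate_mixed(n_patients=200, n_features=9, n_clusters=3,
                   noise_sd=0.3, seed=0):
    """Toy mixed-type simulation with known subgroup labels.

    Hypothetical illustration only; it does not reproduce Umpire's models.
    """
    rng = random.Random(seed)

    # Ground truth: every patient belongs to exactly one subgroup.
    labels = [rng.randrange(n_clusters) for _ in range(n_patients)]

    # Each subgroup has its own mean for every feature.
    centers = [[rng.gauss(0, 2) for _ in range(n_features)]
               for _ in range(n_clusters)]

    # Continuous signal plus additive Gaussian measurement noise.
    raw = [[centers[k][j] + rng.gauss(0, 1) + rng.gauss(0, noise_sd)
            for j in range(n_features)]
           for k in labels]

    # Assumed rule: feature types cycle continuous, binary, categorical.
    kinds = [("continuous", "binary", "categorical")[j % 3]
             for j in range(n_features)]

    columns = []
    for j, kind in enumerate(kinds):
        col = [row[j] for row in raw]
        if kind == "binary":
            # Dichotomize at the median.
            cut = statistics.median(col)
            col = [int(v > cut) for v in col]
        elif kind == "categorical":
            # Bin into three ordered levels at the tercile cut points.
            lo, hi = statistics.quantiles(col, n=3)
            col = ["low" if v <= lo else "mid" if v <= hi else "high"
                   for v in col]
        columns.append(col)

    return labels, kinds, columns
```

Calling `labels, kinds, columns = simulate_mixed()` returns the true subgroup label of each patient alongside the simulated features, so a clustering or classification method can be scored against known ground truth.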

Funder

National Center for Advancing Translational Sciences

National Cancer Institute

Publisher

F1000 Research Ltd

Subject

General Pharmacology, Toxicology and Pharmaceutics; General Immunology and Microbiology; General Biochemistry, Genetics and Molecular Biology; General Medicine

Link

https://f1000research.com/articles/9-1186/v2/pdf

References (21 articles)

1. W Raghupathi. Big data analytics in healthcare: promise and potential. Health Inf Sci Syst. 2014.

2. J Cook. The rise of big clinical databases. Br J Surg. 2015.

3. C Coombes. Unsupervised machine learning and prognostic factors of survival in chronic lymphocytic leukemia. J Am Med Inform Assoc. 2020.

4. P Castaldi. Do COPD subtypes really exist? COPD heterogeneity and clustering in 10 independent cohorts. Thorax. 2017.

5. M Pikoula. Identifying clinically important COPD sub-types using data-driven approaches in primary care population based electronic health records. BMC Med Inform Decis Mak. 2019.

Cited by 3 articles.

1. SillyPutty: Improved clustering by optimizing the silhouette width. 2023-11-11.

2. Subsampling and Aggregation: A Solution to the Scalability Problem in Distance-Based Prediction for Mixed-Type Data. Mathematics. 2021-09-13.

3. Simulation-derived best practices for clustering clinical data. Journal of Biomedical Informatics. 2021-06.