Abstract
There is a growing demand for the uptake of modern artificial intelligence technologies within healthcare systems. Many of these technologies exploit historical patient health data to build powerful predictive models that can be used to improve diagnosis and understanding of disease. However, there are many issues concerning patient privacy that need to be accounted for in order to enable this data to be better harnessed by all sectors. One approach that could circumvent these privacy issues is the creation of realistic synthetic data sets that capture as many of the complexities of the original data set (distributions, non-linear relationships, and noise) as possible, while not including any real patient data. While previous research has explored models for generating synthetic data sets, here we explore the integration of resampling, probabilistic graphical modelling, latent variable identification, and outlier analysis for producing realistic synthetic data based on UK primary care patient data. In particular, we focus on handling missingness, complex interactions between variables, and the resulting sensitivity analysis statistics from machine learning classifiers, while quantifying the risks of patient re-identification from synthetic datapoints. We show that, through our approach of integrating outlier analysis with graphical modelling and resampling, we can achieve synthetic data sets that are not significantly different from the original ground truth data in terms of feature distributions, feature dependencies, and sensitivity analysis statistics when inferring machine learning classifiers. Moreover, the risk of generating synthetic data that is identical or very similar to real patients is shown to be low.
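The abstract's claim that synthetic feature distributions are "not significantly different" from the real data is typically checked with a two-sample distributional test. The sketch below is purely illustrative of that kind of check (it is not the paper's method or data): it computes the two-sample Kolmogorov-Smirnov statistic, the maximum gap between two empirical cumulative distribution functions, for a hypothetical real feature column and a synthetic stand-in drawn from a similar distribution.

```python
import random

random.seed(0)

# Hypothetical stand-ins for one feature column from a "real" and a
# "synthetic" data set (not the paper's UK primary care data).
real = [random.gauss(50.0, 10.0) for _ in range(1000)]
synthetic = [random.gauss(50.5, 10.5) for _ in range(1000)]

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    n, m = len(a), len(b)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        if a[i] < b[j]:
            i += 1
        elif b[j] < a[i]:
            j += 1
        else:
            # Tie: advance past the shared value in both samples.
            x = a[i]
            while i < n and a[i] == x:
                i += 1
            while j < m and b[j] == x:
                j += 1
        d = max(d, abs(i / n - j / m))
    return d

print(f"KS statistic = {ks_statistic(real, synthetic):.3f}")
```

A small statistic (close to 0) means the synthetic column tracks the real one closely; an identical sample gives exactly 0. In practice such a test would be run per feature, alongside checks on feature dependencies and classifier sensitivity statistics as the abstract describes.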
Funder
Innovate UK
Regulators’ Pioneer Fund, The Department for Business, Energy and Industrial Strategy (BEIS), administered by Innovate UK
Publisher
Springer Science and Business Media LLC
Subject
Health Information Management, Health Informatics, Computer Science Applications, Medicine (miscellaneous)
Cited by
84 articles.