Fake It Till You Make It: Guidelines for Effective Synthetic Data Generation

Published: 2021-02-28 Issue: 5 Volume: 11 Page: 2158
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Dankar Fida K., Ibrahim Mahmoud

Abstract

Synthetic data provides a privacy-protecting mechanism for the broad usage and sharing of healthcare data for secondary purposes. It is considered a safe approach for the sharing of sensitive data as it generates an artificial dataset that contains no identifiable information. Synthetic data is increasing in popularity, with multiple synthetic data generators developed in the past decade, yet its utility is still a subject of research. This paper is concerned with evaluating the effect of various synthetic data generation and usage settings on the utility of the generated synthetic data and its derived models. Specifically, we investigate (i) the effect of data pre-processing on the utility of the synthetic data generated, (ii) whether tuning should be applied to the synthetic datasets when generating supervised machine learning models, and (iii) whether sharing preliminary machine learning results can improve the synthetic data models. Lastly, (iv) we investigate whether one utility measure (propensity score) can predict the accuracy of the machine learning models generated from the synthetic data when employed in real life. We use two popular measures of synthetic data utility, propensity score and classification accuracy, to compare the different settings. We adopt a recent mechanism for the calculation of propensity, which looks carefully into the choice of model for the propensity score calculation. Accordingly, this paper takes a new direction by investigating the effect of various data generation and usage settings on the quality of the generated data and its ensuing models. The goal is to provide guidance on the best strategies to follow when generating and using synthetic data.

Funder

UAEU UPAR Grant

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/11/5/2158/pdf

References: 48 articles.

1. The potential for artificial intelligence in healthcare

2. AI-Assisted Decision-making in Healthcare

3. Developing a data infrastructure for a learning health system: the PORTAL network

4. Estimating the re-identification risk of clinical data sets

Cited by 55 articles.

1. In silico assessment of nanoparticle toxicity powered by the Enalos Cloud Platform: Integrating automated machine learning and synthetic data for enhanced nanosafety evaluation;Computational and Structural Biotechnology Journal;2024-12

2. A Systematic Review of Synthetic Data Generation Techniques Using Generative AI;Electronics;2024-09-04

3. Multimodal Transformers and Their Applications in Drug Target Discovery for Aging and Age-Related Diseases;The Journals of Gerontology, Series A: Biological Sciences and Medical Sciences;2024-08-10

4. Synthetic Data and its Utility in Pathology and Laboratory Medicine;Laboratory Investigation;2024-08

5. Latent Diffusion Models with Image-Derived Annotations for Enhanced AI-Assisted Cancer Diagnosis in Histopathology;Diagnostics;2024-07-05