Synthetic Tabular Data Based on Generative Adversarial Networks in Health Care: Generation and Validation Using the Divide-and-Conquer Strategy

Published: 2023-11-24
Issue:
Volume: 11
Page: e47859
ISSN: 2291-9694
Container-title: JMIR Medical Informatics
Language: en
Short-container-title: JMIR Med Inform

Author:

Kang Ha Ye Jin, Batbaatar Erdenebileg, Choi Dong-Woo, Choi Kui Son, Ko Minsam, Ryu Kwang Sun

Abstract

Background: Synthetic data generation (SDG) based on generative adversarial networks (GANs) is used in health care, but research on preserving data with logical relationships with synthetic tabular data (STD) remains challenging. Filtering methods for SDG can lead to the loss of important information.

Objective: This study proposed a divide-and-conquer (DC) method to generate STD based on the GAN algorithm, while preserving data with logical relationships.

Methods: The proposed method was evaluated on data from the Korea Association for Lung Cancer Registry (KALC-R) and 2 benchmark data sets (breast cancer and diabetes). The DC-based SDG strategy comprises 3 steps: (1) We used 2 different partitioning methods (the class-specific criterion distinguished between survival and death groups, while the Cramer V criterion identified the highest correlation between columns in the original data); (2) the entire data set was divided into a number of subsets, which were then used as input for the conditional tabular generative adversarial network and the copula generative adversarial network to generate synthetic data; and (3) the generated synthetic data were consolidated into a single entity. For validation, we compared DC-based SDG and conditional sampling (CS)–based SDG through the performances of machine learning models. In addition, we generated imbalanced and balanced synthetic data for each of the 3 data sets and compared their performance using 4 classifiers: decision tree (DT), random forest (RF), Extreme Gradient Boosting (XGBoost), and light gradient-boosting machine (LGBM) models.

Results: The synthetic data of the 3 diseases (non–small cell lung cancer [NSCLC], breast cancer, and diabetes) generated by our proposed model outperformed the 4 classifiers (DT, RF, XGBoost, and LGBM). The CS- versus DC-based model performances were compared using the mean area under the curve (SD) values:
DT: 74.87 (SD 0.77) versus 63.87 (SD 2.02) for NSCLC, 73.31 (SD 1.11) versus 67.96 (SD 2.15) for breast cancer, and 61.57 (SD 0.09) versus 60.08 (SD 0.17) for diabetes.
RF: 85.61 (SD 0.29) versus 79.01 (SD 1.20) for NSCLC, 78.05 (SD 1.59) versus 73.48 (SD 4.73) for breast cancer, and 59.98 (SD 0.24) versus 58.55 (SD 0.17) for diabetes.
XGBoost: 85.20 (SD 0.82) versus 76.42 (SD 0.93) for NSCLC, 77.86 (SD 2.27) versus 68.32 (SD 2.37) for breast cancer, and 60.18 (SD 0.20) versus 58.98 (SD 0.29) for diabetes.
LGBM: 85.14 (SD 0.77) versus 77.62 (SD 1.85) for NSCLC, 78.16 (SD 1.52) versus 70.02 (SD 2.17) for breast cancer, and 61.75 (SD 0.13) versus 61.12 (SD 0.23) for diabetes.
In addition, we found that balanced synthetic data performed better.

Conclusions: This study is the first attempt to generate and validate STD based on a DC approach and shows improved performance using STD. The necessity for balanced SDG was also demonstrated.
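
The 3-step DC strategy described in the Methods (partition, generate per subset, consolidate) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy columns (`status`, `death_cause`), and the `samples_per_subset` parameter are hypothetical, and the per-subset generator is a simple bootstrap resampler standing in for a trained GAN such as the conditional tabular GAN or copula GAN named in the abstract. The `cramers_v` helper shows one standard way to compute the Cramer V association between two categorical columns (without bias correction); how the study applied that criterion to choose partitions is not specified here.

```python
import math
import random
from collections import Counter, defaultdict


def cramers_v(xs, ys):
    """Cramer V between two categorical sequences (no bias correction)."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    row, col = Counter(xs), Counter(ys)
    if len(row) < 2 or len(col) < 2:
        return 0.0  # association is undefined for a constant column
    chi2 = 0.0
    for r, r_count in row.items():
        for c, c_count in col.items():
            expected = r_count * c_count / n
            chi2 += (joint.get((r, c), 0) - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (min(len(row), len(col)) - 1)))


def partition(rows, column):
    """Divide: split the rows into subsets keyed by the value of `column`."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[column]].append(row)
    return dict(subsets)


def placeholder_generator(subset, n_samples, rng):
    """Stand-in for a trained GAN (assumption): bootstrap resampling."""
    return [dict(rng.choice(subset)) for _ in range(n_samples)]


def divide_and_conquer_sdg(rows, column, rng,
                           generator=placeholder_generator,
                           samples_per_subset=None):
    """Generate synthetic rows per subset, then consolidate them."""
    synthetic = []
    for subset in partition(rows, column).values():
        # Default keeps the original class sizes (imbalanced output);
        # a fixed samples_per_subset yields balanced output.
        n = len(subset) if samples_per_subset is None else samples_per_subset
        synthetic.extend(generator(subset, n, rng))
    rng.shuffle(synthetic)  # consolidate into a single data set
    return synthetic


# Toy data with a logical relationship: survivors have no cause of death.
original = (
    [{"status": "survival", "death_cause": "none"}] * 6
    + [{"status": "death", "death_cause": "cancer"}] * 3
    + [{"status": "death", "death_cause": "other"}] * 1
)
association = cramers_v([r["status"] for r in original],
                        [r["death_cause"] for r in original])
synthetic = divide_and_conquer_sdg(original, "status", random.Random(0))
balanced = divide_and_conquer_sdg(original, "status", random.Random(1),
                                  samples_per_subset=5)
```

Because each subset is generated separately in this sketch, a synthetic row can never combine `status` "survival" with a recorded cause of death, which is the kind of logical relationship the abstract aims to preserve. Replacing `placeholder_generator` with a function that fits and samples a real tabular GAN on each subset would keep the same overall structure.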

Publisher

JMIR Publications Inc.

Subject

Health Information Management, Health Informatics

Cited by 5 articles.

1. Exploring Innovative Approaches to Synthetic Tabular Data Generation;Electronics;2024-05-17

2. Consolidated Reporting Guidelines for Prognostic and Diagnostic Machine Learning Models (CREMLS);Journal of Medical Internet Research;2024-05-02

3. Consolidated Reporting Guidelines for Prognostic and Diagnostic Machine Learning Models (CREMLS) (Preprint);2024-04-04

4. Tabular Transformer Generative Adversarial Network for Heterogeneous distribution in healthcare;2024-03-25

5. An Improved Lung Cancer Prediction Algorithm using Generative Adversarial Network in Modern Healthcare;2024 International Conference on Integrated Circuits and Communication Systems (ICICACS);2024-02-23