BACKGROUND
The use of medical data is often restricted by requirements for protecting personal information. To address this problem, methods for synthesizing data with generative models are gaining attention. However, synthesizing medical data with generative models is a new field, and there are no established methods for evaluating the quality of synthetic data.
OBJECTIVE
To synthesize medical data using real-world time-series generative adversarial networks (RTSGAN), evaluate the quality of the synthetic data through quantitative and qualitative methods, apply the synthetic data to real-world medical artificial intelligence (AI) models, and assess the disclosure risk of the synthetic data.
METHODS
Data were synthesized using the RTSGAN based on a real dataset of 15,799 patients with colorectal cancer. The quality of the synthetic data was evaluated using quantitative methods, such as the Hellinger distance; train on synthetic, test on real (TSTR); train on real, test on synthetic (TRTS); and propensity mean squared error (PMSE), and qualitative methods, including t-SNE and histograms. We applied the synthetic data to a real-world model predicting the five-year survival of patients with colorectal cancer and compared its performance with that of a model using real data, employing measures such as the C-index, Brier score, and integrated Brier score (IBS). Finally, we measured the distance between the synthetic and real data using the distance to closest record (DCR) to assess the potential for privacy exposure.
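Two of the quantitative measures named above can be sketched in a few lines. The following is a minimal illustration, not the study's implementation: the function names, the number of histogram bins, and the assumption that propensity scores have already been produced by some classifier trained to separate real from synthetic records are all hypothetical choices made here.

```python
import numpy as np


def hellinger_distance(real, synthetic, bins=20):
    """Hellinger distance between the histograms of one numeric feature.

    Assumption: both samples are binned on a shared, equal-width grid.
    """
    real = np.asarray(real, dtype=float)
    synthetic = np.asarray(synthetic, dtype=float)
    lo = min(real.min(), synthetic.min())
    hi = max(real.max(), synthetic.max())
    if lo == hi:
        # Both samples are the same constant, so the distributions coincide.
        return 0.0
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synthetic, bins=edges)
    # Convert counts to probabilities before comparing.
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2)))


def propensity_mse(scores, is_synthetic):
    """Mean squared deviation of propensity scores from the synthetic share.

    scores: predicted probability that each record is synthetic
            (assumed to come from a separately trained classifier).
    is_synthetic: 1 for synthetic records, 0 for real records.
    """
    scores = np.asarray(scores, dtype=float)
    is_synthetic = np.asarray(is_synthetic, dtype=bool)
    c = is_synthetic.mean()  # proportion of synthetic records in the pool
    return float(np.mean((scores - c) ** 2))
```

The Hellinger distance lies between 0 (identical histograms) and 1 (no overlap), and the propensity MSE is 0 when the classifier cannot separate the two sources at all.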
RESULTS
In total, 53,005 data points were obtained. The Hellinger distance ranged from 0 to 0.25. The TSTR and TRTS results showed average areas under the curve of 0.99 and 0.98, respectively, and the PMSE was 0.223. The t-SNE and histogram analyses confirmed that the synthetic and real data were similar. The C-index, Brier score, and IBS were 0.742, 0.075, and 0.107, respectively, for the model using synthetic data and 0.777, 0.098, and 0.138, respectively, for the model using real data. The DCR evaluation showed that the minimum distance between real records was 2.45 and the minimum distance between real and synthetic records was 3.46.
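The two minimum distances reported above correspond to a nearest-neighbor search within the real data and from the synthetic data to the real data. A minimal sketch follows, assuming Euclidean distance on features that are already numeric and scaled; the abstract does not state the metric or preprocessing, and the function name is hypothetical.

```python
import numpy as np


def distance_to_closest_record(queries, reference, exclude_self=False):
    """Distance from each query row to its nearest row in `reference`.

    Assumptions: Euclidean distance; inputs are 2-D numeric arrays with the
    same columns. Set exclude_self=True when queries and reference are the
    same array, so that a record is not matched to itself.
    """
    queries = np.asarray(queries, dtype=float)
    reference = np.asarray(reference, dtype=float)
    # Full pairwise distance matrix; fine for a sketch, memory-heavy at scale.
    diffs = queries[:, None, :] - reference[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    if exclude_self:
        np.fill_diagonal(dists, np.inf)
    return dists.min(axis=1)


# Toy data standing in for real and synthetic records.
real = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
synthetic = np.array([[0.0, 1.0], [6.0, 6.0]])

min_real_to_real = distance_to_closest_record(real, real, exclude_self=True).min()
min_synth_to_real = distance_to_closest_record(synthetic, real).min()
```

A synthetic-to-real minimum that is not smaller than the real-to-real minimum means that no synthetic record sits closer to a real record than real records sit to each other.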
CONCLUSIONS
It is feasible to synthesize combined time-series and static medical data using the RTSGAN. Evaluation with quantitative and qualitative methods, as well as with real-world AI models, indicates that the synthetic data accurately reflect the characteristics of the real data. Additionally, our investigation confirms that the synthesized data pose no privacy concerns.