Synthesis and Quality Assessment of Combined Time-Series and Static Medical Data Using a Real-world Time-Series Generative Adversarial Network (Preprint)

Authors:

Song Kyoung Doo, Kim Jaewon, Choo Hyunwoo, Shin Soo-Yong

Abstract

BACKGROUND

The use of medical data often faces challenges due to personal information protection issues. To solve this problem, methods for synthesizing data using generative models are gaining attention. Synthesizing medical data using generative models is a new field, and there are no established methods for evaluating the quality of synthetic data.

OBJECTIVE

To synthesize medical data using real-world time-series generative adversarial networks (RTSGAN), evaluate the quality of synthetic data through quantitative and qualitative methods, apply them to real-world medical artificial intelligence (AI) models, and assess the disclosure risk of synthetic data.

METHODS

Data were synthesized using the RTSGAN based on a real dataset of 15,799 patients with colorectal cancer. The quality of the synthetic data was evaluated using quantitative methods, such as the Hellinger distance; train on synthetic, test on real (TSTR); train on real, test on synthetic (TRTS); and propensity mean squared error (PMSE), as well as qualitative methods, including t-SNE and histograms. We applied the synthetic data to a real-world model predicting the five-year survival of patients with colorectal cancer and compared its performance with that of a model trained on real data, using the C-index, Brier score, and integrated Brier score (IBS). Finally, we measured the distance between the synthetic and real data using the distance to closest records (DCR) to assess the potential for privacy exposure.
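As an illustration of the first quantitative measure named above, the following is a minimal sketch of the Hellinger distance between the empirical distributions of one categorical variable in a real and a synthetic sample. The function name `hellinger_distance` and the example values are hypothetical; the abstract does not state how variables were binned or which implementation was used.

```python
import math
from collections import Counter


def hellinger_distance(real_values, synthetic_values):
    """Hellinger distance between two empirical categorical distributions.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with no category in common.
    """
    real_counts = Counter(real_values)
    synth_counts = Counter(synthetic_values)
    categories = set(real_counts) | set(synth_counts)
    n_real = len(real_values)
    n_synth = len(synthetic_values)

    total = 0.0
    for category in categories:
        p = real_counts[category] / n_real      # relative frequency in real data
        q = synth_counts[category] / n_synth    # relative frequency in synthetic data
        total += (math.sqrt(p) - math.sqrt(q)) ** 2
    return math.sqrt(total / 2.0)


# Hypothetical example: a categorical variable such as tumor stage.
real_stage = ["I", "II", "II", "III"]
synthetic_stage = ["I", "II", "III", "III"]
distance = hellinger_distance(real_stage, synthetic_stage)
```

Computed per variable, values near 0 indicate that the synthetic marginal distribution closely matches the real one. A continuous variable would first need to be discretized into bins, which is an additional assumption not covered by this sketch.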

RESULTS

In total, 53,005 data points were obtained. The Hellinger distance ranged from 0 to 0.25. The TSTR and TRTS evaluations yielded average areas under the curve of 0.99 and 0.98, respectively, and the propensity MSE was 0.223. The t-SNE and histogram analyses confirmed that the synthetic and real data were similar. The C-index, Brier score, and IBS were 0.742, 0.075, and 0.107, respectively, for the model using synthetic data and 0.777, 0.098, and 0.138, respectively, for the model using real data. In the DCR evaluation, the minimum distance among real records was 2.45, and the minimum distance between real and synthetic records was 3.46.
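The DCR comparison reported above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance over numeric, already-scaled feature vectors; the abstract does not specify the distance metric or preprocessing, and the function names and toy records are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def distance_to_closest_record(queries, references, exclude_self=False):
    """For each query record, return the distance to its nearest reference record.

    Set exclude_self=True when queries and references are the same list,
    so that a record is not matched with itself (distance 0). In that case
    the list must contain at least two records.
    """
    distances = []
    for i, query in enumerate(queries):
        nearest = min(
            euclidean(query, reference)
            for j, reference in enumerate(references)
            if not (exclude_self and i == j)
        )
        distances.append(nearest)
    return distances


# Hypothetical toy records (two numeric features each).
real_records = [(0.0, 0.0), (3.0, 4.0)]
synthetic_records = [(0.0, 1.0)]

# Baseline: how close real records are to each other.
real_to_real = distance_to_closest_record(real_records, real_records, exclude_self=True)
# Privacy check: how close synthetic records are to real records.
synthetic_to_real = distance_to_closest_record(synthetic_records, real_records)
```

Under this reading, a minimum synthetic-to-real distance that is larger than the minimum real-to-real distance suggests that no synthetic record is a near copy of a real one.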

CONCLUSIONS

It is feasible to synthesize combined time-series and static medical data using the RTSGAN. Quantitative and qualitative evaluation, together with application to a real-world AI model, indicates that the synthetic data accurately reflect the characteristics of the real data. Additionally, our investigation confirms that the synthesized data pose no privacy concerns.

Publisher

JMIR Publications Inc.
