Evaluating Cluster-Based Synthetic Data Generation for Blood-Transfusion Analysis

Authors:

Kroes Shannon K. S. (1, 2, 3, 4), van Leeuwen Matthijs (2), Groenwold Rolf H. H. (3, 5), Janssen Mart P. (4)

Affiliations:

1. Netherlands Organisation for Applied Scientific Research (TNO), Anna van Buerenplein 1, 2595 DA The Hague, The Netherlands

2. Leiden Institute of Advanced Computer Science, Leiden University, 2333 CA Leiden, The Netherlands

3. Department of Clinical Epidemiology, Leiden University Medical Center, 2333 ZA Leiden, The Netherlands

4. Transfusion Technology Assessment Group, Donor Medicine Research Department, Sanquin Research, 1066 CX Amsterdam, The Netherlands

5. Department of Biomedical Data Sciences, Leiden University Medical Center, 2333 ZA Leiden, The Netherlands

Abstract

Synthetic data generation is becoming an increasingly popular approach to making privacy-sensitive data available for analysis. Recently, cluster-based synthetic data generation (CBSDG) has been proposed, which uses explainable and tractable techniques for privacy preservation. Although the algorithm demonstrated promising performance on simulated data, CBSDG has not yet been applied to real, personal data. In this work, a published blood-transfusion analysis is replicated with synthetic data to assess whether CBSDG can reproduce more complex and intricate variable relations than previously evaluated. Data from the Dutch national blood bank, consisting of 250,729 donation records, were used to predict donor hemoglobin (Hb) levels by means of support vector machines (SVMs). Precision scores were equal to the original data results for both male (0.997) and female (0.987) donors; recall was 0.007 higher for male donors and 0.003 lower for female donors (original estimates 0.739 and 0.637, respectively). The impact of the variables on Hb predictions was similar, as quantified and visualized with Shapley additive explanation values. Opportunities for attribute disclosure were decreased for all but two variables; only the binary variables Deferral Status and Sex could still be inferred. Such inference was also possible for donors who were not used as input for the generator and may result from correlations in the data as opposed to overfitting in the synthetic-data-generation process. The high predictive performance obtained with the synthetic data shows the potential of CBSDG for practical implementation.

Funder

Sanquin Blood Supply Foundation

Publisher

MDPI AG

Subject

General Earth and Planetary Sciences, General Environmental Science

