Synthetic Tabular Data Evaluation in the Health Domain Covering Resemblance, Utility, and Privacy Dimensions

Author:

Hernandez Mikel (1), Epelde Gorka (1, 2), Alberdi Ane (3), Cilla Rodrigo (1), Rankin Debbie (4)

Affiliation:

1. Digital Health and Biomedical Technologies Department, Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain

2. eHealth Group, Biodonostia Health Research Institute, Donostia-San Sebastian, Spain

3. Biomedical Engineering Department, Mondragon Unibertsitatea, Arrasate-Mondragón, Spain

4. School of Computing, Engineering and Intelligent Systems, Ulster University, Derry-Londonderry, United Kingdom

Abstract

Background: Synthetic tabular data generation is a promising technology for data augmentation and privacy preservation. However, prior to adoption, generated synthetic tabular data must be assessed empirically across the dimensions relevant to the target application to determine its efficacy. The literature lacks a standardized and objective evaluation and benchmarking strategy for synthetic tabular data in the health domain.

Objective: The aim of this paper is to identify key dimensions, per-dimension metrics, and methods for evaluating synthetic tabular data generated with different techniques and configurations for health domain application development, and to provide a strategy to orchestrate them.

Methods: Based on the literature, the resemblance, utility, and privacy dimensions have been prioritized, and a collection of metrics and methods for their evaluation has been orchestrated into a complete evaluation pipeline. This enables a guided and comparative assessment of generated synthetic tabular data, classifying its quality into three categories (“Excellent,” “Good,” and “Poor”). Six health care-related datasets and four synthetic tabular data generation approaches were chosen for an analysis and evaluation that verifies the utility of the proposed evaluation pipeline.

Results: The synthetic tabular data generated with the four selected approaches maintained resemblance, utility, and privacy for most combinations of dataset and generation approach. In several datasets, some approaches outperformed others, while in other datasets, more than one approach yielded the same performance.

Conclusion: The results show that the proposed pipeline can be used effectively to evaluate and benchmark the synthetic tabular data generated by various approaches. Therefore, this pipeline can support the scientific community in selecting the most suitable synthetic tabular data generation approaches for their data and application of interest.
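The three-dimension evaluation described in the Methods part of the abstract can be illustrated with a minimal Python sketch. The abstract does not name the individual metrics or the cut-offs behind the three quality categories, so everything specific here is an assumption made for illustration: the Kolmogorov-Smirnov statistic stands in for resemblance, a train-on-synthetic, test-on-real 1-nearest-neighbour accuracy stands in for utility, the share of non-copied rows stands in for privacy, and the 0.9 and 0.7 thresholds are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


def resemblance(real_X, synth_X):
    """Assumed resemblance metric: mean (1 - KS) over the numeric columns."""
    per_column = [
        1 - ks_statistic([r[i] for r in real_X], [s[i] for s in synth_X])
        for i in range(len(real_X[0]))
    ]
    return sum(per_column) / len(per_column)


def sq_dist(u, v):
    """Squared Euclidean distance between two rows."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


def utility(synth_X, synth_y, real_X, real_y):
    """Assumed utility metric: 1-NN classifier trained on synthetic, tested on real."""
    hits = 0
    for x, y in zip(real_X, real_y):
        nearest = min(range(len(synth_X)), key=lambda i: sq_dist(x, synth_X[i]))
        hits += int(synth_y[nearest] == y)
    return hits / len(real_X)


def privacy(real_X, synth_X):
    """Assumed privacy metric: share of synthetic rows that are not exact real rows."""
    real_rows = {tuple(r) for r in real_X}
    return sum(1 for s in synth_X if tuple(s) not in real_rows) / len(synth_X)


def categorize(score, excellent=0.9, good=0.7):
    """Map a score in [0, 1] to a category; both thresholds are hypothetical."""
    if score >= excellent:
        return "Excellent"
    if score >= good:
        return "Good"
    return "Poor"


def evaluate(real_X, real_y, synth_X, synth_y):
    """Run the three stand-in metrics and attach a category to each dimension."""
    scores = {
        "resemblance": resemblance(real_X, synth_X),
        "utility": utility(synth_X, synth_y, real_X, real_y),
        "privacy": privacy(real_X, synth_X),
    }
    return {dim: (score, categorize(score)) for dim, score in scores.items()}
```

Calling `evaluate(real_X, real_y, synth_X, synth_y)` on lists of numeric rows and labels returns one `(score, category)` pair per dimension, which makes it straightforward to compare several generation approaches on the same dataset. A real evaluation would replace each stand-in with the metrics defined in the paper itself.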

Funder

Department of Economic Development and Infrastructure of the Basque Government through Emaitek Plus Action Plan Programme

Publisher

Georg Thieme Verlag KG

Subject

Health Information Management, Advanced and Specialized Nursing, Health Informatics

Cited by 15 articles.