Comparative assessment of synthetic time series generation approaches in healthcare: leveraging patient metadata for accurate data synthesis

Published:2024-01-30 Issue:1 Volume:24
ISSN:1472-6947
Container-title:BMC Medical Informatics and Decision Making
language:en
Short-container-title:BMC Med Inform Decis Mak

Author:

Isasa Imanol,Hernandez Mikel,Epelde Gorka,Londoño Francisco,Beristain Andoni,Larrea Xabat,Alberdi Ane,Bamidis Panagiotis,Konstantinidis Evdokimos

Abstract

Background: Synthetic data is an emerging approach for addressing legal and regulatory concerns in biomedical research that deals with personal and clinical data, whether as a single tool or through its combination with other privacy enhancing technologies. Generating uncompromised synthetic data could significantly benefit external researchers performing secondary analyses by providing unlimited access to information while fulfilling pertinent regulations. However, the original data to be synthesized (e.g., data acquired in Living Labs) may consist of subjects’ metadata (static) and a longitudinal component (set of time-dependent measurements), making it challenging to produce coherent synthetic counterparts.

Methods: Three synthetic time series generation approaches were defined and compared in this work: only generating the metadata and coupling it with the real time series from the original data (A1), generating both metadata and time series separately to join them afterwards (A2), and jointly generating both metadata and time series (A3). The comparative assessment of the three approaches was carried out using two different synthetic data generation models: the Wasserstein GAN with Gradient Penalty (WGAN-GP) and the DöppelGANger (DGAN). The experiments were performed with three different healthcare-related longitudinal datasets: Treadmill Maximal Effort Test (TMET) measurements from the University of Malaga (1), a hypotension subset derived from the MIMIC-III v1.4 database (2), and a lifelogging dataset named PMData (3).

Results: Three pivotal dimensions were assessed on the generated synthetic data: resemblance to the original data (1), utility (2), and privacy level (3). The optimal approach fluctuates based on the assessed dimension and metric.

Conclusion: The initial characteristics of the datasets to be synthesized play a crucial role in determining the best approach. Coupling synthetic metadata with real time series (A1), as well as jointly generating synthetic time series and metadata (A3), are both competitive methods, while separately generating time series and metadata (A2) appears to perform more poorly overall.
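
The difference between the three approaches (A1, A2, A3) comes down to how static metadata and time series are paired in each synthetic record. The sketch below is only a toy illustration of that pairing, not the authors' implementation: the dataset is invented, `placeholder_generator` is a hypothetical stand-in (resampling plus noise) for a trained model such as WGAN-GP or DGAN, and the nearest-neighbour coupling used for A1 is an assumption, since the abstract does not say how synthetic metadata is matched to real series.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "real" dataset (hypothetical): one static metadata value per subject and
# a time series whose level depends on that value.
n_subjects, n_steps = 200, 24
real_meta = rng.normal(50.0, 10.0, size=n_subjects)
real_series = real_meta[:, None] + rng.normal(0.0, 2.0, size=(n_subjects, n_steps))


def placeholder_generator(data, n, rng):
    """Stand-in for a trained generative model: resample rows and add noise."""
    idx = rng.integers(0, len(data), size=n)
    sampled = data[idx]
    return sampled + rng.normal(0.0, 0.5, size=sampled.shape)


def approach_a1(meta, series, n, rng):
    """A1: synthetic metadata coupled with real time series.

    Assumption: each synthetic record borrows the series of the real subject
    whose metadata value is closest.
    """
    syn_meta = placeholder_generator(meta, n, rng)
    nearest = np.abs(syn_meta[:, None] - meta[None, :]).argmin(axis=1)
    return syn_meta, series[nearest]


def approach_a2(meta, series, n, rng):
    """A2: metadata and time series generated separately, then joined row by row."""
    return placeholder_generator(meta, n, rng), placeholder_generator(series, n, rng)


def approach_a3(meta, series, n, rng):
    """A3: metadata and time series generated jointly as a single record."""
    joint = np.column_stack([meta, series])
    syn = placeholder_generator(joint, n, rng)
    return syn[:, 0], syn[:, 1:]


def meta_series_corr(meta, series):
    """Correlation between the metadata value and the per-subject series mean."""
    return float(np.corrcoef(meta, series.mean(axis=1))[0, 1])


results = {
    name: meta_series_corr(*fn(real_meta, real_series, 200, rng))
    for name, fn in [("A1", approach_a1), ("A2", approach_a2), ("A3", approach_a3)]
}
print(results)
```

In this toy setup the separate draws in A2 pair each metadata value with an unrelated series, so the link between the two is lost, while A1 and A3 keep it. This illustrates the coupling problem only and does not reproduce the resemblance, utility, or privacy results reported in the paper.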

Funder

Horizon 2020 Framework Programme

Department of Education, Universities and Research of the Basque Country

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s12911-024-02427-0.pdf

Cited by 1 article.

1. A Secure Data Publishing and Access Service for Sensitive Data from Living Labs: Enabling Collaboration with External Researchers via Shareable Data;Big Data and Cognitive Computing;2024-05-28