Establishing Best Practices for Generating Synthetic Health Data: Model Development and Validation (Preprint)

Author:

Karimian Sichani Elnaz, Smith Aaron, El Emam Khaled, Mosquera Lucy

Abstract

BACKGROUND

Electronic health records (EHRs) are a valuable source of patient information, but they must be properly de-identified before they can be shared with researchers, a process that requires expertise and time. Synthetic data, by contrast, considerably reduces the restrictions on the use and sharing of real data, allowing researchers to access data more rapidly and with far fewer privacy constraints. There is growing interest in methods for generating synthetic data that protect patients' privacy while faithfully reflecting the original data.

OBJECTIVE

The goal of this paper is to develop a model that generates valuable synthetic longitudinal health data while protecting the privacy of the patients whose data is collected.

METHODS

We investigate which model is best suited to generating synthetic health data, with a focus on longitudinal observations. We develop a generative model based on the generalized canonical polyadic (GCP) tensor decomposition. The model also involves sampling from a latent factor matrix that contains patient information, using sequential decision trees, copula, and Hamiltonian Monte Carlo methods. The model is applied to samples from the MIMIC-III dataset. Numerous analyses and experiments were conducted to arrive at a method that provides the best results.
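
The factorize-then-sample idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a patients x time x features tensor, replaces the GCP decomposition with an ordinary CP decomposition fitted by alternating least squares (squared-error loss only), and replaces the three samplers named above with a simple multivariate Gaussian fitted to the patient factor matrix. The function names `khatri_rao`, `cp_als`, and `synthesize` are hypothetical, and the data are random toy values rather than MIMIC-III.

```python
import numpy as np


def khatri_rao(B, C):
    """Column-wise Kronecker product of B (J x R) and C (K x R), shape (J*K, R)."""
    R = B.shape[1]
    return np.einsum("jr,kr->jkr", B, C).reshape(-1, R)


def cp_als(X, rank, n_iter=300, seed=0):
    """Plain CP decomposition of a 3-way tensor by alternating least squares.

    Assumption: squared-error loss. A generalized CP model would swap in a
    loss suited to the data type (for example binary or count data).
    """
    rng = np.random.default_rng(seed)
    I, J, K = X.shape
    A = rng.standard_normal((I, rank))  # patient factors
    B = rng.standard_normal((J, rank))  # time factors
    C = rng.standard_normal((K, rank))  # feature factors
    X1 = X.reshape(I, J * K)                      # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(J, I * K)   # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(K, I * J)   # mode-3 unfolding
    for _ in range(n_iter):
        # Each update is a least-squares solve with the other two factors fixed.
        A = X1 @ np.linalg.pinv(khatri_rao(B, C)).T
        B = X2 @ np.linalg.pinv(khatri_rao(A, C)).T
        C = X3 @ np.linalg.pinv(khatri_rao(A, B)).T
    return A, B, C


def synthesize(A, B, C, n_new, seed=0):
    """Sample n_new synthetic patient rows and rebuild a longitudinal tensor.

    Assumption: a multivariate Gaussian stands in for the sequential decision
    tree, copula, and Hamiltonian Monte Carlo samplers named in the abstract.
    """
    rng = np.random.default_rng(seed)
    mean = A.mean(axis=0)
    cov = np.atleast_2d(np.cov(A, rowvar=False))
    A_new = rng.multivariate_normal(mean, cov, size=n_new)
    return np.einsum("ir,jr,kr->ijk", A_new, B, C)


# Toy data: 30 patients, 6 time points, 5 features, exact rank 2.
rng = np.random.default_rng(1)
X = np.einsum(
    "ir,jr,kr->ijk",
    rng.standard_normal((30, 2)),
    rng.standard_normal((6, 2)),
    rng.standard_normal((5, 2)),
)
A, B, C = cp_als(X, rank=2)
X_syn = synthesize(A, B, C, n_new=50)  # more patients than the original 30
print(X_syn.shape)
```

Only the small patient factor matrix `A` (one row per patient, one column per component) is modeled and resampled, while the time and feature factors are reused unchanged. This is why the number of synthetic patients can differ from the number in the original data.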

RESULTS

In certain experiments, all three simulation methods used in the model achieved the same high level of performance. Our proposed model addresses the challenge of sampling patients from electronic health records: the synthetic dataset can contain a variety of simulated patients, and their number may differ from the number of patients in the original data. The analyses indicate that our model is a promising method for generating longitudinal health data.

CONCLUSIONS

We have presented a generative model for producing synthetic longitudinal health data. The model is formulated using the generalized CP decomposition. We have provided three approaches for synthesizing and simulating the latent factor matrix obtained from the factorization. In short, we have reduced the challenge of synthesizing massive longitudinal health data to that of synthesizing a non-longitudinal and significantly smaller dataset.

Publisher

JMIR Publications Inc.
