Exploring the tradeoff between data privacy and utility with a clinical data analysis use case

Published: 2024-05-30
Volume: 24, Issue: 1
ISSN: 1472-6947
Container-title: BMC Medical Informatics and Decision Making
Short-container-title: BMC Med Inform Decis Mak
Language: en
Authors: Im Eunyoung, Kim Hyeoneui, Lee Hyungbok, Jiang Xiaoqian, Kim Ju Han
Abstract
Background
Securing adequate data privacy is critical for the productive utilization of data. De-identification, which involves masking or replacing specific values in a dataset, can damage the dataset's utility, and finding a reasonable balance between data privacy and utility is not straightforward. However, few studies have investigated how de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset's utility with a clinical analytic use case and to assess the feasibility of finding a workable tradeoff between data privacy and utility.
Methods
Predictive modeling of emergency department length of stay was used as a data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated based on various de-identification configurations using ARX, an open-source software for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset. We examined the association between data privacy and utility to determine whether it is feasible to identify a viable tradeoff between the two.
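The de-identification configurations described above were applied with ARX, which (among other privacy models) enforces k-anonymity by generalizing quasi-identifiers and suppressing records whose quasi-identifier combination is too rare. ARX is a Java tool; the following is only a minimal Python sketch of that principle under assumed, hypothetical field names (`age`, `sex`, `los`), not the authors' actual ARX configuration.

```python
from collections import Counter

def generalize_age(age, width=10):
    """Generalize an exact age into a 10-year interval, e.g. 37 -> '30-39'."""
    lo = (age // width) * width
    return f"{lo}-{lo + width - 1}"

def k_anonymize(records, k=5):
    """Generalize quasi-identifiers (here: age), then suppress records whose
    generalized quasi-identifier combination occurs fewer than k times.
    Returns the retained records and the number of suppressed records."""
    generalized = [
        {"age": generalize_age(r["age"]), "sex": r["sex"], "los": r["los"]}
        for r in records
    ]
    counts = Counter((g["age"], g["sex"]) for g in generalized)
    kept = [g for g in generalized if counts[(g["age"], g["sex"])] >= k]
    return kept, len(generalized) - len(kept)
```

Record suppression of this kind is exactly what the study found to compromise utility: records (and, under heavier generalization, whole variables) disappear from the dataset used to fit the prediction model.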
Results
All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores.
Conclusions
As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data’s intended use and involving input from data users. This approach could help find a suitable compromise between data privacy and utility.
Funder
Korean Ministry of Health and Welfare
Ministry of Education
National Research Foundation of Korea
Publisher
Springer Science and Business Media LLC