Exploring the tradeoff between data privacy and utility with a clinical data analysis use case-Reference-Cited by-同舟云学术

Exploring the tradeoff between data privacy and utility with a clinical data analysis use case

Published:2024-05-30 Issue:1 Volume:24 Page:
ISSN:1472-6947
Container-title:BMC Medical Informatics and Decision Making
language:en
Short-container-title:BMC Med Inform Decis Mak

Author:

Im Eunyoung,Kim Hyeoneui,Lee Hyungbok,Jiang Xiaoqian,Kim Ju Han

Abstract

Abstract Background Securing adequate data privacy is critical for the productive utilization of data. De-identification, involving masking or replacing specific values in a dataset, could damage the dataset’s utility. However, finding a reasonable balance between data privacy and utility is not straightforward. Nonetheless, few studies investigated how data de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset’s utility with a clinical analytic use case and assess the feasibility of finding a workable tradeoff between data privacy and utility. Methods Predictive modeling of emergency department length of stay was used as a data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated based on various de-identification configurations using ARX, an open-source software for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset. We examined the association between data privacy and utility to determine whether it is feasible to identify a viable tradeoff between the two. Results All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores. Conclusions As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data’s intended use and involving input from data users. This approach could help find a suitable compromise between data privacy and utility.

Funder

Korean Ministry of Health and Welfare

Ministry of Education

National Research Foundation of Korea

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s12911-024-02545-9.pdf

Reference54 articles.

1. Price WN, Cohen IG. Privacy in the age of medical big data. Nat Med. 2019;25(1):37–43.

2. Gostin LO, Halabi SF, Wilson K. Health data and privacy in the digital era. JAMA. 2018;320(3):233–4.

3. Data Protection and Privacy Legislation Worldwide | UNCTAD. https://unctad.org/page/data-protection-and-privacy-legislation-worldwide. Accessed 6 Oct 2022.

4. Health and Human Services. Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule. https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html#coveredentities (2022). Accessed 28 Mar 2024.

5. General Data Protection Regulation (GDPR). Article 32 GDPR(https://gdprhub.eu/index.php?title=Article_32_GDPR (2023). Accessed 4 Apr 2024.