Abstract
Background
Word embeddings are dense numeric vectors used to represent language in neural networks. Until recently, there had been no publicly released embeddings trained on clinical data. Our work is the first to study the privacy implications of releasing these models.
Objective
This paper aims to demonstrate that traditional word embeddings trained on clinical corpora that have been deidentified by removing personal health information (PHI) can nonetheless be exploited to reveal sensitive patient information.
Methods
We used embeddings created from 400,000 doctor-written consultation notes and experimented with 3 common word embedding methods to explore the privacy-preserving properties of each.
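The following is a minimal sketch of this kind of training setup, not the authors' pipeline: it trains one common embedding method (word2vec via gensim, an assumed choice) on a hypothetical file of PHI-removed notes; the file name, tokenization, and hyperparameters are illustrative only.

```python
from gensim.models import Word2Vec

def load_notes(path):
    """Yield whitespace-tokenized, lowercased notes, one note per line."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            tokens = line.lower().split()
            if tokens:
                yield tokens

# "deidentified_notes.txt" is a hypothetical corpus of PHI-removed notes.
sentences = list(load_notes("deidentified_notes.txt"))
model = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
# The released artifact studied here is the vector table, not the raw notes.
model.wv.save("clinical_embeddings.kv")
```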
Results
We found that if publicly released embeddings are trained on a corpus anonymized by PHI removal, it is possible to reconstruct up to 68.5% (n=411/600) of the full names that remain in the deidentified corpus and to associate sensitive information with specific patients in the corpus from which the embeddings were created. We also found that the distance between the word vector representation of a patient's name and that of a diagnostic billing code is informative: it differs significantly from the distance between the same name and a code not billed for that patient.
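The distance signal described above can be illustrated with a short sketch. It assumes the vectors saved in the earlier example and that a surviving full name and diagnostic codes appear as single vocabulary tokens; all token names are hypothetical.

```python
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load("clinical_embeddings.kv")  # vectors from the sketch above

def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical tokens: a surviving full-name token, a code billed for that
# patient, and a code never billed for them.
name_vec = wv["john_smith"]
billed_vec = wv["i50"]
unbilled_vec = wv["c61"]

print("billed code distance:  ", cosine_distance(name_vec, billed_vec))
print("unbilled code distance:", cosine_distance(name_vec, unbilled_vec))
# A systematically smaller billed-code distance across many patients is the
# signal that lets sensitive diagnoses be linked to names left in the corpus.
```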
Conclusions
Special care must be taken when sharing word embeddings created from clinical texts, as current approaches may compromise patient privacy. If PHI removal is used for anonymization before traditional word embeddings are trained, it is possible to attribute sensitive information to patients who have not been fully deidentified by the (necessarily imperfect) removal algorithms. A promising alternative (ie, anonymization by PHI replacement) may avoid these flaws. Our results are timely and critical, as an increasing number of researchers are pushing for publicly available health data.