A study of deep learning methods for de-identification of clinical notes in cross-institute settings-Reference-Cited by-同舟云学术

A study of deep learning methods for de-identification of clinical notes in cross-institute settings

Published:2019-12 Issue:S5 Volume:19 Page:
ISSN:1472-6947
Container-title:BMC Medical Informatics and Decision Making
language:en
Short-container-title:BMC Med Inform Decis Mak

Author:

Yang Xi,Lyu Tianchen,Li Qian,Lee Chih-Yin,Bian Jiang,Hogan William R.,Wu Yonghui

Abstract

Abstract Background De-identification is a critical technology to facilitate the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great efforts in developing methods and corpora for de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies often utilized training and test data collected from the same institution. There are few studies to explore automated de-identification under cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods at a cross-institute setting, identify the bottlenecks, and provide potential solutions. Methods We created a de-identification corpus using a total 500 clinical notes from the University of Florida (UF) Health, developed deep learning-based de-identification models using 2014 i2b2/UTHealth corpus, and evaluated the performance using UF corpus. We compared five different word embeddings trained from the general English text, clinical text, and biomedical literature, explored lexical and linguistic features, and compared two strategies to customize the deep learning models using UF notes and resources. Results Pre-trained word embeddings using a general English corpus achieved better performance than embeddings from de-identified clinical text and biomedical literature. The performance of deep learning models trained using only i2b2 corpus significantly dropped (strict and relax F1 scores dropped from 0.9547 and 0.9646 to 0.8568 and 0.8958) when applied to another corpus annotated at UF Health. Linguistic features could further improve the performance of de-identification in cross-institute settings. After customizing the models using UF notes and resource, the best model achieved the strict and relaxed F1 scores of 0.9288 and 0.9584, respectively. Conclusions It is necessary to customize de-identification models using local clinical text and other resources when applied in cross-institute settings. Fine-tuning is a potential solution to re-use pre-trained parameters and reduce the training time to customize deep learning-based de-identification models trained using clinical corpus from a different institution.

Publisher

Springer Science and Business Media LLC

Subject

Health Informatics,Health Policy,Computer Science Applications

Link

http://link.springer.com/content/pdf/10.1186/s12911-019-0935-4.pdf

Reference41 articles.

1. Meystre SM, Friedlin FJ, South BR, Shen S, Samore MH. Automatic de-identification of textual documents in the electronic health record: a review of recent research. BMC Med Res Methodol. 2010;10:70.

2. Kayaalp M. Patient privacy in the era of big data. Balkan Med J. 2018;35:8–17.

3. Kayaalp M, Browne AC, Sagan P, McGee T, McDonald CJ. Challenges and insights in using HIPAA privacy rule for clinical text annotation. AMIA Annu Symp Proc. 2015;2015:707–16.

4. South BR, Mowery D, Suo Y, Leng J, Ferrández Ó, Meystre SM, et al. Evaluating the effects of machine pre-annotation and an interactive annotation interface on manual de-identification of clinical text. J Biomed Inform. 2014;50:162–72.

5. Dorr DA, Phillips WF, Phansalkar S, Sims SA, Hurdle JF. Assessing the difficulty and time cost of De-identification in clinical narratives. Methods Inf Med. 2018;45:246–52.

Cited by 48 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. Predicting Respiratory Diseases Attributed to PM2.5 Air Pollution in Nairobi County Using Random Forest Model;International Journal of Innovative Science and Research Technology (IJISRT);2024-09-04

2. Outpatient reception via collaboration between nurses and a large language model: a randomized controlled trial;Nature Medicine;2024-07-15

3. GAN-Enhanced Vocal Biomarker Analysis for Respiratory Health Assessment;International Journal of Advanced Research in Science, Communication and Technology;2024-06-14

4. De-identification of clinical free text using natural language processing: A systematic review of current approaches;Artificial Intelligence in Medicine;2024-05

5. Generative large language models are all-purpose text analytics engines: text-to-text learning is all your need;Journal of the American Medical Informatics Association;2024-04-17