Abstract
Abstract
Background
Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance when applied to new datasets.
Objective
We present practical options for clinical note de-identification, assessing performance of machine learning systems ranging from off-the-shelf to fully customized.
Methods
We implement a state-of-the-art machine learning de-identification system, training and testing on pairs of datasets that match the deployment scenarios. We use clinical notes from two i2b2 competition corpora, the Physionet Gold Standard corpus, and parts of the MIMIC-III dataset.
Results
Fully customized systems remove 97–99% of personally identifying information. Performance of off-the-shelf systems varies by dataset, with performance mostly above 90%. Providing a small labeled dataset or large unlabeled dataset allows for fine-tuning that improves performance over off-the-shelf systems.
Conclusion
Health organizations should be aware of the levels of customization available when selecting a de-identification deployment solution, in order to choose the one that best matches their resources and target performance level.
Publisher
Springer Science and Business Media LLC
Subject
Health Informatics,Health Policy,Computer Science Applications
Reference40 articles.
1. Chen X, Xie H, Wang FL, Liu Z, Xu J, Hao T. A bibliometric analysis of natural language processing in medical research. BMC Med Inform Decis Mak. 2018;18(Suppl 1):14.
2. PubMed search conducted 23 April 2018 using the following URL https://www.ncbi.nlm.nih.gov/pubmed/?term=%22free+text%22+OR+%22unstructured+text%22 showed 89 results in 2007 and 460 results in 2018.
3. Névéol A, Zweigenbaum P. Clinical natural language processing in 2014: foundational methods supporting efficient healthcare. Yearb Med Inform. 2015;10(1):194–8.
4. Meystre SM, Ferrández Ó, Friedlin FJ, South BR, Shen S, Samore MH. Text de-identification for privacy protection: a study of its impact on clinical text information content. J Biomed Inform. 2014;50:142–50.
5. Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. J Am Med Inform Assoc. 2017;24(3):596–606.
Cited by
17 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献