Improving domain adaptation in de-identification of electronic health records through self-training

Published:2021-08-07 Issue:10 Volume:28 Page:2093-2100
ISSN:1527-974X
Container-title:Journal of the American Medical Informatics Association
language:en

Author:

Liao Shun¹,², Kiros Jamie³, Chen Jiyang³, Zhang Zhaolei¹,², Chen Ting³

Affiliation:

1. Department of Computer Science, University of Toronto, Toronto, Ontario, Canada

2. Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Ontario, Canada

3. Google, Toronto, Ontario, Canada

Abstract

Objective: De-identification is a fundamental task for electronic health records: it removes protected health information entities. Deep learning models have proven to be promising tools for automating de-identification. However, when the target domain (where the model is applied) differs from the source domain (where the model is trained), the model often suffers a significant performance drop, commonly referred to as the domain adaptation issue. In de-identification, this issue can leave the model vulnerable when deployed. In this work, we aim to close the domain gap by leveraging unlabeled data from the target domain.

Materials and Methods: We introduce a self-training framework that addresses the domain adaptation issue by leveraging unlabeled data from the target domain. We validate its effectiveness on 4 standard de-identification datasets. In each experiment, we use a pair of datasets: labeled data from the source domain and unlabeled data from the target domain. We compare the proposed self-training framework with supervised learning that directly deploys the model trained on the source domain.

Results: On average, the proposed framework improves the F1-score by 5.38 compared with direct deployment. For example, using i2b2-2014 as the training dataset and i2b2-2006 as the test dataset, the proposed framework increases the F1-score from 76.61 to 85.41 (+8.80). The method also increases the F1-score by 10.86 for mimic-radiology and mimic-discharge.

Conclusion: Our work demonstrates an effective self-training framework that boosts domain adaptation performance for the de-identification of electronic health records.
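The abstract does not spell out the training procedure, so the following is only a generic sketch of self-training (pseudo-labeling) and not the authors' implementation. It assumes a toy one-dimensional nearest-centroid classifier in place of a deep de-identification model; the names `fit_centroids`, `predict_with_confidence`, `self_train`, the labels "PHI" and "O", the confidence measure, and all data values are hypothetical and chosen for illustration.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Toy stand-in for model training: one mean feature value per class."""
    return {c: mean(x for x, y in zip(xs, ys) if y == c) for c in set(ys)}


def predict_with_confidence(centroids, x):
    """Return the nearest class and a margin-based confidence in [0, 1]."""
    dists = sorted((abs(x - m), c) for c, m in centroids.items())
    (d0, label), (d1, _) = dists[0], dists[1]
    total = d0 + d1
    confidence = (d1 - d0) / total if total > 0 else 0.0
    return label, confidence


def self_train(src_x, src_y, tgt_x, threshold=0.3, rounds=3):
    """Generic self-training loop (an assumption, not the paper's exact recipe):
    1. train on labeled source data,
    2. pseudo-label the unlabeled target data, keeping confident predictions,
    3. retrain on source data plus the kept pseudo-labels, and repeat.
    """
    centroids = fit_centroids(src_x, src_y)
    for _ in range(rounds):
        pseudo_x, pseudo_y = [], []
        for x in tgt_x:
            label, confidence = predict_with_confidence(centroids, x)
            if confidence >= threshold:
                pseudo_x.append(x)
                pseudo_y.append(label)
        centroids = fit_centroids(src_x + pseudo_x, src_y + pseudo_y)
    return centroids


# Hypothetical data: labeled source domain and unlabeled, shifted target domain.
src_x = [0.0, 0.2, 0.8, 1.0]
src_y = ["O", "O", "PHI", "PHI"]
tgt_x = [0.3, 0.4, 1.4, 1.5, 1.6]

direct = fit_centroids(src_x, src_y)        # direct deployment baseline
adapted = self_train(src_x, src_y, tgt_x)   # after self-training
print("direct: ", direct)
print("adapted:", adapted)
```

In this toy setup the class centroids drift toward the target-domain values once confident pseudo-labels are added, which is the intuition behind using unlabeled target data; a real system would replace the centroid model with a sequence tagger and evaluate with entity-level F1.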

Publisher

Oxford University Press (OUP)

Subject

Health Informatics

Link

http://academic.oup.com/jamia/article-pdf/28/10/2093/40408836/ocab128.pdf

References: 36 articles.

1. Scalable and accurate deep learning with electronic health records;Rajkomar;NPJ Digit Med,2018

2. Automated de-identification of free-text medical records;Neamatullah;BMC Med Inform Decis Mak,2008

3. HIPAA and protecting health information in the 21st Century;Cohen;JAMA,2018

4. De-identification of patient notes with recurrent neural networks;Dernoncourt;J Am Med Inform Assoc,2017

5. De-identification of clinical notes via recurrent neural network and conditional random field;Liu;J Biomed Inform,2017

Cited by 2 articles.

1. De-identification of clinical free text using natural language processing: A systematic review of current approaches;Artificial Intelligence in Medicine;2024-05

2. Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse;Methods of Information in Medicine;2024-03-05