Automatic Deidentification of French Electronic Health Records: A Cost-Effective Approach Exploiting Distant Supervision and Deep Learning Models

Author:

Azzouzi Mohamed El1, Coatrieux Gouenou2, Bellafqira Reda2, Delamarre Denis1, Riou Christine1, Oubenali Naima3, Cabon Sandie1, Cuggia Marc1, Bouzillé Guillaume1

Affiliation:

1. Univ Rennes, CHU Rennes, INSERM, LTSI-UMR 1099

2. IMT Atlantique, INSERM, LATIM - UMR 1101

3. Univ. Lille

Abstract

Background: Electronic health records (EHRs) contain valuable information for clinical research; however, the sensitive nature of healthcare data presents security and confidentiality challenges. Deidentification is therefore essential to protect personal data in EHRs and comply with government regulations. Named entity recognition (NER) methods have been proposed to remove personal identifiers, with deep learning-based models achieving better performance. However, manual annotation of training data is time-consuming and expensive. The aim of this study was to develop an automatic deidentification pipeline for all kinds of clinical documents, based on a distantly supervised method, in order to significantly reduce the cost of manual annotation and to facilitate the transfer of the pipeline to other clinical centers.

Methods: We proposed an automated annotation process for French clinical deidentification, exploiting data from the eHOP clinical data warehouse (CDW) of the CHU de Rennes and national knowledge bases, as well as other features. In addition, we proposed an assisted data annotation solution using the Prodigy annotation tool, which aims to reduce the cost of creating a reference corpus for the evaluation of state-of-the-art NER models. Finally, we evaluated and compared the effectiveness of different NER methods.

Results: A French deidentification dataset was developed in this work, based on EHRs provided by the eHOP CDW at Rennes University Hospital, France. The dataset was rich in personal information, and the distribution of entities was quite similar in the training and test datasets. We evaluated a Bi-LSTM + CRF sequence labeling architecture, combined with Flair + FastText word embeddings, on a test set of manually annotated clinical reports. The model outperformed the other tested models with an F1 score of 96.96%, demonstrating the effectiveness of our automatic approach for deidentifying sensitive information.

Conclusions: This study provides an automatic deidentification pipeline for clinical notes, which can facilitate the reuse of EHRs for secondary purposes such as clinical research. Our study highlights the importance of using advanced NLP techniques for effective deidentification, as well as the need for innovative solutions such as distant supervision to overcome the challenge of limited annotated data in the medical domain.
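The abstract does not detail how the automated annotation works. As a rough illustration of the general idea behind distant supervision for NER, the sketch below labels tokens in BIO format by matching them against dictionaries of known identifiers. This is not the authors' pipeline: the function names, the label set (`PATIENT`, `DOCTOR`, `LOCATION`), the tokenizer, and the example sentence are all hypothetical, and a real system would presumably draw its dictionaries from the data warehouse and knowledge bases mentioned above.

```python
import re
from typing import Dict, List

def tokenize(text: str) -> List[str]:
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def distant_bio_labels(tokens: List[str],
                       gazetteers: Dict[str, List[str]]) -> List[str]:
    """Assign BIO labels by case-insensitive dictionary matching.

    `gazetteers` maps a (hypothetical) entity label to a list of known
    surface forms. Longer entries are matched first, and a token that
    already carries a label is never overwritten.
    """
    labels = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]

    # Flatten the dictionaries and sort so that longer entries win.
    entries = sorted(
        ((label, tokenize(name.lower()))
         for label, names in gazetteers.items()
         for name in names),
        key=lambda item: len(item[1]),
        reverse=True,
    )

    for label, entry_tokens in entries:
        n = len(entry_tokens)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            end = start + n
            is_match = lowered[start:end] == entry_tokens
            is_free = all(lab == "O" for lab in labels[start:end])
            if is_match and is_free:
                labels[start] = "B-" + label
                for i in range(start + 1, end):
                    labels[i] = "I-" + label
    return labels

# Hypothetical example sentence and dictionaries, for illustration only.
example_tokens = tokenize("Le patient Jean Dupont a été vu à Rennes par le Dr Martin.")
example_gazetteers = {
    "PATIENT": ["Jean Dupont"],
    "DOCTOR": ["Martin"],
    "LOCATION": ["Rennes"],
}
example_labels = distant_bio_labels(example_tokens, example_gazetteers)
for token, label in zip(example_tokens, example_labels):
    print(f"{token}\t{label}")
```

Labels produced this way are noisy (ambiguous names, misspellings, identifiers missing from the dictionaries), which is one reason a manually annotated test set, as described in the abstract, is still needed to evaluate models trained on them.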

Publisher

Research Square Platform LLC
