Harnessing Moderate-Sized Language Models for Reliable Patient Data De-identification in Emergency Department Records: An Evaluation of Strategies and Performance (Preprint)

Author:

Dorémus Océane, Russon Dylan, Contrand Benjamin, Guerra-Adames Ariel, Avalos-Fernandez Marta, Gil-Jardiné Cédric, Lagarde Emmanuel

Abstract

BACKGROUND

The digitization of healthcare, facilitated by the adoption of electronic health record (EHR) systems, has revolutionized data-driven medical research and patient care. While this digital transformation offers substantial benefits in healthcare efficiency and accessibility, it concurrently raises significant concerns over privacy and data security. Early efforts to de-identify patient data progressed from rule-based systems to hybrid approaches incorporating machine learning. Subsequently, the emergence of Large Language Models (LLMs) has opened a further opportunity in this domain, offering unparalleled potential for accurate, context-sensitive de-identification. However, the deployment of the most advanced models in hospital environments is frequently hindered by data security concerns and the extensive hardware resources they require.

OBJECTIVE

The objective of our study is to design, implement, and evaluate de-identification algorithms by employing fine-tuning of moderate-sized open-source language models, ensuring their suitability for production inference tasks on personal computers.

METHODS

We aimed to replace personal identifying information (PII) with generic placeholders, or to label non-PII texts as 'ANONYMOUS', ensuring privacy while preserving textual integrity. Our dataset, derived from over 425,000 clinical notes from the adult emergency department of the Bordeaux University Hospital in France, underwent independent double annotation by two experts to create a validation reference of 3,000 randomly selected clinical notes. Three open-source language models of manageable size were selected for their feasibility in hospital settings: Llama 2 7B, Mistral 7B, and Mixtral 8x7B. Fine-tuning used the quantized Low-Rank Adaptation (qLoRA) technique. Evaluation focused on PII-level metrics (recall, precision, and F1-score) and clinical note-level metrics (recall and BLEU), assessing de-identification effectiveness and content preservation.
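The PII-level metrics described above can be sketched as follows. This is a minimal illustration of exact-match span scoring, not the authors' actual evaluation code; the `(start, end, label)` span representation and the function name are assumptions.

```python
# Hedged sketch: PII-level precision, recall, and F1 computed over
# exact-match spans. Gold spans come from the expert double annotation;
# predicted spans come from the model's de-identified output.

def pii_metrics(gold_spans, pred_spans):
    """Return (precision, recall, F1) for exact-match PII spans."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)  # spans both annotated and predicted
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative case: one name found, one date missed, one spurious span.
gold = [(0, 6, "NAME"), (20, 30, "DATE")]
pred = [(0, 6, "NAME"), (40, 45, "CITY")]
p, r, f1 = pii_metrics(gold, pred)  # p = 0.5, r = 0.5, f1 = 0.5
```

Note-level recall, as reported in the Results, would instead count a whole clinical note as a success only if every PII span in it is removed.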

RESULTS

The generative model Mistral 7B demonstrated the highest performance, with an overall F1-score of 0.9673 (vs. 0.8750 for Llama 2 and 0.8686 for Mixtral 8x7B). At the clinical note level, the same model achieved an overall recall of 0.9326 (vs. 0.6888 for Llama 2 and 0.6417 for Mixtral 8x7B). This rate increased to 0.9915 for the anonymization of names with Mistral 7B. Four notes out of the total 3,000 failed to be fully anonymized for names: in one case, the non-anonymized name belonged to a patient, while in the other three cases, it belonged to medical staff. Beyond the fifth epoch, the BLEU score consistently exceeded 0.9864, indicating no significant text alteration due to the process.

CONCLUSIONS

Our research underscores the significant capabilities of generative NLP models, with Mistral 7B standing out for its superior ability to de-identify clinical texts efficiently. Achieving notable performance metrics, Mistral 7B operates effectively without requiring high-end computational resources. These methods pave the way for a broader availability of anonymized clinical texts, enabling their use for research purposes and the optimization of the healthcare system.

Publisher

JMIR Publications Inc.
