Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients

Published:2022-10-04 Issue:19 Volume:12 Page:9976
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Paul Tanmoy, Islam Humayera, Singh Nitesh, Jampani Yaswitha, Kotapati Teja Venkat Pavan, Tautam Preethi Aishwarya, Rana Md Kamruz Zaman, Mandhadi Vasanthi, Sharma Vishakha, Barnes Michael, Hammer Richard D., Mosa Abu Saleh Mohammad

Abstract

The de-identification of clinical reports is essential to protect the confidentiality of patients. Natural-language-processing-based named entity recognition (NER) is a widely used technique for automatic clinical de-identification. The performance of such a machine learning model relies largely on the proper selection of features. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. Natural language processing (NLP) toolkits were used to annotate the protected health information (PHI) in a total of 10,239 radiology reports that were divided into seven types. Multiple features were extracted by the toolkit, and NER models were built using these features and their combinations. A total of 10 features were extracted, and the performance of the models was evaluated based on their precision, recall, and F1-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape. These features outperformed the others across all types of reports. The dataset was large in volume and divided into multiple types of reports. Such a diverse dataset ensured that the results did not depend on a small number of structured texts from which a machine learning model can easily learn the features. The manual de-identification of large-scale clinical reports is impractical. This study helps to identify, from the wide array of features mentioned in the literature, the best-performing features for building an NER model for automatic de-identification.
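
To make the feature types named in the abstract more concrete, the following is a minimal Python sketch of how word-shape, prefix-suffix, and n-gram (here, bigram) features might be computed per token before being passed to a CRF tagger. It is an illustration under stated assumptions, not the authors' implementation: the function names, the feature keys, the affix length of 3, and the `<BOS>`/`<EOS>` boundary markers are all hypothetical. Word embeddings are omitted because they require pretrained vectors.

```python
import re


def word_shape(token: str) -> str:
    """Coarse shape: uppercase -> 'X', lowercase -> 'x', digit -> 'd'."""
    shape = re.sub(r"[A-Z]", "X", token)
    shape = re.sub(r"[a-z]", "x", shape)
    # Digits are replaced last so the inserted 'd' is not rewritten to 'x'.
    return re.sub(r"[0-9]", "d", shape)


def token_features(tokens, i, affix_len=3):
    """Build a feature dictionary for tokens[i] (affix_len is an assumed setting)."""
    tok = tokens[i]
    prev_tok = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_tok = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        "word.lower": tok.lower(),
        "shape": word_shape(tok),             # word-shape feature
        "prefix": tok[:affix_len],            # prefix feature
        "suffix": tok[-affix_len:],           # suffix feature
        "bigram.prev": prev_tok + " " + tok.lower(),  # n-gram with left context
        "bigram.next": tok.lower() + " " + next_tok,  # n-gram with right context
    }


def sentence_features(tokens):
    """One feature dictionary per token, in sentence order."""
    return [token_features(tokens, i) for i in range(len(tokens))]


# Hypothetical example sentence, not taken from the study's data.
tokens = ["Dr.", "Smith", "reviewed", "CT", "on", "10/04/2022"]
feats = sentence_features(tokens)
```

In this sketch a date-like token and a capitalized surname receive distinct shapes, which is the kind of signal a word-shape feature is meant to give a sequence model. A list of per-token dictionaries is a common input layout for CRF libraries, though the toolkit used in the study may represent features differently.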

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/12/19/9976/pdf

References: 25 articles.

1. Automated de-identification of free-text medical records

2. Department of Health and Human Services Protecting Personal Health Information in Research: Understanding the HIPAA Privacy Rule; 2003; ISBN 2800228032

3. The Method of Medical Named Entity Recognition Based on Semantic Model and Improved SVM-KNN Algorithm;Xia;Proceedings of the 7th International Conference on Semantics, Knowledge, and Grids, SKG 2011,2011

4. Biomedical text mining and its applications in cancer research

5. Challenges in Clinical Named Entity Recognition for Decision Support;Dehghan;Proceedings of the 2013 IEEE International Conference on Systems, Man, and Cybernetics, SMC 2013,2013

Cited by 1 article.

1. De-identification of clinical free text using natural language processing: A systematic review of current approaches;Artificial Intelligence in Medicine;2024-05