Utility of Features in a Natural-Language-Processing-Based Clinical De-Identification Model Using Radiology Reports for Advanced NSCLC Patients

Author:

Paul Tanmoy, Islam Humayera, Singh Nitesh, Jampani Yaswitha, Kotapati Teja Venkat Pavan, Tautam Preethi Aishwarya, Rana Md Kamruz Zaman, Mandhadi Vasanthi, Sharma Vishakha, Barnes Michael, Hammer Richard D., Mosa Abu Saleh Mohammad

Abstract

The de-identification of clinical reports is essential to protect patient confidentiality. Named entity recognition (NER) models based on natural language processing (NLP) are widely used for automatic clinical de-identification, and the performance of such machine learning models depends largely on the proper selection of features. The objective of this study was to investigate the utility of various features in a conditional-random-field (CRF)-based NER model. NLP toolkits were used to annotate the protected health information (PHI) in a total of 10,239 radiology reports, which were divided into seven types. A total of 10 features were extracted by the toolkit, and NER models were built using these features and their combinations. Model performance was evaluated by precision, recall, and F1-score. The best-performing features were n-gram, prefix-suffix, word embedding, and word shape, which outperformed the others across all report types. The dataset was large and spanned multiple report types; this diversity ensured that the results did not depend on a small number of structured texts from which a machine learning model can easily learn the features. Because the manual de-identification of large-scale clinical reports is impractical, this study helps to identify, from the wide array of features mentioned in the literature, the best-performing ones for building an NER model for automatic de-identification.
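To make the feature names in the abstract concrete, the following is a minimal Python sketch of how word shape, prefix-suffix, and n-gram features might be derived per token, together with token-level precision, recall, and F1. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not name the toolkit or its settings, so the function names (`word_shape`, `token_features`, `precision_recall_f1`), the affix length of 3, the bigram window, and the `O` label for non-PHI tokens are all hypothetical. Word embeddings are omitted because they require a pretrained model.

```python
def word_shape(token):
    """Collapse characters into classes: uppercase -> X, lowercase -> x, digit -> d.
    Any other character (punctuation, slashes) is kept as-is."""
    shape = []
    for ch in token:
        if ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        elif ch.isdigit():
            shape.append("d")
        else:
            shape.append(ch)
    return "".join(shape)


def token_features(tokens, i, affix_len=3):
    """Build a feature dict for tokens[i] (hypothetical feature names).
    Dicts of this form are the usual input to CRF sequence taggers."""
    tok = tokens[i]
    # Sentinels mark sentence boundaries for the bigram features.
    prev_tok = tokens[i - 1].lower() if i > 0 else "<BOS>"
    next_tok = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return {
        "word.lower": tok.lower(),
        "shape": word_shape(tok),
        "prefix": tok[:affix_len],
        "suffix": tok[-affix_len:],
        "bigram.prev": prev_tok + " " + tok.lower(),
        "bigram.next": tok.lower() + " " + next_tok,
    }


def precision_recall_f1(gold, pred, outside="O"):
    """Token-level micro-averaged scores over PHI labels.
    `outside` is the assumed label for tokens that are not PHI."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if p != outside and g == p)
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    fn = sum(1 for g, p in pairs if g != outside and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


if __name__ == "__main__":
    sentence = ["Seen", "by", "Dr.", "Smith", "on", "01/02/2020"]
    for idx in range(len(sentence)):
        print(token_features(sentence, idx))
```

Word shape is useful for PHI because dates, record numbers, and capitalized names have characteristic patterns that generalize across report types even when the exact strings were never seen in training.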

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science


Cited by 1 article.
