Natural Language Processing in Urology: Automated Extraction of Clinical Information from Histopathology Reports of Uro-Oncology Procedures (Preprint)

Author:

Huang Honghong, Lim Fiona Xin Yi, Gu Gary Tianyu, Han Jiangchou Matthew, Fang Andrew Hao Sen, Chia Elian Hui San, Bei Yen Tze Eileen, Tham Sarah Zhuling, Sun Aixin, Lim Kheng Sit

Abstract

BACKGROUND

Clinical information is primarily stored digitally as free text in electronic health records (EHRs) at the Singapore General Hospital (SGH). Traditional extraction of registry data fields is manual, laborious and prone to errors.

OBJECTIVE

We aimed to automate routine extraction of clinically relevant unstructured information from uro-oncological histopathology reports by applying rule-based and machine learning (ML)/deep learning (DL) methods to develop an oncology-focused natural language processing (NLP) algorithm.

METHODS

Our algorithm combines a rule-based approach with support vector machines and neural networks (BioBERT/Clinical BERT), and is optimised for accuracy. We randomly extracted 5772 uro-oncological histology reports dated 2008 to 2018 from EHRs and split the data into training and validation datasets in an 80:20 ratio. The training dataset was annotated by medical professionals and reviewed by cancer registrars. The validation dataset was annotated by cancer registrars and defined as the gold standard against which the algorithm's output was compared. The accuracy of the NLP-parsed data was assessed against these human annotations. Following our cancer registry's definition for professional human extraction, we defined an accuracy rate of >95% as “acceptable”.
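The evaluation described above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the function names, the dictionary layout (report ID mapped to field/value pairs), the exact-match comparison and the fixed random seed are all assumptions made for the example. Only the 80:20 ratio and the >95% threshold come from the text.

```python
import random

ACCEPTABLE = 0.95  # threshold stated in the abstract (accuracy > 95%)


def split_reports(report_ids, train_fraction=0.8, seed=0):
    """Shuffle report IDs and split them into training and validation lists.

    The seed and the shuffle-then-cut strategy are assumptions for this sketch.
    """
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def field_accuracy(predicted, gold):
    """Per-field fraction of reports whose parsed value equals the gold value.

    Both arguments map report_id -> {field_name: value} (hypothetical layout).
    A field missing from the parsed output counts as incorrect.
    """
    correct, total = {}, {}
    for report_id, gold_fields in gold.items():
        parsed = predicted.get(report_id, {})
        for field, gold_value in gold_fields.items():
            total[field] = total.get(field, 0) + 1
            if parsed.get(field) == gold_value:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / total[field] for field in total}


def acceptable_fields(accuracy, threshold=ACCEPTABLE):
    """Sorted names of fields whose accuracy is strictly above the threshold."""
    return sorted(f for f, acc in accuracy.items() if acc > threshold)
```

Calling `field_accuracy(parsed, gold)` on the validation set and passing the result to `acceptable_fields` yields the list of data fields that meet the registry standard under these assumptions.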

RESULTS

There were 11 extraction variables in 268 free-text reports. Our algorithm achieved accuracy rates between 61.2% and 99.0%. Of the 11 data fields, 8 met the acceptable accuracy standard, while the other 3 had accuracy rates between 61.2% and 89.7%. Notably, the rule-based approach proved more effective and robust in extracting the variables of interest. The ML/DL models, by contrast, had poorer predictive performance because of highly imbalanced data distribution and differences in writing style between the reports and the data used to pre-train the domain-specific models.

CONCLUSIONS

We designed an NLP algorithm that can automate clinical information extraction accurately from histopathology reports with an overall average micro accuracy of 93.3%. The algorithm can be modified and continuously improved to extract new and existing variables.
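The abstract does not define "micro accuracy"; a common reading is micro-averaging, in which correct extractions are pooled over all fields before dividing, so fields with more annotated values carry more weight. The sketch below contrasts that with a macro average (the unweighted mean of per-field accuracies). The function names and the `(n_correct, n_total)` input layout are assumptions for illustration, and the numbers in the usage note are made up, not taken from the study.

```python
def micro_accuracy(counts):
    """Pooled accuracy: total correct extractions over total annotated values.

    `counts` maps field_name -> (n_correct, n_total); this layout is hypothetical.
    """
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct / total


def macro_accuracy(counts):
    """Unweighted mean of the per-field accuracies."""
    return sum(c / t for c, t in counts.values()) / len(counts)
```

For example, with two hypothetical fields scored 9 of 10 and 1 of 2, the micro average is 10/12 while the macro average is 0.7, so the two summaries can differ noticeably when fields have unequal numbers of annotated values.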

Publisher

JMIR Publications Inc.
