Natural Language Processing in Urology: Automated Extraction of Clinical Information from Histopathology Reports of Uro-Oncology Procedures (Preprint)

Published: 2022-02-16

Author:

Huang Honghong, Lim Fiona Xin Yi, Gu Gary Tianyu, Han Jiangchou Matthew, Fang Andrew Hao Sen, Chia Elian Hui San, Bei Yen Tze Eileen, Tham Sarah Zhuling, Sun Aixin, Lim Kheng Sit

Abstract

BACKGROUND

Clinical information is primarily stored digitally as free text in electronic health records (EHRs) at the Singapore General Hospital (SGH). Traditional extraction of registry data fields is manual, laborious and prone to errors.

OBJECTIVE

We aimed to automate the routine extraction of clinically relevant unstructured information from uro-oncological histopathology reports by applying rule-based and machine learning (ML)/deep learning (DL) methods to develop an oncology-focused natural language processing (NLP) algorithm.

METHODS

Our algorithm combines a rule-based approach with support vector machines/neural networks (BioBERT/ClinicalBERT) and is optimised for accuracy. We randomly extracted 5772 uro-oncological histology reports dated from 2008 to 2018 from EHRs and split the data into training and validation datasets in an 80:20 ratio. The training dataset was annotated by medical professionals and reviewed by cancer registrars. The validation dataset was annotated by cancer registrars and defined as the gold standard against which the algorithm's output was compared. The accuracy of the NLP-parsed data was measured against these human annotations. Following our cancer registry's definition for professional human extraction, we defined an accuracy rate of >95% as “acceptable”.
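
The split and the comparison against the gold standard described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the seed, the dictionary layout and the toy field name `grade` are assumptions made for the example; only the 80:20 ratio, the report count and the >95% threshold come from the abstract.

```python
import random

ACCEPTABLE = 0.95  # the abstract's ">95%" acceptability threshold


def split_reports(report_ids, train_fraction=0.8, seed=0):
    """Shuffle report IDs and split them into training and validation lists."""
    ids = list(report_ids)
    random.Random(seed).shuffle(ids)  # seeded so the split is reproducible
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


def field_accuracy(predicted, gold, field):
    """Fraction of gold-standard reports whose extracted value matches exactly."""
    matches = sum(
        1 for rid, truth in gold.items()
        if predicted.get(rid, {}).get(field) == truth[field]
    )
    return matches / len(gold)


# 5772 reports, as in the abstract; the IDs themselves are placeholders.
train_ids, val_ids = split_reports(range(5772))

# Toy annotations (hypothetical field and values, for illustration only).
gold = {1: {"grade": "high"}, 2: {"grade": "low"}}
predicted = {1: {"grade": "high"}, 2: {"grade": "high"}}
accuracy = field_accuracy(predicted, gold, "grade")
is_acceptable = accuracy > ACCEPTABLE
```

Exact string matching is the strictest possible comparison; a real evaluation would probably normalise values (case, whitespace, synonyms) before comparing.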

RESULTS

There were 11 extraction variables in 268 free-text reports. Our algorithm achieved accuracy rates between 61.2% and 99.0%. Of the 11 data fields, 8 met the acceptable accuracy standard, while the other 3 had accuracy rates between 61.2% and 89.7%. Notably, the rule-based approach proved more effective and robust in extracting the variables of interest. The ML/DL models, on the other hand, had poorer predictive performance due to the highly imbalanced data distribution and the variable writing styles between different reports and the data used for domain-specific pre-trained models.
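
To make the rule-based idea concrete, here is a small sketch of a pattern-matching extractor. The abstract does not list the 11 variables or the rules used, so the Gleason score field, the regular expression and the function name are all hypothetical, chosen only because Gleason scoring is a familiar element of prostate histopathology reports.

```python
import re

# Hypothetical rule: matches phrases such as "Gleason score: 3 + 4" or "Gleason 4+3".
GLEASON_PATTERN = re.compile(
    r"gleason\s+(?:score|grade)?\s*[:=]?\s*(\d)\s*\+\s*(\d)",
    re.IGNORECASE,
)


def extract_gleason(text):
    """Return primary, secondary and total Gleason values, or None if absent."""
    match = GLEASON_PATTERN.search(text)
    if match is None:
        return None
    primary, secondary = int(match.group(1)), int(match.group(2))
    return {"primary": primary, "secondary": secondary, "total": primary + secondary}


report = "Prostate, biopsy: adenocarcinoma, Gleason score: 3 + 4 = 7."
result = extract_gleason(report)
```

A rule like this needs no training data, which may be one reason such rules hold up on rare values where a classifier sees too few examples; the trade-off is that every new phrasing has to be added by hand.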

CONCLUSIONS

We designed an NLP algorithm that can automate clinical information extraction accurately from histopathology reports with an overall average micro accuracy of 93.3%. The algorithm can be modified and continuously improved to extract new and existing variables.
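
The abstract does not define "micro accuracy". Under the usual meaning of micro-averaging, all individual extraction decisions are pooled before dividing, so fields with more annotated values weigh more. The sketch below contrasts that with a macro average; the field names and counts are invented for illustration and are not results from the study.

```python
def micro_accuracy(counts):
    """Pool correct and total counts across fields, then divide once."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct / total


def macro_accuracy(counts):
    """Average the per-field accuracies, weighting every field equally."""
    per_field = [c / t for c, t in counts.values()]
    return sum(per_field) / len(per_field)


# Illustrative (correct, total) counts per field; not data from the paper.
counts = {"field_a": (90, 100), "field_b": (5, 10)}
micro = micro_accuracy(counts)  # (90 + 5) / (100 + 10)
macro = macro_accuracy(counts)  # (0.9 + 0.5) / 2
```

When fields differ in how often they are populated, the two averages can diverge noticeably, so it matters which one a headline figure reports.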

Publisher

JMIR Publications Inc.
