New Dimensions of Public Health and Person-Centric Care in One Health System through Self-learning of a Population’s Pathology Records: Automating an Annotation-Free Natural Language Processing Pipeline and an Example (Preprint)

Published: 2023-05-17

Author:

Guan Jingjing, Leung Eman, Ching Chun Cheung, He Yinan, Yau Sarah TY, Huang Junjie, Tang Raymond SY, Lam Thomas YT, Wong Martin Chi-sang, Lee Albert, Yeoh Eng-kiong

Abstract

BACKGROUND

Enabling a health system to learn from its historical and emerging data is a primary focus of medical AI research. Although digital pathology (DP) has not gained the same popularity as clinical radiology and hospitalization research, its semistructured data have driven natural language processing (NLP) to reveal codable insights from textual data. However, obtaining high-quality annotated samples has so far depended on predefined templates or human annotators, which has become a bottleneck for automation. We noticed the long-standing underuse of morphology electronic health records (EHR) and their potential to supply high-quality labels and to serve as a stepping stone toward automated AI and a self-learning system.

OBJECTIVE

To develop an annotation-free NLP pipeline with proper human control that (1) automatically derives the precise codes already annotated by a health system’s pathologists, (2) preprocesses the text, (3) constructs machine learning classifiers to annotate text with clinically precise codes, and (4) enables system-wide application of the pipeline to investigate historical data and enhance health information, promotion, and communication.

METHODS

Using colorectal dysplasia as an example, we developed the NLP pipeline with the EHR of a population who attended baseline colorectal procedures in Hong Kong’s public health system between 2000 and 2018 at ages 50-75 years. The high-quality morphology codes were those for precisely graded dysplasia, with high-grade dysplasia serving as the positive label. After identifying precisely coded, ambiguously coded, and unlabeled cases from the EHR, we standardized the textual data before feeding them into a bidirectional long short-term memory (BiLSTM) neural network classification model. Our experimental design examined three factors: the unit of text analysis (report-based or episode-based), active learning with text curation, and the minimum sample size required to train an accurate classifier. Model performance was measured in testing and validation sets by the area under the receiver operating characteristic curve (AUC).
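As a rough illustration of the case-identification step, the sketch below splits records into precisely coded, ambiguously coded, and unlabeled groups. The code strings (`DYSPLASIA_HIGH`, `DYSPLASIA_LOW`, `DYSPLASIA_NOS`), the `Report` record, and the `partition` function are hypothetical placeholders; the abstract does not state the actual morphology codes or data schema, and treating any other code as unlabeled is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical code values; the study's real morphology codes are not given.
HIGH_GRADE = "DYSPLASIA_HIGH"
LOW_GRADE = "DYSPLASIA_LOW"
UNGRADED = "DYSPLASIA_NOS"


@dataclass
class Report:
    text: str
    morphology_code: Optional[str]  # None when no code was recorded


def partition(
    reports: List[Report],
) -> Tuple[List[Tuple[str, int]], List[str], List[str]]:
    """Split reports into precise (text, label), ambiguous, and unlabeled groups."""
    precise, ambiguous, unlabeled = [], [], []
    for r in reports:
        if r.morphology_code in (HIGH_GRADE, LOW_GRADE):
            # High-grade dysplasia is the positive label, as in the abstract.
            precise.append((r.text, int(r.morphology_code == HIGH_GRADE)))
        elif r.morphology_code == UNGRADED:
            # Dysplasia is coded but its grade is not: ambiguous.
            ambiguous.append(r.text)
        else:
            # Assumption: a missing or unrelated code means unlabeled.
            unlabeled.append(r.text)
    return precise, ambiguous, unlabeled
```

Only the precisely coded group carries labels, so in a setup like this it would be the natural source of training examples without any manual annotation.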

RESULTS

More than 35% of eligible text reports mentioned dysplasia. Precisely graded dysplasia had a low prevalence among morphology codes. Still, the NLP pipeline identified more than 10,000 cases of high-grade dysplasia, which supplied enough positive cases to demonstrate the efficacy of the proposed pipeline. All testing AUCs of report-based active learning with text curation exceeded 0.88. A sample size of 200 or more could secure testing AUCs of 0.95 under active learning with text curation. With other factors held the same, validation AUCs were lower than testing AUCs, suggesting that ambiguously labeled cases likely had more complex original text.
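For readers less familiar with the metric, AUC can be read as the probability that a randomly chosen positive case receives a higher classifier score than a randomly chosen negative case. The following is a minimal pure-Python sketch of that definition; the function name `roc_auc` and the toy scores are illustrative assumptions, not the study's evaluation code.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the share of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 0.75
```

The pairwise loop is quadratic in the number of cases, which is fine for illustration; a rank-based computation would be preferable at the scale of a population's records.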

CONCLUSIONS

We demonstrated the feasibility, novel performance, and applications of automating annotation-free NLP pipelines at a system level. Our interdisciplinary pipeline can serve as a formal, standard approach for a health system to realize self-learning from semistructured pathology EHR, oriented toward precision public health and better person-centric care.

Publisher

JMIR Publications Inc.
