BACKGROUND
Enabling a health system to learn from its historical and emerging data is a primary focus of medical AI research. Although digital pathology (DP) has not gained the same popularity as clinical radiology and hospitalization research, its semistructured data have driven natural language processing (NLP) work that reveals codable insights from text. However, obtaining high-quality annotated samples has so far depended on predefined templates or human annotators, which has become a bottleneck for automation. We noticed that the morphology codes in electronic health records (EHR) have long been underused, despite their potential to supply high-quality labels and to serve as a stepping stone toward automated AI and a self-learning system.
OBJECTIVE
To develop an annotation-free NLP pipeline, with proper human control, that (1) automatically derives the precise codes already assigned by a health system’s pathologists, (2) preprocesses the text, (3) constructs machine learning classifiers to annotate text with clinically precise codes, and (4) enables system-wide application of the pipeline to investigate historical data and enhance health information, promotion, and communication.
METHODS
Using colorectal dysplasia as an example, we developed the NLP pipeline using the EHR of a population aged 50-75 years who underwent baseline colorectal procedures in Hong Kong’s public health system between 2000 and 2018. The high-quality morphology codes were those for precisely graded dysplasia, with high-grade dysplasia serving as the positive label. After identifying precisely coded, ambiguously coded, and unlabeled cases in the EHR, we standardized the text before feeding it into a bidirectional long short-term memory neural network classification model. Our experimental design examined three factors: the unit of text analysis (report-based or episode-based), active learning with text curation, and the minimum sample size required to train an accurate classifier. Model performance was measured in testing and validation sets by the area under the receiver operating characteristic curve (AUC).
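As an illustration of the evaluation metric only, the AUC can be computed from binary labels and classifier scores through the rank-sum (Mann-Whitney U) identity. The sketch below is a minimal standard-library version and is not the study's code; the function name roc_auc and the toy labels and scores are assumptions made for this example, with 1 standing for the positive label (high-grade dysplasia).

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Hypothetical helper: AUC via the rank-sum (Mann-Whitney U) identity.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier outputs, where higher means "more likely positive".
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative case")

    # Rank all scores in ascending order; tied scores share their average rank.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    # U counts positive/negative pairs ranked correctly (ties count as half).
    positive_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u_statistic = positive_rank_sum - n_pos * (n_pos + 1) / 2
    return u_statistic / (n_pos * n_neg)


# Toy data, invented for illustration: three of four pos/neg pairs are ordered correctly.
toy_auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of positive from negative cases, which is the scale on which the testing and validation results are reported.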
RESULTS
More than 35% of eligible text reports mentioned dysplasia. Precisely graded dysplasia had a low prevalence among morphology codes. Still, the NLP pipeline identified more than 10,000 cases of high-grade dysplasia, which supplied enough positive cases to demonstrate the efficacy of the proposed pipeline. All testing AUCs of report-based active learning with text curation exceeded 0.88. With active learning and text curation, a sample size of 200 or more could secure testing AUCs of 0.95. Holding other factors constant, validation AUCs were lower than testing AUCs, suggesting that ambiguously labeled cases likely had more complex original text.
CONCLUSIONS
We demonstrated the feasibility, performance, and applications of automating annotation-free NLP pipelines at a system level. Our interdisciplinary pipeline could serve as a standard approach for a health system to realize self-learning from semistructured pathology EHR, oriented toward precision public health and better person-centric care.