New Dimensions of Public Health and Person-Centric Care in One Health System through Self-learning of a Population’s Pathology Records: Automating an Annotation-Free Natural Language Processing Pipeline and an Example (Preprint)

Authors:

Guan Jingjing, Leung Eman, Ching Chun Cheung, He Yinan, Yau Sarah TY, Huang Junjie, Tang Raymond SY, Lam Thomas YT, Wong Martin Chi-sang, Lee Albert, Yeoh Eng-kiong

Abstract

BACKGROUND

Enabling a health system to learn from its historical and emerging data is a primary focus of medical artificial intelligence (AI) research. Although digital pathology (DP) has not attracted as much attention as clinical radiology and hospitalization research, its semistructured data have driven the use of natural language processing (NLP) to extract codable insights from text. However, obtaining high-quality annotated samples still depends on predefined templates or human annotators, which has become a bottleneck for automation. We observed that morphology electronic health records (EHR) have long been undervalued, despite their potential to supply high-quality labels and to serve as a stepping stone toward automated AI and a self-learning health system.

OBJECTIVE

We aimed to develop an annotation-free NLP pipeline, with appropriate human control, that (1) automatically derives the precise codes already assigned by a health system’s pathologists, (2) preprocesses the text, (3) constructs machine learning classifiers to annotate text with clinically precise codes, and (4) enables system-wide application of the pipeline to investigate historical data and enhance health information, promotion, and communication.
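As a rough illustration of the label-derivation idea in this objective, the sketch below sorts cases into precisely coded, ambiguously coded, and unlabeled groups using an existing morphology code, with no new human annotation. The code values (`HIGH_GRADE_DYSPLASIA` and so on), the function name `partition_case`, the keyword check, and the choice of low-grade dysplasia as the negative label are all hypothetical placeholders introduced for illustration; the abstract does not specify the actual coding scheme or rules.

```python
from typing import Optional, Tuple

# Hypothetical placeholder code values; the real morphology coding
# scheme used by the health system is not given in the abstract.
PRECISE_CODES = {
    "HIGH_GRADE_DYSPLASIA": 1,  # positive label (stated in the abstract)
    "LOW_GRADE_DYSPLASIA": 0,   # negative label (an assumption)
}
AMBIGUOUS_CODES = {"DYSPLASIA_UNGRADED"}


def partition_case(
    morphology_code: Optional[str], report_text: str
) -> Tuple[str, Optional[int]]:
    """Return (group, label) for one case without any manual annotation.

    group is one of "precise", "ambiguous", "unlabeled", "out_of_scope";
    label is 1 or 0 for precisely coded cases and None otherwise.
    """
    if morphology_code in PRECISE_CODES:
        return "precise", PRECISE_CODES[morphology_code]
    if morphology_code in AMBIGUOUS_CODES:
        return "ambiguous", None
    # No usable code: keep the case only if the text mentions dysplasia
    # (a simple keyword check, assumed here for illustration).
    if "dysplasia" in report_text.lower():
        return "unlabeled", None
    return "out_of_scope", None


if __name__ == "__main__":
    print(partition_case("HIGH_GRADE_DYSPLASIA", "Tubular adenoma."))
    print(partition_case(None, "Adenoma with dysplasia, grade not stated."))
```

Under these assumptions, only the "precise" group would feed classifier training, while the other groups would be left for the trained classifier to annotate.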

METHODS

Using colorectal dysplasia as an example, we developed the NLP pipeline with the EHR of a population who underwent baseline colorectal procedures in Hong Kong’s public health system between 2000 and 2018 when aged 50 to 75 years. The high-quality morphology codes were those for precisely graded dysplasia, with high-grade dysplasia serving as the positive label. After identifying precisely coded, ambiguously coded, and unlabeled cases in the EHR, we standardized the text before feeding it into a bidirectional long short-term memory neural network classifier. Our experimental design examined three factors: the unit of text analysis (report-based or episode-based), active learning with text curation, and the minimum sample size required to train an accurate classifier. Model performance was measured on testing and validation sets by the area under the receiver operating characteristic curve (AUC).
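For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how an AUC can be computed from binary labels and classifier scores using the rank-sum (Mann-Whitney U) identity. It is a generic illustration, not the authors' evaluation code; the function name `roc_auc` and the toy inputs are assumptions.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC via the rank-sum identity; tied scores get average ranks."""
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    # Rank cases by ascending score (1-based), averaging ranks over ties.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1

    # U statistic for the positive class, normalized to [0, 1].
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = pos_rank_sum - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy example: 1 = high-grade dysplasia, 0 = otherwise.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The value can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as half.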

RESULTS

More than 35% of eligible text reports mentioned dysplasia. Precisely graded dysplasia had a low prevalence among the morphology codes. Still, the NLP pipeline identified more than 10,000 cases of high-grade dysplasia, which supplied enough positive cases to demonstrate the efficacy of the proposed pipeline. All testing AUCs for report-based active learning with text curation exceeded 0.88. With active learning and text curation, a sample size of 200 or more could achieve testing AUCs of 0.95. With other factors held constant, validation AUCs were lower than testing AUCs, suggesting that ambiguously labeled cases tended to have more complex original text.

CONCLUSIONS

We demonstrated the feasibility, novel performance, and applications of automating an annotation-free NLP pipeline at the system level. Our interdisciplinary pipeline could serve as a standard approach for a health system to realize self-learning from semistructured pathology EHR, oriented toward precision public health and better person-centric care.

Publisher

JMIR Publications Inc.
