Supervised Text Classification System Detects Fontan Patients in Electronic Records With Higher Accuracy Than ICD Codes

Published:2023-07-04 Issue:13 Volume:12 Page:
ISSN:2047-9980
Container-title:Journal of the American Heart Association
language:en
Short-container-title:JAHA

Author:

Guo Yuting¹, Al‐Garadi Mohammed A.², Book Wendy M.³⁴, Ivey Lindsey C.⁴, Rodriguez Fred H.³, Raskind‐Hood Cheryl L.⁴, Robichaux Chad¹, Sarker Abeed¹

Affiliation:

1. Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, GA

2. Vanderbilt University Medical Center, Vanderbilt University, Nashville, TN

3. Department of Cardiology, School of Medicine, Emory University, Atlanta, GA

4. Rollins School of Public Health, Emory University, Atlanta, GA

Abstract

Background: The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing–based machine learning models to automatically detect Fontan cases from free texts in electronic health records, and compare their performances with ICD code–based classification.

Methods and Results: We included free‐text notes of 10 935 manually validated patients, 778 (7.1%) Fontan and 10 157 (92.9%) non‐Fontan, from 2 health care systems. Using 80% of the patient data, we trained and optimized multiple machine learning models, support vector machines and 2 versions of RoBERTa (a robustly optimized transformer‐based model for language understanding), for automatically identifying Fontan cases based on notes. For RoBERTa, we implemented a novel sliding window strategy to overcome its length limit. We evaluated the machine learning models and ICD code–based classification on 20% of the held‐out patient data using the F1 score metric. The ICD classification model, support vector machine, and RoBERTa achieved F1 scores of 0.81 (95% CI, 0.79–0.83), 0.95 (95% CI, 0.92–0.97), and 0.89 (95% CI, 0.88–0.85) for the positive (Fontan) class, respectively. Support vector machines obtained the best performance (P<0.05), and both natural language processing models outperformed ICD code–based classification (P<0.05). The sliding window strategy improved performance over the base model (P<0.05) but did not outperform support vector machines. ICD code–based classification produced more false positives.

Conclusions: Natural language processing models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes, and the former demonstrated the possibility of further improvement.
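The abstract does not describe how the sliding window strategy or the F1 evaluation were implemented, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a note is already tokenized, that a hypothetical `score_chunk` function returns a Fontan probability for one chunk, and that chunk scores are combined by taking the maximum; the window size, stride, threshold, and pooling rule are all illustrative assumptions.

```python
from typing import Callable, List, Sequence


def sliding_windows(tokens: Sequence[str], window: int = 512,
                    stride: int = 256) -> List[List[str]]:
    """Split a token sequence into overlapping chunks of at most `window` tokens."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    chunks = []
    start = 0
    while True:
        chunks.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):  # this chunk reached the end of the note
            break
        start += stride
    return chunks


def classify_long_text(tokens: Sequence[str],
                       score_chunk: Callable[[List[str]], float],
                       window: int = 512, stride: int = 256,
                       threshold: float = 0.5) -> bool:
    """Label a long note positive if any chunk scores at or above `threshold`.

    Max pooling over chunk scores is an assumption made for this sketch.
    """
    scores = [score_chunk(chunk) for chunk in sliding_windows(tokens, window, stride)]
    return max(scores) >= threshold


def f1_positive(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for the positive class (label 1), from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In practice `score_chunk` would wrap a fine-tuned classifier; here any callable that maps a list of tokens to a number in [0, 1] can stand in for it.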

Publisher

Ovid Technologies (Wolters Kluwer Health)

Subject

Cardiology and Cardiovascular Medicine

Cited by 3 articles.

1. Early diagnosis of HIV cases by means of text mining and machine learning models on clinical notes;Computers in Biology and Medicine;2024-09

2. Evaluating large language models for health-related text classification tasks with public social media data;Journal of the American Medical Informatics Association;2024-08-09

3. Adoption of network and plan-do-check-action in the international classification of disease 10 coding;World Journal of Clinical Cases;2024-07-06