Abstract
Background: The Fontan operation palliates single ventricle heart defects and is associated with significant morbidity and premature mortality. Native anatomy varies; thus, Fontan cases cannot always be identified by International Classification of Diseases, Ninth and Tenth Revision, Clinical Modification (ICD-9-CM and ICD-10-CM) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing (NLP)-based machine learning (ML) models, which use patients' free-text notes, to automatically detect Fontan cases, and to compare their performance with ICD-code-based classification.
Methods and Results: We included free-text notes of 10,935 manually validated patients, of whom 778 (7.1%) were Fontan and 10,157 (92.9%) non-Fontan patients, from two large, diverse healthcare systems. Using 5-fold cross-validation, we trained and evaluated multiple ML models, namely support vector machines (SVM) and a transformer-based model for language understanding named RoBERTa (2 versions), for automatically identifying Fontan cases based on free-text notes. To optimize classifier performance, we experimented with different text representation techniques, including a sliding window strategy to overcome the length limit imposed by RoBERTa. We compared the performance of the ML models to ICD-code-based classification using the F1 score metric. The ICD classification model, SVM, and RoBERTa achieved F1 scores of 0.81 (95% CI: 0.79-0.83), 0.95 (95% CI: 0.92-0.97), and 0.89 (95% CI: 0.88-0.85) for the positive (Fontan) class, respectively. SVM obtained the best performance (p<0.05), and both NLP models outperformed ICD-code-based classification (p<0.05). The novel sliding window strategy improved performance over the base RoBERTa model (p<0.05) but did not outperform SVM. ICD-code-based classification tended to have more false positives compared to both NLP models.
Conclusions: Our proposed NLP models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes. Since the sensitivity of ICD codes is high but the positive predictive value is low, it may be beneficial to apply ICD codes as a filter prior to applying NLP/ML to achieve optimal performance.
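The sliding window strategy mentioned in the Methods and Results can be illustrated with a minimal Python sketch: a long token sequence is split into overlapping windows that each fit a fixed length limit, every window is scored separately, and the per-window scores are combined into one patient-level decision. The window size, stride, max-score aggregation rule, decision threshold, and all function names below are assumptions made for illustration; the abstract does not state how the windows were built or aggregated, and the keyword scorer is only a stand-in for a fine-tuned model.

```python
from typing import Callable, List, Sequence


def sliding_windows(tokens: Sequence[str], window: int = 512,
                    stride: int = 256) -> List[List[str]]:
    """Split tokens into overlapping windows of at most `window` tokens.

    `window` and `stride` are illustrative defaults, not values from the study.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride must not exceed window, or tokens are skipped")
    if not tokens:
        return []
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the note
        start += stride
    return windows


def classify_document(tokens: Sequence[str],
                      score_window: Callable[[Sequence[str]], float],
                      window: int = 512, stride: int = 256,
                      threshold: float = 0.5) -> bool:
    """Label a note positive if any window scores at or above `threshold`.

    Max-aggregation is an assumed rule; averaging or voting are alternatives.
    """
    scores = [score_window(w) for w in sliding_windows(tokens, window, stride)]
    return bool(scores) and max(scores) >= threshold


def toy_scorer(window_tokens: Sequence[str]) -> float:
    """Hypothetical stand-in for a model: 1.0 if 'fontan' appears, else 0.0."""
    return 1.0 if "fontan" in (t.lower() for t in window_tokens) else 0.0


if __name__ == "__main__":
    note = ("patient seen in clinic " * 200 + "status post Fontan palliation").split()
    print(len(sliding_windows(note)), classify_document(note, toy_scorer))
```

Overlapping windows reduce the chance that a relevant phrase is cut in half at a window boundary, at the cost of scoring some tokens more than once.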
Publisher
Cold Spring Harbor Laboratory