Affiliation:
1. Department of Biostatistics, Columbia University, New York, NY 10032, United States
2. Department of Medicine, Columbia University, New York, NY 10032, United States
Abstract
Objective
Electronic health records (EHRs) provide opportunities for the development of computable predictive tools. Conventional machine learning and deep learning methods have been widely used for this task, usually with one tool designed for one clinical outcome. Here we developed PheW2P2V, a Phenome-Wide prediction framework using Weighted Patient Vectors. PheW2P2V makes tailored predictions for phenotypes across the phenome using numeric representations of patients’ past medical records, weighted by their similarity to each individual phenotype.
Materials and Methods
PheW2P2V defines clinical disease phenotypes using Phecode mapping based on International Classification of Diseases codes, which reduces redundancy and case-control misclassification in real-life EHR datasets. By upweighting the medical records that are more relevant to a phenotype of interest when calculating patient vectors, PheW2P2V achieves tailored incidence risk prediction for that phenotype. The calculation of weighted patient vectors is computationally efficient, and the weighting mechanism ensures tailored predictions across the phenome. We evaluated the prediction performance of PheW2P2V and baseline methods in simulation studies and in clinical applications using the MIMIC-III database.
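The weighting idea can be illustrated with a minimal sketch. The abstract does not give the exact weighting function, so the example below assumes cosine similarity between concept embeddings followed by a softmax. The function names, the `temperature` parameter, and the toy embeddings are hypothetical and are not taken from PheW2P2V itself.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def weighted_patient_vector(history, target, embeddings, temperature=1.0):
    """Return (patient_vector, weights) for one patient and one target phenotype.

    history     -- list of medical concept codes from the patient's past records
    target      -- code of the phenotype to be predicted
    embeddings  -- dict mapping each code to its embedding (list of floats)
    temperature -- hypothetical knob: smaller values concentrate weight on
                   the records most similar to the target
    """
    codes = [c for c in history if c in embeddings]
    if not codes:
        raise ValueError("no embedded concepts in the patient's history")

    # Assumed weighting scheme: similarity to the target, normalized by softmax.
    sims = [cosine(embeddings[c], embeddings[target]) for c in codes]
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    weights = [e / total for e in exps]

    # The patient vector is the weighted average of the record embeddings.
    dim = len(embeddings[target])
    vector = [0.0] * dim
    for w, c in zip(weights, codes):
        for i, x in enumerate(embeddings[c]):
            vector[i] += w * x
    return vector, weights


# Toy two-dimensional embeddings, invented for illustration only.
toy_embeddings = {"A": [1.0, 0.0], "B": [0.0, 1.0], "T": [1.0, 0.0]}
vector, weights = weighted_patient_vector(["A", "B"], "T", toy_embeddings)
```

Because the weights depend on the target phenotype, the same patient history yields a different vector for each phenotype. The resulting vectors could then be passed to any downstream classifier.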
Results
Across 942 phenome-wide predictions using the MIMIC-III database, PheW2P2V has a median area under the receiver operating characteristic curve (AUC-ROC) of 0.74 (baseline methods: ≤0.72), a median max F1-score of 0.20 (baseline methods: ≤0.19), and a median area under the precision-recall curve (AUC-PR) of 0.10 (baseline methods: ≤0.10).
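For readers unfamiliar with these three metrics, the sketch below shows one common way to compute them for a single phenotype. It is an illustration, not the authors' evaluation code: AUC-PR is approximated here by average precision, which may differ from the estimator used in the paper, and all function names and the toy data are hypothetical.

```python
def auc_roc(y_true, scores):
    """Probability that a random case scores above a random control (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def precision_recall_points(y_true, scores):
    """(precision, recall) at each distinct score threshold, highest score first."""
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one case")
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = []
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last item of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((tp / (tp + fp), tp / n_pos))
    return points


def max_f1(y_true, scores):
    """Largest F1-score over all score thresholds."""
    return max(
        (2 * p * r / (p + r)) if (p + r) > 0 else 0.0
        for p, r in precision_recall_points(y_true, scores)
    )


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    area, previous_recall = 0.0, 0.0
    for p, r in precision_recall_points(y_true, scores):
        area += (r - previous_recall) * p
        previous_recall = r
    return area


# Toy labels and scores, invented for illustration only.
y = [1, 0, 1, 0, 0]
s = [0.9, 0.8, 0.7, 0.3, 0.1]
roc, f1, ap = auc_roc(y, s), max_f1(y, s), average_precision(y, s)
```

When reading AUC-PR values, keep in mind that a classifier with uninformative scores has an expected AUC-PR roughly equal to the phenotype's prevalence, so low absolute values are typical for rare phenotypes.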
Discussion
PheW2P2V can predict phenotypes efficiently by using medical concept embeddings and upweighting relevant past medical histories. By leveraging both labeled and unlabeled data, PheW2P2V reduces overfitting and improves predictions for rare phenotypes, making it a useful screening tool for early diagnosis of high-risk conditions. Further research is needed to assess the transferability of embeddings across different databases.
Conclusions
PheW2P2V is fast and flexible, and it outperforms several popular baseline methods in prediction performance for many clinical disease phenotypes across the phenome of the MIMIC-III database.
Funder
National Library of Medicine
Publisher
Oxford University Press (OUP)