Developing Clinical Prediction Models Using Primary Care Electronic Health Record Data: The Impact of Data Preparation Choices on Model Performance

Published: 2022-06-02 Volume: 2
ISSN: 2674-1199
Container-title: Frontiers in Epidemiology
Short-container-title: Front. Epidemiol.

Author:

van Os Hendrikus J. A., Kanning Jos P., Wermer Marieke J. H., Chavannes Niels H., Numans Mattijs E., Ruigrok Ynte M., van Zwet Erik W., Putter Hein, Steyerberg Ewout W., Groenwold Rolf H. H.

Abstract

Objective: To quantify prediction model performance in relation to data preparation choices when using electronic health records (EHR).

Study Design and Setting: Cox proportional hazards models were developed for predicting the first-ever main adverse cardiovascular events using Dutch primary care EHR data. The reference model was based on a 1-year run-in period, cardiovascular events were defined based on both EHR diagnosis and medication codes, and missing values were multiply imputed. We compared data preparation choices based on (i) length of the run-in period (2- or 3-year run-in); (ii) outcome definition (EHR diagnosis codes or medication codes only); and (iii) methods addressing missing values (mean imputation or complete case analysis) by making variations on the derivation set and testing their impact in a validation set.

Results: We included 89,491 patients in whom 6,736 first-ever main adverse cardiovascular events occurred during a median follow-up of 8 years. Outcome definition based only on diagnosis codes led to a systematic underestimation of risk (calibration curve intercept: 0.84; 95% CI: 0.83–0.84), while complete case analysis led to overestimation (calibration curve intercept: −0.52; 95% CI: −0.53 to −0.51). Differences in the length of the run-in period showed no relevant impact on calibration and discrimination.

Conclusion: Data preparation choices regarding outcome definition or methods to address missing values can have a substantial impact on the calibration of predictions, hampering reliable clinical decision support. This study further illustrates the urgency of transparent reporting of modeling choices in an EHR data setting.
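The sign convention of the calibration intercepts reported above can be illustrated with a small sketch. It is not the authors' method: the study used Cox models on time-to-event data, whereas this example assumes a simplified binary outcome and estimates a calibration-in-the-large intercept (slope fixed at 1) by Newton's method. The function name `calibration_intercept` and the toy data are hypothetical. Under this assumption, a positive intercept means events occur more often than predicted (risk is underestimated) and a negative intercept means the opposite.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def calibration_intercept(predicted, observed, tol=1e-10, max_iter=100):
    """Estimate a in logit(P(y = 1)) = a + logit(predicted), slope fixed at 1.

    Simplified binary-outcome illustration (an assumption of this sketch);
    the survival setting of the study needs a time-to-event analogue.
    """
    offsets = [logit(p) for p in predicted]
    a = 0.0
    for _ in range(max_iter):
        fitted = [expit(a + o) for o in offsets]
        # Score and information of the intercept-only logistic model with offset.
        score = sum(y - f for y, f in zip(observed, fitted))
        info = sum(f * (1.0 - f) for f in fitted)
        step = score / info
        a += step
        if abs(step) < tol:
            break
    return a


# Toy data: the model predicts 10% risk for everyone, but 20% have the event.
under = calibration_intercept([0.1] * 100, [1] * 20 + [0] * 80)
# Toy data: the model predicts 20% risk for everyone, but only 10% have the event.
over = calibration_intercept([0.2] * 100, [1] * 10 + [0] * 90)
print(under > 0, over < 0)  # positive: risk underestimated; negative: overestimated
```

With a constant predicted risk, the estimate reduces to the difference between the log-odds of the observed event rate and the log-odds of the predicted risk, which makes the direction of miscalibration easy to check by hand.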

Funder

Hartstichting

ZonMw

Hersenstichting

European Commission

Publisher

Frontiers Media SA

Cited by 2 articles.

1. Data Resource Profile: Extramural Leiden University Medical Center Academic Network (ELAN);International Journal of Epidemiology;2024-06-12

2. Prediction of aneurysmal subarachnoid hemorrhage in comparison with other stroke types using routine care data;PLOS ONE;2024-05-31