Abstract
PURPOSE: Cancer registries are important sources of real-world data (RWD) that reveal insights into practice patterns and cancer patient outcomes, but the prevalence of missing data can be high. Machine learning (ML) imputation methods can be applied to large RWD sets, but the performance of these approaches within cancer registries is unclear.
METHODS: We identified non-small cell lung cancer (NSCLC) patients within the National Cancer Database diagnosed in 2014 with complete data in 19 variables of known clinical and prognostic significance. We generated synthetic missing data for each variable, then performed imputation using substitution (control) and five different ML approaches. Imputation efficacy was measured by normalized root-mean-square error (RMSE) for continuous variables and proportion of falsely classified entries (PFC) for categorical variables. We also measured algorithm runtimes and the impact of incorporating imputed values on survival modeling.
RESULTS: A total of 50,790 NSCLC patients were included in this study, with 81 features for each patient after data preprocessing. Among the tested ML methods, SoftImpute had the lowest RMSE (best performance) for continuous variables, ranging from 0.071 to 0.080 for 10% to 50% missing data, and MissForest had the lowest PFC (best performance) for categorical variables, ranging from 0.251 to 0.311 for 10% to 50% missing data. SoftImpute had a runtime of 3.28 × 10⁻⁴ seconds per patient record, and MissForest averaged 2.96 × 10⁻³ seconds per patient record. Deep learning imputation using a denoising autoencoder did not achieve improved performance despite higher algorithm runtimes. Cox models incorporating ML-imputed data achieved similar C-indices, ranging from 0.787 to 0.801 across all ML methods tested.
CONCLUSION: ML imputation achieved promising performance for NSCLC patients within a large national cancer registry.
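The evaluation described in METHODS (mask known values, impute them, then score the imputed entries against the truth) can be illustrated with a minimal sketch. It is not the authors' code. The function names, the toy `age` and `stage` columns, the missing-completely-at-random masking, and the choice to normalize RMSE by the variance of the true values are all assumptions made for illustration; the abstract does not state the exact normalization used. Only the substitution baseline is shown, with the mean filling continuous values and the mode filling categorical ones.

```python
import math
import random
from collections import Counter
from statistics import mean, pvariance


def mask_at_random(n_rows, fraction, seed=0):
    """Choose row indices to blank out (assumes values are missing completely at random)."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), round(n_rows * fraction)))


def normalized_rmse(true_vals, imputed_vals):
    """RMSE on the masked entries, normalized by the variance of the true values (assumed convention)."""
    mse = mean([(t - i) ** 2 for t, i in zip(true_vals, imputed_vals)])
    return math.sqrt(mse / pvariance(true_vals))


def pfc(true_vals, imputed_vals):
    """Proportion of masked categorical entries whose imputed label differs from the truth."""
    wrong = sum(t != i for t, i in zip(true_vals, imputed_vals))
    return wrong / len(true_vals)


# Hypothetical toy columns, not registry data.
age = [52.0, 61.0, 70.0, 58.0, 66.0, 74.0, 49.0, 80.0, 63.0, 67.0]
stage = ["I", "II", "IV", "I", "III", "IV", "I", "IV", "III", "IV"]

# Generate synthetic missingness in 30% of rows.
masked = mask_at_random(len(age), fraction=0.3)
observed = [i for i in range(len(age)) if i not in masked]

# Substitution baseline: fill with the observed mean (continuous) or mode (categorical).
age_fill = mean([age[i] for i in observed])
stage_fill = Counter(stage[i] for i in observed).most_common(1)[0][0]

# Score only the entries that were masked, since those are the ones that were imputed.
age_score = normalized_rmse([age[i] for i in masked], [age_fill] * len(masked))
stage_score = pfc([stage[i] for i in masked], [stage_fill] * len(masked))
print(f"normalized RMSE (age): {age_score:.3f}")
print(f"PFC (stage): {stage_score:.3f}")
```

Under this assumed normalization, filling every masked entry with the mean of the true masked values gives a score of exactly 1, so values well below 1 indicate an imputer that recovers real structure. For both metrics, lower is better, which matches how the abstract reports them.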
Publisher
Cold Spring Harbor Laboratory