Comparison of machine-learning and logistic regression models for prediction of 30-day unplanned readmission in electronic health records: A development and validation study

Published: 2024-08-20 Issue: 8 Volume: 3 Page: e0000578
ISSN: 2767-3170
Container-title: PLOS Digital Health
Language: en
Short-container-title: PLOS Digit Health

Author:

Iwagami Masao, Inokuchi Ryota, Kawakami Eiryo, Yamada Tomohide, Goto Atsushi, Kuno Toshiki, Hashimoto Yohei, Michihata Nobuaki, Goto Tadahiro, Shinozaki Tomohiro, Sun Yu, Taniguchi Yuta, Komiyama Jun, Uda Kazuaki, Abe Toshikazu, Tamiya Nanako

Abstract

Machine-learning models are expected to outperform regression models such as logistic regression (LR), especially as the number and types of predictor variables in electronic health records (EHRs) increase, but whether they do is unknown. We aimed to compare the predictive performance of gradient-boosted decision tree (GBDT), random forest (RF), deep neural network (DNN), and LR with the least absolute shrinkage and selection operator (LR-LASSO) for unplanned readmission. We used EHRs of patients discharged alive from 38 hospitals, with discharges in 2015–2017 for derivation and in 2018 for validation. The data included basic characteristics; diagnosis, surgery, procedure, and drug codes; and blood-test results. The outcome was 30-day unplanned readmission. We created six patterns of data tables that differed in the number of binary variables (those present in ≥5% of patients, ≥1% of patients, or ≥10 patients), each with and without blood-test results. For each pattern, we used the derivation data to fit the machine-learning and LR models and the validation data to evaluate the performance of each model. The incidence of the outcome was 6.8% (23,108/339,513 discharges) in the derivation dataset and 6.4% (7,507/118,074 discharges) in the validation dataset. For the first data table, with the smallest number of variables (102 variables present in ≥5% of patients, without blood-test results), the c-statistic was highest for GBDT (0.740), followed by RF (0.734), LR-LASSO (0.720), and DNN (0.664). For the last data table, with the largest number of variables (1,543 variables present in ≥10 patients, including blood-test results), the c-statistic was highest for GBDT (0.764), followed by LR-LASSO (0.755), RF (0.751), and DNN (0.720). In this table the difference between GBDT and LR-LASSO was small, and their 95% confidence intervals overlapped.
In conclusion, GBDT generally outperformed LR-LASSO in predicting unplanned readmission, but the difference in c-statistic narrowed as the number of variables increased and blood-test results were added.
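
Two steps described in the abstract, keeping binary variables that meet a prevalence threshold and scoring a model by its c-statistic, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `select_by_prevalence` and `c_statistic`, the dictionary-of-columns layout, and the toy values are all assumptions made for illustration. The c-statistic here is computed by direct pairwise comparison, which is fine for a small example but too slow for datasets of the size reported above.

```python
from typing import Dict, List, Sequence


def select_by_prevalence(
    columns: Dict[str, Sequence[int]],
    min_fraction: float = 0.0,
    min_count: int = 0,
) -> List[str]:
    """Return names of binary (0/1) columns that meet both prevalence thresholds.

    Hypothetical helper: e.g. min_fraction=0.05 keeps variables present in
    >=5% of patients, and min_count=10 keeps variables present in >=10 patients.
    """
    kept = []
    for name, values in columns.items():
        positives = sum(values)
        if positives >= min_count and positives >= min_fraction * len(values):
            kept.append(name)
    return kept


def c_statistic(y_true: Sequence[int], y_score: Sequence[float]) -> float:
    """C-statistic (area under the ROC curve) by pairwise comparison.

    It is the fraction of (case, non-case) pairs in which the case has the
    higher predicted score, with tied scores counted as half a pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Toy data (invented): three binary variables for four discharges.
toy_columns = {
    "common_code": [1, 1, 0, 0],
    "rare_code": [1, 0, 0, 0],
    "absent_code": [0, 0, 0, 0],
}
print(select_by_prevalence(toy_columns, min_fraction=0.5))  # ['common_code']
print(c_statistic([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))     # 0.75
```

In the toy call, three of the four (case, non-case) pairs are ranked correctly, which gives 0.75. On real data the same quantity would be computed on the validation set for each model and data table, typically with a library routine rather than a nested loop.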

Funder

Foundation for Promotion of Material Science and Technology of Japan

Publisher

Public Library of Science (PLoS)
