BACKGROUND
A prolonged hospital stay drains a hospital's human and material resources and can have a deleterious psychological effect on the patient. Some patients are at greater risk of a prolonged stay than others, and it is important to identify them in the first days after admission so that appropriate care can be implemented as soon as possible and staffing and bed occupancy can be planned.
OBJECTIVE
The objective of this study is to optimize the prediction of prolonged length of hospital stay (LOS) by refining the selection of variables using an interpretable machine-learning algorithm.
METHODS
Deidentified patient administrative and clinical data from various sources are stored in our University Hospital’s Clinical Data Warehouse, which contains data from 134,840 adult patients with 273,693 hospitalizations between 2016 and 2018. We conducted a two-stage predictive modeling experiment. First, we used conventional clinical variables and composite variables (new variables formed by aggregating related conventional variables) in several machine-learning algorithms and selected the best-performing model. Next, we used the SHAP (SHapley Additive exPlanations) method to identify the most important predictive variables and used these to further improve the predictive model.
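The second stage can be illustrated with a minimal sketch, assuming that per-patient SHAP values have already been computed for the fitted model and that variables are ranked by their mean absolute SHAP value, a common convention that the abstract does not confirm. The function names, variable names, and toy numbers below are hypothetical and are not taken from the study.

```python
from typing import List, Sequence


def rank_by_mean_abs_shap(
    shap_values: Sequence[Sequence[float]],
    feature_names: Sequence[str],
) -> List[str]:
    """Return feature names sorted by decreasing mean |SHAP value|.

    shap_values holds one row per hospitalization and one column per
    variable (assumed to be precomputed for the fitted model).
    """
    n_rows = len(shap_values)
    if n_rows == 0:
        raise ValueError("shap_values must contain at least one row")
    totals = [0.0] * len(feature_names)
    for row in shap_values:
        if len(row) != len(feature_names):
            raise ValueError("each row needs one SHAP value per feature")
        for j, value in enumerate(row):
            totals[j] += abs(value)
    # Sort column indices by mean absolute contribution, largest first.
    order = sorted(
        range(len(feature_names)),
        key=lambda j: totals[j] / n_rows,
        reverse=True,
    )
    return [feature_names[j] for j in order]


def select_top_k(
    shap_values: Sequence[Sequence[float]],
    feature_names: Sequence[str],
    k: int,
) -> List[str]:
    """Keep the k variables with the largest mean |SHAP value|."""
    return rank_by_mean_abs_shap(shap_values, feature_names)[:k]


# Toy data: 3 hospitalizations x 4 hypothetical variables.
toy_names = ["age", "n_prior_admissions", "emergency_admission", "sodium"]
toy_shap = [
    [0.10, -0.40, 0.05, 0.00],
    [-0.20, 0.30, 0.02, 0.01],
    [0.15, -0.50, -0.03, 0.00],
]
selected = select_top_k(toy_shap, toy_names, k=2)
print(selected)
```

In a workflow of this kind, the model would then be refitted on the retained variables only and its performance compared with that of the full model.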
RESULTS
XGBoost combined with undersampling outperformed the other methods, with an AUC-ROC of 0.802 (95% CI: 0.801-0.803) and an F2 score of 0.533 (95% CI: 0.533-0.534). Predictive performance was equivalent when only half of the variables, selected by SHAP value, were retained, with an AUC-ROC of 0.804 (95% CI: 0.803-0.805) and an F2 score of 0.536 (95% CI: 0.535-0.536). This equivalence held even when SHAP-based selection reduced the number of variables by more than 70%, from 523 to 150.
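For readers unfamiliar with the F2 score, the sketch below shows the standard F-beta definition with beta = 2, which weights recall more heavily than precision. It assumes that this standard definition is the one reported here and that prolonged stays are coded as 1; the function name and toy labels are hypothetical.

```python
from typing import Sequence


def fbeta_score(
    y_true: Sequence[int],
    y_pred: Sequence[int],
    beta: float = 2.0,
) -> float:
    """F-beta for binary labels, where 1 marks the positive class.

    F_beta = (1 + beta^2) * TP / ((1 + beta^2) * TP + beta^2 * FN + FP)
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    # With no true positives, false positives, or false negatives,
    # the score is undefined; return 0.0 by convention.
    return (1 + b2) * tp / denom if denom else 0.0


# Toy labels: 4 prolonged stays, 3 detected, 2 false alarms.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
print(round(fbeta_score(toy_true, toy_pred, beta=2.0), 3))  # 0.714
print(round(fbeta_score(toy_true, toy_pred, beta=1.0), 3))  # 0.667
```

In the toy example, recall (0.75) exceeds precision (0.60), so F2 is higher than F1. This reflects a preference for catching prolonged stays over avoiding false alarms.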
CONCLUSIONS
SHAP-value-based variable selection reduced the number of variables while maintaining equivalent predictive performance. By prioritizing the most predictive factors, it makes prediction of prolonged LOS easier to implement in routine clinical practice.