Abstract
Background
Using XGBoost (XGB), this study demonstrates how flexible machine learning modelling can complement traditional statistical modelling (multinomial logistic regression) as a sensitivity analysis and predictive modelling tool in occupational health research.
Design
The study predicts welfare dependency for a cohort at 1, 3, and 5 years of follow-up using XGB and multinomial logistic regression (MLR). The models’ predictive ability is evaluated using tenfold cross-validation (internal validation) and geographical validation (semi-external validation). In addition, we calculate and graphically assess Shapley additive explanation (SHAP) values from the XGB model to examine deviations from linearity assumptions, including interactions. The study population consists of all 20–54-year-olds on long-term sickness absence due to self-reported common mental disorders (CMD) between April 26, 2010, and September 2012 in 21 (of 98) Danish municipalities that participated in the Danish Return to Work program. The total sample of 19,664 observations is split geospatially into a development set (n = 9,756) and a test set (n = 9,908).
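The validation design described above can be sketched in a few lines. This is a minimal illustration on toy data, not the study's code: the record layout, the `municipality` key, the choice of which units go to the test set, and the helper names `geographic_split` and `k_fold_indices` are all assumptions made for the example.

```python
import random


def geographic_split(records, test_units, unit_key="municipality"):
    """Split records so that no geographic unit appears in both sets.

    `unit_key` and `test_units` are hypothetical names for this sketch.
    """
    dev = [r for r in records if r[unit_key] not in test_units]
    test = [r for r in records if r[unit_key] in test_units]
    return dev, test


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Toy data: 21 hypothetical municipalities with 50 records each.
records = [{"id": i, "municipality": i % 21} for i in range(1050)]

# Arbitrary allocation of 10 municipalities to the semi-external test set.
test_units = set(range(11, 21))
dev, test = geographic_split(records, test_units)

# Tenfold cross-validation (internal validation) inside the development set.
folds = k_fold_indices(len(dev), k=10)
for fold_no, held_out in enumerate(folds):
    held_out_set = set(held_out)
    train_idx = [j for j in range(len(dev)) if j not in held_out_set]
    # A model would be fitted on dev[train_idx] and scored on dev[held_out].
    assert len(train_idx) + len(held_out) == len(dev)

print(len(dev), len(test), [len(f) for f in folds])
```

Splitting by whole municipalities rather than by individual records is what makes the second evaluation semi-external: the test set comes from places the model never saw during development.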
Results
There was no practical difference in predictive ability between the XGB and MLR models. Industry, job skills, citizenship, unemployment insurance, gender, and period had limited importance in predicting welfare dependency in both models. In contrast, welfare dependency history and reason for sickness absence were strong predictors. Graphical SHAP analysis of the XGB model did not indicate substantial deviations from the linearity assumptions implied by the multinomial logistic regression model.
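As a reading aid for the SHAP-based linearity check, the following sketch states the additive decomposition that SHAP values satisfy and the form they take for a linear model. The notation is introduced here and is not from the source, and the closed form for the linear case assumes independent features.

```latex
f(x) = \phi_0 + \sum_{j=1}^{p} \phi_j(x),
\qquad
\phi_j(x) = \beta_j \left( x_j - \mathbb{E}[x_j] \right)
\quad \text{(linear model, independent features)}
```

Under these assumptions, plotting the XGB model's \phi_j(x) against x_j on the log-odds scale and seeing a roughly straight line with little vertical spread is consistent with a linear, additive effect of predictor j. Curvature would suggest nonlinearity, and vertical spread at a fixed x_j would suggest interactions.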
Conclusion
Flexible machine learning models like XGB can supplement traditional statistical methods like multinomial logistic regression in occupational health research. For a given set of predictors, they provide a benchmark for predictive performance, for how well the traditional model captures important associations, and for detecting potential violations of linearity.
Trial registration
ISRCTN43004323.
Funder
Danish Ministry of Employment
Publisher
Springer Science and Business Media LLC
Subject
Public Health, Environmental and Occupational Health