BACKGROUND
Fatality rates in the Agriculture, Forestry, and Fishing (AgFF) industries are historically the highest of any US sector, with a combined rate of 18.6 deaths per 100,000 workers. Although federal data show clear trends for fatal AgFF workplace injuries, capturing nonfatal agricultural injuries remains a challenge. The Northeast Center for Occupational Health and Safety (NEC) developed a naïve Bayes-based classification strategy to extract nonfatal injury cases from pre-hospital emergency medical services (EMS) free-text records.
OBJECTIVE
The aim of this paper is to improve retrieval rates, measured as the false positive rate required to reach a true positive rate of 0.90, by benchmarking naïve Bayes against three other algorithms: elastic net regression, support vector machines, and boosted decision trees (XGBoost).
METHODS
Using a labeled, fully one-hot encoded gold-standard dataset (N=60,143) with substantial (24%) missing data, we benchmarked these algorithms on complete-case data (N=44,566) and on data imputed under two schemes: grouped hot-deck imputation and recoding of missing units to the category “unknown.” All analyses used a stratified 75:25 train/test split.
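Two of the preprocessing steps named above, recoding missing values to "unknown" and a stratified 75:25 split, can be sketched as follows. This is an illustration under stated assumptions rather than the study's pipeline: the helper names, the field names (`age_group`, `location`), and the toy records are all hypothetical, categorical values are assumed to be strings, and grouped hot-deck imputation is not shown.

```python
import random
from collections import defaultdict


def recode_missing(records, fields, placeholder="unknown"):
    """Replace missing (None or absent) categorical values with a placeholder."""
    return [
        {f: (r.get(f) if r.get(f) is not None else placeholder) for f in fields}
        for r in records
    ]


def one_hot(records, fields):
    """One-hot encode categorical fields; returns (rows, column_names)."""
    columns = sorted({(f, r[f]) for r in records for f in fields})
    rows = [[1 if r[f] == value else 0 for f, value in columns] for r in records]
    return rows, columns


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split record indices so each class keeps roughly the same share
    in the training and test sets."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test.extend(indices[:n_test])
        train.extend(indices[n_test:])
    return sorted(train), sorted(test)


# Hypothetical records with missing values in both fields.
raw = [
    {"age_group": "adult", "location": None},
    {"age_group": None, "location": "farm"},
]
fields = ["age_group", "location"]
encoded, column_names = one_hot(recode_missing(raw, fields), fields)
print(column_names)
print(encoded)
```

Treating "unknown" as its own category keeps every record in the analysis, at the cost of adding one extra indicator column per field that has missing values.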
RESULTS
All models produced similar accuracies (~0.98) on complete-case data, though the false positive rate required to reach a true positive rate of 0.90 (FPR*) varied from 0.055 (XGBoost) to 0.20 (naïve Bayes) on training data and from 0.10 (elastic net) to 0.22 (naïve Bayes) on test-set predictions. On imputed data, accuracies ranged from 0.96 (naïve Bayes) to 0.98 (XGBoost) for training data, yielding FPR* values from 0.095 (XGBoost) to 0.34 (naïve Bayes). Predictions from imputed data showed FPR* ranging from 0.12 (XGBoost) to 0.41 (naïve Bayes), depending on the imputation scheme.
CONCLUSIONS
While all four models perform well on complete data, missing units are substantial and can result in both misclassification and omissions, each of which requires human coding. Relying on a machine learning method that is robust to missing data and to the choice of imputation method, such as XGBoost, is a reasonable approach to improving classification rates without omitting data.