Benchmarking four algorithms for improved classification of agricultural injury cases from free-text analysis of pre-hospital care reports (Preprint)

Published: 2024-08-21

Author:

Jones Laura E., Scott Erika, Krupa Nicole, Kern Megan, Hansen-Ruiz Cristina Silvia, Jenkins Paul

Abstract

BACKGROUND

Fatality rates in the Agriculture, Forestry, and Fishing (AgFF) industries are historically the highest of any US sector, with a combined rate of 18.6 deaths per 100,000 workers. Despite clear trends for fatal AgFF workplace injuries in federal data, challenges remain in capturing nonfatal agricultural injuries. The Northeast Center for Occupational Health and Safety (NEC) developed a naïve Bayes-based classification strategy to extract nonfatal injury cases from pre-hospital (EMS) free-text records.

OBJECTIVE

The aim of this paper is to improve retrieval rates, measured as the false positive rate required to obtain a true positive rate of 0.90, by benchmarking naïve Bayes against three other algorithms: elastic net regression, support vector machines, and boosted decision trees (XGBoost).
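
The retrieval metric above can be illustrated with a minimal sketch. It is not the authors' code: the function name `fpr_at_tpr`, the toy labels, and the toy scores are all assumptions made for illustration. It lowers the decision threshold until the true positive rate reaches the target, then reports the false positive rate at that threshold.

```python
def fpr_at_tpr(y_true, scores, target_tpr=0.90):
    """Return (fpr, threshold) at the highest threshold whose TPR >= target_tpr.

    y_true: 0/1 labels (1 = agricultural injury case, by assumption).
    scores: classifier scores, higher meaning "more likely a case".
    """
    if not 0.0 < target_tpr <= 1.0:
        raise ValueError("target_tpr must be in (0, 1]")
    positives = sum(1 for y in y_true if y == 1)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative label")

    # Sweep thresholds from the highest score downward.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Admit every record tied at this score before evaluating the rates.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        if tp / positives >= target_tpr:
            return fp / negatives, threshold
    # Not reached: once every record is admitted, TPR equals 1.0.


# Hypothetical toy data: 10 true cases followed by 10 non-cases.
toy_labels = [1] * 10 + [0] * 10
toy_scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.35, 0.10,
              0.55, 0.50, 0.45, 0.40, 0.30, 0.25, 0.20, 0.15, 0.05, 0.02]
toy_fpr, toy_threshold = fpr_at_tpr(toy_labels, toy_scores)
print(toy_fpr, toy_threshold)
```

In this toy example, nine of the ten true cases are recovered only once the threshold drops to 0.35, at which point four of the ten non-cases are also flagged. A lower value of this metric means fewer records need human review to reach the same recall.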

METHODS

Using a labeled, fully one-hot coded gold-standard dataset (N=60,143) with substantial (24%) missing data, we benchmark these algorithms on complete case data (N=44,566) and on data imputed under two schemes: grouped hot-deck imputation and recoding of missing units to the category “unknown.” All benchmarks use a 75:25 train/test split with stratified sampling.
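
As a rough sketch of two of the steps described here, the code below recodes missing values to an “unknown” category and then draws a 75:25 split stratified on the label. It uses only the Python standard library; the field names (`mechanism`, `location`, `is_ag_injury`), the helper names, and the toy records are hypothetical and not taken from the paper. Grouped hot-deck imputation and one-hot coding are not shown.

```python
import random
from collections import defaultdict


def recode_missing(records, fields, placeholder="unknown"):
    """Return copies of the records with None or empty values set to placeholder."""
    recoded = []
    for rec in records:
        new_rec = dict(rec)
        for field in fields:
            if new_rec.get(field) in (None, ""):
                new_rec[field] = placeholder
        recoded.append(new_rec)
    return recoded


def stratified_split(records, label_key, test_fraction=0.25, seed=0):
    """Split records into train/test, sampling each label group separately."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    train, test = [], []
    for label in sorted(by_label):
        group = list(by_label[label])
        rng.shuffle(group)
        n_test = round(len(group) * test_fraction)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical toy records: 8 non-cases and 4 cases, some with missing fields.
toy_records = (
    [{"mechanism": "fall", "location": None, "is_ag_injury": 0} for _ in range(4)]
    + [{"mechanism": None, "location": "road", "is_ag_injury": 0} for _ in range(4)]
    + [{"mechanism": "tractor", "location": "farm", "is_ag_injury": 1} for _ in range(2)]
    + [{"mechanism": "", "location": "farm", "is_ag_injury": 1} for _ in range(2)]
)
toy_clean = recode_missing(toy_records, ["mechanism", "location"])
toy_train, toy_test = stratified_split(toy_clean, "is_ag_injury")
print(len(toy_train), len(toy_test))
```

Stratifying on the label keeps the proportion of injury cases the same in both partitions, which matters when cases are rare relative to non-cases.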

RESULTS

All models produced similar accuracies (~0.98) on complete case data, though the false positive rate needed to reach a true positive rate of 0.90 (FPR*) varied from 0.055 (XGBoost) to 0.20 (naïve Bayes) on training data and from 0.10 (elastic net) to 0.22 (naïve Bayes) on predictions. On imputed data, training accuracies ranged from 0.96 (naïve Bayes) to 0.98 (XGBoost), yielding false positive rates from 0.095 (XGBoost) to 0.34 (naïve Bayes). Predictions from imputed data showed FPR* ranging from 0.12 (XGBoost) to 0.41 (naïve Bayes), depending on the imputation scheme.

CONCLUSIONS

While all four models perform well on complete data, missing units are substantial and can result in misclassification and in omissions, both requiring human coding. Reliance on a machine learning method that is robust to missing data and imputation method, such as XGBoost, is a reasonable approach to improving classification rates without omitting data.

Publisher

JMIR Publications Inc.
