Author:
Boo Yookyung,Choi Youngjin
Abstract
Abstract
Background
Injuries caused by RTA are classified under the International Classification of Diseases-10 as ‘S00-T99’ and represent imbalanced samples with a mortality rate of only 1.2% among all RTA victims. To predict the characteristics of external causes of road traffic accident (RTA) injuries and mortality, we compared performances based on differences in the correction and classification techniques for imbalanced samples.
Methods
The present study extracted and utilized data spanning over a 5-year period (2013–2017) from the Korean National Hospital Discharge In-depth Injury Survey (KNHDS), a national level survey conducted by the Korea Disease Control and Prevention Agency, A total of eight variables were used in the prediction, including patient, accident, and injury/disease characteristics. As the data was imbalanced, a sample consisting of only severe injuries was constructed and compared against the total sample. Considering the characteristics of the samples, preprocessing was performed in the study. The samples were standardized first, considering that they contained many variables with different units. Among the ensemble techniques for classification, the present study utilized Random Forest, Extra-Trees, and XGBoost. Four different over- and under-sampling techniques were used to compare the performance of algorithms using “accuracy”, “precision”, “recall”, “F1”, and “MCC”.
Results
The results showed that among the prediction techniques, XGBoost had the best performance. While the synthetic minority oversampling technique (SMOTE), a type of over-sampling, also demonstrated a certain level of performance, under-sampling was the most superior. Overall, prediction by the XGBoost model with samples using SMOTE produced the best results.
Conclusion
This study presented the results of an empirical comparison of the validity of sampling techniques and classification algorithms that affect the accuracy of imbalanced samples by combining two techniques. The findings could be used as reference data in classification analyses of imbalanced data in the medical field.
Publisher
Springer Science and Business Media LLC
Subject
Public Health, Environmental and Occupational Health
Cited by
9 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献