Random Forest and CatBoost with Handling Imbalanced Class for Detection of Risk Factors Anemia in Children (5-12 Years)

Published: 2024-06-05 Issue: 3 Volume: 11 Page: 302-312
ISSN: 2394-4099
Container-title: International Journal of Scientific Research in Science, Engineering and Technology
Short-container-title: Int J Sci Res Sci Eng Technol

Author:

Ditia Yosmita Praptiwi, Anang Kurnia, Anwar Fitrianto, Fitrah Ernawati

Abstract

Anemia in children aged 5-12 years remains a public health issue in Indonesia. Early detection and control of risk factors are crucial for prevention, and machine learning models, ensemble learning models in particular, offer a practical way to address this problem. However, class imbalance is common in health data. This study therefore performs classification modeling with two ensemble learning models, Random Forest (RF) and CatBoost, combined with six methods for handling class imbalance: Random Over Sampling, SMOTE, G-SMOTE, Random Under Sampling, Instance Hardness Threshold (IHT), and SMOTE-ENN. SHAP is used to explain the best-performing model based on Shapley values. The findings indicate that CatBoost with G-SMOTE resampling performs best among the methods compared. Averaged over 100 validation replicates, the CatBoost G-SMOTE model achieves a sensitivity of 0.7104, a specificity of 0.7043, a G-Mean of 0.7067, and an AUC of 0.7844. Handling class imbalance with G-SMOTE effectively increases sensitivity in both ensemble learning models, while SMOTE-ENN yields effective G-Mean values for the Random Forest algorithm. Based on Shapley values, the features contributing most to predicting anemia in children (5-12 years) are ferritin, vitamin A, vegetable consumption, diagnosed pneumonia, zinc, total calcium, and consumption of soft or carbonated drinks.
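As an illustration of the metrics reported in the abstract, the following is a minimal sketch of how sensitivity, specificity, and G-Mean can be computed from predicted labels. It is not the authors' code: the function names and the toy labels are hypothetical, and G-Mean is taken here as the usual geometric mean of sensitivity and specificity.

```python
import math


def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    `positive` marks the minority class of interest (here: anemia).
    Function name and signature are illustrative, not from the paper.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    sens = tp / (tp + fn)  # true positive rate
    spec = tn / (tn + fp)  # true negative rate
    return sens, spec


def g_mean(sens, spec):
    """Geometric mean of sensitivity and specificity."""
    return math.sqrt(sens * spec)


# Toy data (made up for illustration): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
toy_g = g_mean(sens, spec)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} G-Mean={toy_g:.4f}")

# G-Mean of the *averaged* sensitivity and specificity from the abstract.
g_of_averages = g_mean(0.7104, 0.7043)
print(f"G-Mean of averaged metrics: {g_of_averages:.4f}")
```

The value computed from the averaged sensitivity and specificity (about 0.7073) is slightly above the reported G-Mean of 0.7067. That is consistent with the G-Mean having been computed within each of the 100 replicates and then averaged, since the mean of per-replicate geometric means cannot exceed the geometric mean of the averaged metrics.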

Publisher

Technoscience Academy

References (25 articles)

1. F. Ofori, E. Maina, and R. Gitonga, “Using Machine Learning Algorithms to Predict Students’ Performance and Improve Learning Outcome: A Literature Based Review,” J. Inf. Technol., vol. 4, no. 1, pp. 2616–3573, 2020, [Online]. Available: https://stratfordjournals.org/journals/index.php/Journal-of-Information-and-Techn/article/view/480

2. P. Vuttipittayamongkol, E. Elyan, and A. Petrovski, “On the class overlap problem in imbalanced data classification,” Knowledge-Based Syst., vol. 212, p. 106631, 2021, doi: 10.1016/j.knosys.2020.106631.

3. R. Hassanzadeh, M. Farhadian, and H. Rafieemehr, “Hospital mortality prediction in traumatic injuries patients: comparing different SMOTE-based machine learning algorithms,” BMC Med. Res. Methodol., vol. 23, no. 1, pp. 1–15, 2023, doi: 10.1186/s12874-023-01920-w.

4. G. Douzas and F. Bacao, “Geometric SMOTE: a geometrically enhanced drop-in replacement for SMOTE,” Inf. Sci. (Ny)., vol. 501, pp. 118–135, 2019, doi: 10.1016/j.ins.2019.06.007.

5. M. R. Smith, T. Martinez, and C. Giraud-Carrier, “An instance level analysis of data complexity,” Mach. Learn., vol. 95, no. 2, pp. 225–256, 2014, doi: 10.1007/s10994-013-5422-z.