Investigating the Effect of Class Balancing Methods on the Performance of Machine Learning Techniques: Credit Risk Application

Published:2024-07-04 Issue:1 Volume:5 Page:55-70
ISSN:2757-637X
Container-title:İzmir Yönetim Dergisi

Author:

Milli Migraç Enes Furkan¹, Aras Serkan², Deveci Kocakoç İpek²

Affiliation:

1. İSTANBUL ÜNİVERSİTESİ

2. DOKUZ EYLÜL ÜNİVERSİTESİ

Abstract

Credit risk arises when customers fail to meet their obligations on loans granted by banks by the end of the specified term. Technological advances allow machine learning methods to be used in various sectors. Integrated into banks' creditworthiness assessment processes, these methods aim to make it easier to identify customers at risk. To support the most appropriate evaluation in banks' lending processes, this study balanced imbalanced data sets with resampling techniques that address class imbalance and investigated the effect on machine learning performance. The German, Australian and HMEQ credit data sets were used in the implementation phase. Several machine learning classification methods were used to detect risky customers: Logistic Regression (LR), K-Nearest Neighbors (KNN), Naive Bayes (NB), Support Vector Machines (SVM), Multilayer Perceptron (MLP), Decision Trees (DT), Random Forests (RF), Gradient Boosting Decision Trees (GBDT), Extremely Randomized Trees, and Hard and Soft Voting. Class imbalance was addressed with resampling and hybrid techniques: Random Oversampling (ROS), Random Undersampling (RUS), Balanced Bagging Classifier (BBC), SMOTE-Tomek Links and SMOTE-ENN. The performance on the three data sets was examined under four different scenarios. The hybrid method, which combines oversampling and undersampling to balance the classes, yielded the best classification performance among the machine learning techniques.
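
As a rough illustration of the two simplest resampling techniques named in the abstract, Random Oversampling (ROS) and Random Undersampling (RUS), the sketch below balances a toy label set using only the Python standard library. It is not the authors' implementation: the function names `random_oversample` and `random_undersample`, the `seed` parameter, and the toy data are assumptions made for this example, and the SMOTE-based hybrid methods are not covered.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """ROS sketch: duplicate randomly chosen rows of the smaller classes
    until every class has as many rows as the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(rows))  # sample with replacement
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """RUS sketch: keep a random subset of each class so that every class
    has as many rows as the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        rows = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(rows, target):  # sample without replacement
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 8 "good" customers (0) and 2 "risky" customers (1).
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2

X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
print("original:", Counter(y))
print("after ROS:", Counter(y_ros))
print("after RUS:", Counter(y_rus))
```

ROS keeps every original row and repeats minority rows, while RUS discards majority rows. Hybrid methods such as SMOTE-Tomek Links and SMOTE-ENN instead pair synthetic oversampling with a cleaning step that removes samples.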

Publisher

Dokuz Eylul University
