Research of data mining methods for classification of imbalanced data sets

Published:2024 Issue:1 Volume:6 Page:48-57
ISSN:2707-1898
Container-title:Ukrainian Journal of Information Technology
Short-container-title:UJIT

Author:

Doroshenko A. V., Savchuk D. Y.

Abstract

The rapid development of information technology, now used in every sphere of human life and activity, has led to the accumulation of extremely large amounts of data. Applying machine learning methods to this data can yield new, practically useful knowledge. The main goal of this paper is to study different machine learning methods for solving the classification problem and to compare their efficiency and accuracy. A separate task is data pre-processing, which addresses sample imbalance and identifies the principal components to be used in classification. For this purpose, an information system for classifying the bankruptcy status of a company with specified economic and financial characteristics was researched and developed. The study uses a dataset to evaluate the efficiency and quality of several existing classification algorithms: conventional and linear Support Vector Machine, Extra Trees, Random Forest, Decision Tree, Logistic Regression, Multilayer Perceptron Classifier, Gradient Boosting, and Naive Bayes Classifier. For data pre-processing, we scaled the data, used the SMOTE method to remove the imbalance of the training sample, and performed principal component analysis and L1 regularisation. Principal component analysis allowed us to identify the 15 principal components that have the greatest impact on classification accuracy and, accordingly, to use them in the classification process. Analysing the results, we found that the best classifier was Random Forest with 95.9% accuracy, and the worst was Naive Bayes with 85.1%.
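As an illustration of the oversampling step mentioned above, the following is a minimal sketch of the SMOTE idea: each synthetic point is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. It is a simplified variant written for illustration only and is not the implementation used in the paper; the function name `smote_sketch`, its parameters, and the toy data are assumptions.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    Hypothetical helper: `minority` is a list of equal-length feature vectors
    belonging to the minority class of the TRAINING sample only.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Find its k nearest minority neighbours by Euclidean distance.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        # Place the new point at a random position on the segment base -> other.
        gap = rng.random()
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Toy minority class (assumed data, not from the paper's dataset).
toy_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote_sketch(toy_minority, n_new=6, k=2, seed=1)
```

Because every synthetic point is a convex combination of two real minority samples, the generated points stay inside the region spanned by the minority class. Oversampling is applied to the training sample only, as the abstract states, so that the test data keep their original class distribution.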
To evaluate the quality of classification and select the best classifier, we use the confusion matrix, which records the number of true positive (TP), true negative (TN), false negative (FN), and false positive (FP) classification results, together with the metrics accuracy, precision, sensitivity (recall), F1, and ROC AUC. Accuracy is the percentage of correct answers given by the algorithm, while recall is the number of TPs divided by the number of TPs plus the number of FNs. Precision is the number of TPs divided by the number of TPs plus the number of FPs. F1 indicates the balance between precision and sensitivity. ROC AUC measures classification performance across different decision thresholds; it shows how well a model can distinguish between classes. The conclusions present the main results of the study and indicate the main direction of future work, namely, studying classification results for other datasets and more efficient processing and analysis.
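The metric definitions above can be made concrete with a short sketch that computes them from confusion-matrix counts, using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 as the harmonic mean of the two). The class name `ConfusionMatrix`, the function `roc_auc`, and the example counts and scores are assumptions for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts of binary classification outcomes (hypothetical helper)."""
    tp: int
    fp: int
    tn: int
    fn: int

    def accuracy(self) -> float:
        # Share of all predictions that are correct.
        total = self.tp + self.fp + self.tn + self.fn
        return (self.tp + self.tn) / total if total else 0.0

    def precision(self) -> float:
        # Share of predicted positives that are truly positive.
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    def recall(self) -> float:
        # Share of actual positives that were found (sensitivity).
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    def f1(self) -> float:
        # Harmonic mean of precision and recall.
        p, r = self.precision(), self.recall()
        return 2 * p * r / (p + r) if (p + r) else 0.0


def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outscores a random
    negative (ties count as one half). Labels are 1 (positive) or 0."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Assumed counts for an imbalanced test set of 1000 companies.
cm = ConfusionMatrix(tp=40, fp=10, tn=940, fn=10)
# Assumed classifier scores for four samples.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With these assumed counts, accuracy is 0.98 while precision and recall are both 0.80. This gap is the reason accuracy alone can be misleading on imbalanced data and why precision, recall, F1, and ROC AUC are reported alongside it.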

Publisher

Lviv Polytechnic National University
