Tıbbi Verilerde Heinz Ortalamasına Dayalı Yeni Sentetik Veriler Üreterek Veri Kümesini Dengeleme
[English translation: Balancing a Dataset by Generating New Synthetic Data Based on the Heinz Mean in Medical Data]

Published:2022-06-30 Issue:3 Volume:22 Page:570-576
ISSN:2149-3367
Container-title:Afyon Kocatepe University Journal of Sciences and Engineering
language:tr
Author:

GÜMÜŞ İbrahim Halil¹, GÜLDAL Serkan²

Affiliation:

1. Adıyaman University

2. Adıyaman University

Abstract

Advances in science and technology have caused data volumes to grow rapidly, and unbalanced datasets have become common as a result. A dataset is unbalanced if its classes are not approximately equally represented. Classification performance drops on such data because classification algorithms are developed on the assumption that datasets are balanced. Because classification accuracy favors the majority class, the minority class is often misclassified. Most datasets, especially those used in the medical field, have an unbalanced distribution. Several recent studies have addressed this imbalance through undersampling and oversampling. In this study, a distance- and mean-based resampling method is used to produce synthetic samples from the minority class. For the resampling process, the nearest neighbors of every data point in the minority class were determined using the Euclidean distance. Based on these neighbors, the Heinz mean was used to generate the desired number of new synthetic samples between each sample and its neighbors until the classes were balanced. The Random Forest (RF) and Support Vector Machine (SVM) algorithms were used to classify the raw and balanced datasets, and the results were compared. The proposed method was also compared with other well-known methods: Random Over Sampling (ROS), Random Under Sampling (RUS), and Synthetic Minority Oversampling Technique (SMOTE). The dataset balanced with the proposed resampling method yielded higher classification performance than the raw dataset and the datasets produced by the other methods. For raw and resampled data respectively, RF accuracy is 0.751 and 0.799, and SVM accuracy is 0.762 and 0.781. The other metrics (Precision, Recall, and F1 Score) improve likewise.
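
The resampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the Heinz parameter is chosen or how many neighbors are used. The sketch assumes the standard Heinz mean, H_v(a, b) = (a^v b^(1-v) + a^(1-v) b^v) / 2 for 0 <= v <= 1, applied feature by feature to non-negative data. The names heinz_mean and heinz_oversample, the parameters k, n_new, and seed, and the random draw of v are all hypothetical choices made for illustration.

```python
import numpy as np


def heinz_mean(a, b, nu):
    """Elementwise Heinz mean of non-negative a and b, for 0 <= nu <= 1."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return 0.5 * (a**nu * b**(1.0 - nu) + a**(1.0 - nu) * b**nu)


def heinz_oversample(X_min, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples (hypothetical helper).

    X_min: array of shape (n_samples, n_features), minority class only.
    """
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    if n < 2:
        raise ValueError("need at least two minority samples")
    if np.any(X_min < 0):
        raise ValueError("the Heinz mean needs non-negative feature values")
    k = min(k, n - 1)
    rng = np.random.default_rng(seed)

    # Pairwise Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_min.shape[1]))
    for s in range(n_new):
        i = s % n                          # cycle through minority samples
        j = neighbors[i, rng.integers(k)]  # one of its k nearest neighbors
        nu = rng.uniform(0.0, 0.5)         # assumption: random Heinz parameter
        synthetic[s] = heinz_mean(X_min[i], X_min[j], nu)
    return synthetic
```

Because the Heinz mean lies between the geometric mean (v = 1/2) and the arithmetic mean (v = 0 or 1), each synthetic feature value falls between the corresponding values of the two parent samples. To balance a dataset under these assumptions, n_new would be set to the difference between the majority and minority class sizes, and the synthetic rows appended with the minority label.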

Publisher

Afyon Kocatepe Universitesi Fen Ve Muhendislik Bilimleri Dergisi

Subject

General Engineering

References: 18 articles.

1. Batista GE, Prati RC, Monard MC, 2004. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsletter, 6 (1), 20-29.

2. Breiman L, 2001. Random forests. Machine Learning, 45 (1), 5-32.

3. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP, 2002. SMOTE: Synthetic Minority Over-Sampling Technique. Journal of Artificial Intelligence Research, 16, 321-357.

4. Chawla NV, Japkowicz N, Kotcz A, 2004. Special Issue on Learning from Imbalanced Data Sets. ACM SIGKDD Explorations Newsletter, 6 (1), 1-6.

5. Dal A, Gümüş İH, Güldal S, Yavaş M, 2021. A New Resampling Approach Based on Weighted Geometric Mean for Unbalanced Data. Journal of Engineering Science of Adiyaman University, 8 (15), 343-352. doi:10.54365/adyumbd.940539.