SFC: A Sampling from Clusters for Reduction of Dataset Size

Published: 2023-06-22

Author:

Tigga Onima¹, Pal Jaya¹, Mustafi Debjani¹

Affiliation:

1. Birla Institute of Technology

Abstract

Managing enormous real-world datasets is difficult, so it is desirable to reduce the size of a dataset without degrading the accuracy obtained on the original data. In this study, classification of the white wine dataset is examined using several machine learning techniques: Naive Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), K Nearest Neighbour (KNN), and Logistic Regression (LR). We apply these techniques to the dataset and present the Sampling from Clusters (SFC) approach. In SFC, the white wine dataset is first clustered, and then 95% of the data from each cluster is drawn and combined into a single dataset for the classification process. The same procedure is repeated for 90%, 85%, and 80% of the data. For comparison, we used random sampling (RS) to select 95% of the data from the same dataset and compared the results with SFC using evaluation metrics including accuracy, precision, recall, F1-score, Receiver Operating Characteristic (ROC), Area Under the Curve (AUC), binomial confidence interval (CI), and mean squared error (MSE). The RS procedure is likewise repeated with 90%, 85%, and 80% of the dataset. The confidence intervals become tighter as the quantity of test data N increases; they range from 0.72 to 0.76 for NB, 0.73 to 0.79 for SVM, 0.82 to 0.86 for RF, 0.75 to 0.77 for KNN, and 0.74 to 0.80 for LR.
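
The sampling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering algorithm, so the sketch assumes cluster labels have already been produced by some clustering step, and the function names (`sample_from_clusters`, `random_sample`, `binomial_ci`) are hypothetical. The confidence interval uses the normal approximation to the binomial, which is also an assumption about how the reported intervals might have been computed.

```python
import math
import random
from collections import defaultdict


def sample_from_clusters(labels, fraction, seed=0):
    """Return sorted row indices that keep `fraction` of every cluster.

    `labels[i]` is the cluster id of row i (assumed to come from any
    clustering algorithm; the abstract does not say which one is used).
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)
    members = defaultdict(list)
    for index, label in enumerate(labels):
        members[label].append(index)
    kept = []
    for label in sorted(members):
        rows = members[label]
        # Keep at least one row so that no cluster disappears entirely.
        k = max(1, round(fraction * len(rows)))
        kept.extend(rng.sample(rows, k))
    return sorted(kept)


def random_sample(n_rows, fraction, seed=0):
    """Baseline (RS): draw the same fraction while ignoring cluster structure."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), round(fraction * n_rows)))


def binomial_ci(accuracy, n_test, z=1.96):
    """Normal-approximation interval for an accuracy measured on n_test samples."""
    half_width = z * math.sqrt(accuracy * (1.0 - accuracy) / n_test)
    return max(0.0, accuracy - half_width), min(1.0, accuracy + half_width)


# Toy labels standing in for a clustered dataset (three clusters).
labels = [0] * 100 + [1] * 60 + [2] * 40
sfc_rows = sample_from_clusters(labels, fraction=0.95)
rs_rows = random_sample(len(labels), fraction=0.95)
```

Calling `sample_from_clusters` with `fraction` set to 0.90, 0.85, and 0.80 would mirror the repeated runs mentioned in the abstract. Because the half-width of `binomial_ci` shrinks with the square root of `n_test`, the sketch also reproduces the qualitative statement that the interval tightens as N grows.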

Publisher

Research Square Platform LLC
