Impact of Dataset Size on Classification Performance: An Empirical Evaluation in the Medical Domain

Published: 2021-01-15
Container-title: Applied Sciences
Volume: 11
Issue: 2
Page: 796
ISSN: 2076-3417
Language: en

Author:

Althnian Alhanoof, AlSaeed Duaa, Al-Baity Heyam, Samha Amani, Dris Alanoud Bin, Alzakari Najla, Abou Elwafa Afnan, Kurdi Heba

Abstract

Dataset size is considered a major concern in the medical domain, where lack of data is a common occurrence. This study aims to investigate the impact of dataset size on the overall performance of supervised classification models. We examined the performance of six models widely used in the medical field, namely support vector machine (SVM), neural network (NN), C4.5 decision tree (DT), random forest (RF), AdaBoost (AB), and naïve Bayes (NB), on eighteen small medical UCI datasets. We further implemented three dataset size reduction scenarios on two large datasets and analyzed the performance of the models when trained on each resulting dataset with respect to accuracy, precision, recall, F-score, specificity, and area under the ROC curve (AUC). Our results indicated that the overall performance of classifiers depends on how well a dataset represents the original distribution rather than on its size. Moreover, we found that the most robust models for limited medical data are AB and NB, followed by SVM, and then RF and NN, while the least robust model is DT. Furthermore, an interesting observation is that a machine learning model's robustness to limited data does not necessarily imply that it provides the best performance compared to other models.
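The six evaluation measures named in the abstract can be illustrated with a minimal Python sketch. It assumes a binary task with a single positive class and uses the standard textbook definitions. This record does not state how the paper computed or averaged these measures (for example on multi-class datasets), so the function names `binary_metrics` and `auc` and the toy labels in the usage note are hypothetical and not the authors' code.

```python
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    accuracy: float
    precision: float
    recall: float
    f_score: float
    specificity: float


def _ratio(num, den):
    # Return 0.0 instead of raising when a denominator is empty.
    return num / den if den else 0.0


def binary_metrics(y_true, y_pred, positive=1):
    """Standard confusion-matrix metrics for one positive class (assumed setup)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # also called sensitivity
    return BinaryMetrics(
        accuracy=_ratio(tp + tn, len(pairs)),
        precision=precision,
        recall=recall,
        f_score=_ratio(2 * precision * recall, precision + recall),
        specificity=_ratio(tn, tn + fp),
    )


def auc(y_true, scores, positive=1):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This is quadratic in the number of
    samples, which is acceptable for small datasets.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

As a usage example on made-up labels, `binary_metrics([1, 1, 1, 1, 0, 0, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0, 0, 0])` has 3 true positives, 1 false negative, 1 false positive, and 5 true negatives, so it gives accuracy 0.8, precision 0.75, recall 0.75, and specificity 5/6. Reporting specificity alongside recall matters on imbalanced medical data, where accuracy alone can hide poor detection of the minority class.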

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/11/2/796/pdf

Cited by 180 articles.

1. MuMCyp_Net: A multimodal neural network for the prediction of Cyp450 inhibition;Expert Systems with Applications;2024-12

2. Support vector machine in the elementomic evaluation of arugula (Eruca Sativa) and lettuce (Lactuca sativa) grown in soils from a decommissioned mining area;Journal of Food Composition and Analysis;2024-11

3. Post-processing of short-term quantitative precipitation forecast with the multi-stream convolutional neural network;Atmospheric Research;2024-10

4. The Application of Artificial Intelligence to Acoustic Data in Otolaryngology;Otolaryngologic Clinics of North America;2024-10

5. Machine learning to predict the production of bio-oil, biogas, and biochar by pyrolysis of biomass: a review;Environmental Chemistry Letters;2024-09-05