Effect of Dataset Size and Train/Test Split Ratios in QSAR/QSPR Multiclass Classification

Published:2021-02-19 Issue:4 Volume:26 Page:1111
ISSN:1420-3049
Container-title:Molecules
language:en
Short-container-title:Molecules

Author:

Rácz Anita, Bajusz Dávid, Héberger Károly

Abstract

Applied datasets can vary from a few hundred to thousands of samples in typical quantitative structure-activity/property (QSAR/QSPR) relationships and classification. However, the size of the datasets and the train/test split ratios can greatly affect the outcome of the models, and thus the classification performance itself. We compared several combinations of dataset sizes and split ratios with five different machine learning algorithms to find the differences or similarities and to select the best parameter settings in nonbinary (multiclass) classification. It is also known that the models are ranked differently according to the performance merit(s) used. Here, 25 performance parameters were calculated for each model, then factorial ANOVA was applied to compare the results. The results clearly show the differences not just between the applied machine learning algorithms but also between the dataset sizes and to a lesser extent the train/test split ratios. The XGBoost algorithm could outperform the others, even in multiclass modeling. The performance parameters reacted differently to the change of the sample set size; some of them were much more sensitive to this factor than the others. Moreover, significant differences could be detected between train/test split ratios as well, exerting a great effect on the test validation of our models.
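The kind of comparison the abstract describes, crossing dataset sizes with train/test split ratios and scoring each combination, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' protocol: the data are synthetic Gaussian blobs, the classifier is a simple nearest-centroid rule swapped in for the five algorithms of the study, only two of the many possible performance parameters are computed, and the sizes, split ratios, and all function names are hypothetical.

```python
import random
from collections import defaultdict


def make_data(n_samples, n_classes=3, n_features=4, seed=0):
    """Synthetic multiclass data: one Gaussian blob per class (assumed stand-in for descriptors)."""
    rng = random.Random(seed)
    data = []
    for i in range(n_samples):
        label = i % n_classes
        point = [rng.gauss(label * 1.5, 1.0) for _ in range(n_features)]
        data.append((point, label))
    return data


def stratified_split(data, train_fraction, seed=0):
    """Split each class separately so train and test keep similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for point, label in data:
        by_class[label].append((point, label))
    train, test = [], []
    for label in sorted(by_class):
        items = by_class[label]
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test


def fit_centroids(train):
    """Nearest-centroid 'model': the per-class mean of the training points."""
    sums, counts = {}, defaultdict(int)
    for point, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(point)
        for j, value in enumerate(point):
            sums[label][j] += value
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, point):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])),
    )


def evaluate(centroids, test):
    """Return (accuracy, balanced accuracy) on the test set."""
    hits, totals = defaultdict(int), defaultdict(int)
    for point, label in test:
        totals[label] += 1
        if predict(centroids, point) == label:
            hits[label] += 1
    accuracy = sum(hits.values()) / sum(totals.values())
    balanced = sum(hits[c] / totals[c] for c in totals) / len(totals)
    return accuracy, balanced


def run_grid(sizes=(100, 500, 1000), train_fractions=(0.5, 0.6, 0.7, 0.8)):
    """Score every (dataset size, train fraction) combination; values are illustrative."""
    results = {}
    for n in sizes:
        data = make_data(n)
        for fraction in train_fractions:
            train, test = stratified_split(data, fraction)
            results[(n, fraction)] = evaluate(fit_centroids(train), test)
    return results


if __name__ == "__main__":
    for (n, fraction), (acc, bal) in run_grid().items():
        print(f"n={n:5d} train={fraction:.1f} accuracy={acc:.3f} balanced={bal:.3f}")
```

In a real study each cell of this grid would be repeated over several random splits and algorithms, and the resulting table of performance parameters would then be the input to a factorial analysis such as the ANOVA mentioned above.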

Funder

National Research, Development and Innovation Office of Hungary

Publisher

MDPI AG

Subject

Chemistry (miscellaneous),Analytical Chemistry,Organic Chemistry,Physical and Theoretical Chemistry,Molecular Medicine,Drug Discovery,Pharmaceutical Science

Link

https://www.mdpi.com/1420-3049/26/4/1111/pdf

Reference: 43 articles.

1. Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author)

2. Multi-Level Comparison of Machine Learning Classifiers and Their Performance Metrics

3. Generic performance measure for multiclass-classifiers

4. SMOTE: Synthetic Minority Over-sampling Technique

5. ON METHODS FOR IMPROVING THE ACCURACY OF MULTICLASS CLASSIFICATION ON IMBALANCED DATA

Cited by 145 articles.

1. A method for the automated digitalization of fluid circuit diagrams;Computers in Industry;2024-11

2. Exploring occupant behaviors and interactions in buildings with energy-efficient renovations: A hybrid virtual-physical experimental approach;Building and Environment;2024-11

3. Optimizing Lung Condition Categorization through a Deep Learning Approach to Chest X-ray Image Analysis;BioMedInformatics;2024-09-10

4. Trade-off between training and testing ratio in machine learning for medical image processing;PeerJ Computer Science;2024-09-06

5. Performance prediction of sludge volume index of oxygenic photogranule based wastewater treatment system using machine learning algorithms;Journal of Water Process Engineering;2024-09