An Oversampling Technique with Descriptive Statistics

Author:

Sug Hyontai1

Affiliation:

1. Department of Computer Engineering, Dongseo University, 47 Jurye-ro, Sasang-gu, Busan, 47011, REPUBLIC OF KOREA

Abstract

Oversampling is often applied as a means to win a better knowledge model. Several oversampling methods based on synthetic instances have been suggested, and SMOTE is one of the representative oversampling methods that can generate synthetic instances of a minor class. Until now, the oversampled data has been used conventionally to train machine learning models without statistical analysis, so it is not certain that the machine learning models will be fine for unseen cases in the future. However, because such synthetic data is different from the original data, we may wonder how much it resembles the original data so that the oversampled data is worth using to train machine learning models. For this purpose, I conducted this study on a representative dataset called wine data in the UCI machine learning repository, which is one of the datasets that has been experimented with by many researchers in research for knowledge discovery models. I generated synthetic data iteratively using SMOTE, and I compared the synthetic data with the original data of wine to see if it was statistically reliable using a box plot and t-test. Moreover, since training a machine learning model by supplying more high-quality training instances increases the probability of obtaining a machine learning model with higher accuracy, it was also checked whether a better machine learning model of random forests can be obtained by generating much more synthetic data than the original data and using it for training the random forests. The results of the experiment showed that small-scale oversampling produced synthetic data with statistical characteristics that were statistically slightly different from the original data, but when the oversampling rate was relatively high, it was possible to generate data with statistical characteristics similar to the original data, in other words, after generating high-quality training data, and by using it to train the random forests, it was possible to generate random forests with higher accuracy than using the original data alone, from 97.75% to 100%. Therefore, by supplying additional statistically reliable synthetic data as a way of oversampling, it was possible to create a machine-learning model with a higher predictive rate.

Publisher

World Scientific and Engineering Academy and Society (WSEAS)

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3