Smart data augmentation: One equation is all you need

Author:

Zhang Yuhao1, Tang Lu2, Huang Yuxiao3, Ma Yan2

Affiliation:

1. Department of Statistics, George Washington University, Washington, DC, USA

2. Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA

3. Data Science Program, George Washington University, Washington, DC, USA

Abstract

Class imbalance is a common and critical challenge in machine learning classification problems, and it often results in low prediction accuracy. While numerous methods, especially data augmentation methods, have been proposed to address this issue, a method that works well on one dataset may perform poorly on another. To the best of our knowledge, there is still no single best approach for handling class imbalance that can be applied uniformly. In this paper, we propose an approach named smart data augmentation (SDA), which aims to augment imbalanced data in an optimal way to maximize downstream classification accuracy. The key novelty of SDA is an equation that yields an augmentation method providing a unified representation of existing sampling methods for handling multi-level class imbalance and allowing easy fine-tuning. Under this framework, SDA can be seen as a generalization of traditional methods, which in turn can be viewed as special cases of SDA. Empirical results on a wide range of datasets demonstrate that SDA can significantly improve the performance of the most popular classifiers, such as random forest, multi-layer perceptron, and histogram-based gradient boosting.
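The abstract does not state the SDA equation itself, so it is not reproduced here. As a rough illustration of how a single tunable rule can contain familiar sampling schemes as special cases, the sketch below uses a hypothetical interpolation parameter `alpha` (an assumption, not taken from the paper): `alpha = 0` leaves the data unchanged, and `alpha = 1` corresponds to plain random oversampling up to the majority class size. The function names `target_counts` and `random_oversample` are likewise illustrative.

```python
import random
from collections import Counter


def target_counts(counts, alpha):
    """Interpolate each class size toward the largest class.

    alpha = 0 keeps the original sizes; alpha = 1 fully balances the classes.
    This rule is a hypothetical stand-in, NOT the SDA equation from the paper.
    """
    n_max = max(counts.values())
    return {label: round(n + alpha * (n_max - n)) for label, n in counts.items()}


def random_oversample(X, y, alpha, seed=0):
    """Duplicate randomly chosen rows of each class until it reaches its target size."""
    rng = random.Random(seed)
    targets = target_counts(Counter(y), alpha)
    X_out, y_out = list(X), list(y)
    for label, target in targets.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        # Append (target - current size) resampled copies of this class.
        for _ in range(target - len(rows)):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


# Toy three-class data with a 90 / 10 / 2 imbalance.
y = ["a"] * 90 + ["b"] * 10 + ["c"] * 2
X = [[i] for i in range(len(y))]

X_half, y_half = random_oversample(X, y, alpha=0.5)  # partial rebalancing
X_full, y_full = random_oversample(X, y, alpha=1.0)  # full random oversampling
```

In a setup like this, `alpha` would be tuned against validation accuracy of the downstream classifier, which is the spirit of the fine-tuning the abstract mentions; how SDA actually parameterizes and tunes its augmentation is described in the paper, not here.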

Funder

National Institutes of Health

Publisher

Wiley

