Smart data augmentation: One equation is all you need

Published: 2024-03-27 Issue: 2 Volume: 17
ISSN: 1932-1864
Container-title: Statistical Analysis and Data Mining: The ASA Data Science Journal
Language: en
Short-container-title: Statistical Analysis

Author:

Zhang Yuhao¹, Tang Lu², Huang Yuxiao³, Ma Yan²

Affiliation:

1. Department of Statistics, George Washington University, Washington, DC, USA

2. Department of Biostatistics, University of Pittsburgh, Pittsburgh, Pennsylvania, USA

3. Data Science Program, George Washington University, Washington, DC, USA

Abstract

Class imbalance is a common and critical challenge in machine learning classification problems, resulting in low prediction accuracy. While numerous methods, especially data augmentation methods, have been proposed to address this issue, a method that works well on one dataset may perform poorly on another. To the best of our knowledge, there is still no single best approach for handling class imbalance that can be applied uniformly. In this paper, we propose an approach named smart data augmentation (SDA), which aims to augment imbalanced data in an optimal way to maximize downstream classification accuracy. The key novelty of SDA is an equation that yields an augmentation method providing a unified representation of existing sampling methods for handling multi‐level class imbalance and allowing easy fine‐tuning. Under this framework, SDA can be seen as a generalization of traditional methods, which in turn can be viewed as special cases of SDA. Empirical results on a wide range of datasets demonstrate that SDA can significantly improve the performance of the most popular classifiers, such as random forest, multi‐layer perceptron, and histogram‐based gradient boosting.
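
The abstract does not state the SDA equation itself, so the following is only an illustrative sketch of the general idea that a single tunable rule can contain traditional sampling methods as special cases. The interpolation rule, the parameter name alpha, and the helper names target_counts and resample are assumptions made for this example and are not taken from the paper. With alpha = 0 the data is left unchanged, with alpha = 1 and mode "over" it reduces to random oversampling to the majority class size, and with alpha = 1 and mode "under" it reduces to random undersampling to the minority class size.

```python
import random
from collections import Counter, defaultdict


def target_counts(counts, alpha, n_ref):
    """Hypothetical rule (not the SDA equation): move each class size
    a fraction alpha of the way toward the reference size n_ref."""
    return {c: round((1 - alpha) * n + alpha * n_ref) for c, n in counts.items()}


def resample(X, y, alpha, mode="over", seed=0):
    """Randomly resample (X, y) so class sizes match target_counts.

    mode="over"  uses the largest class size as the reference,
    mode="under" uses the smallest class size as the reference.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)

    counts = {c: len(rows) for c, rows in by_class.items()}
    n_ref = max(counts.values()) if mode == "over" else min(counts.values())
    targets = target_counts(counts, alpha, n_ref)

    X_new, y_new = [], []
    for c, rows in by_class.items():
        t = targets[c]
        if t <= len(rows):
            # Shrink the class: draw t rows without replacement.
            chosen = rng.sample(rows, t)
        else:
            # Grow the class: keep all rows, add random duplicates.
            chosen = rows + rng.choices(rows, k=t - len(rows))
        X_new.extend(chosen)
        y_new.extend([c] * t)
    return X_new, y_new


if __name__ == "__main__":
    # Toy multi-level imbalance: 90 / 10 / 2 samples in three classes.
    X = list(range(102))
    y = [0] * 90 + [1] * 10 + [2] * 2
    for alpha in (0.0, 0.5, 1.0):
        _, y_over = resample(X, y, alpha, mode="over")
        print(alpha, dict(Counter(y_over)))
```

In a sketch like this, alpha would be tuned like any other hyperparameter, for example by cross-validating the downstream classifier over a small grid of values. How SDA actually parameterizes and tunes its augmentation is described in the paper, not here.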

Funder

National Institutes of Health

Publisher

Wiley

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1002/sam.11672
