BACKGROUND
Data imbalance is a critical issue in big data analysis, particularly for datasets in which the class of interest is rare, such as real-world healthcare data, spam detection corpora, and financial fraud detection datasets. Numerous data balancing methods have been proposed to improve machine learning performance, and prior research claims that the Synthetic Minority Over-sampling Technique (SMOTE) and SMOTE-based data augmentation methods can improve algorithm performance. However, we observed that many online tutorials evaluate these methods on synthesized datasets, which biases the evaluation process and produces falsely positive performance improvements.
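For concreteness, SMOTE-style oversampling synthesizes new minority-class samples by interpolating between existing minority samples and their nearest minority-class neighbors. The sketch below is a minimal illustration using the imbalanced-learn library; the toy dataset and parameters are illustrative assumptions and are not the data or settings used in this study.

```python
# Minimal illustration of SMOTE oversampling (imbalanced-learn).
# The toy dataset and parameters are illustrative only, not those used in this study.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a toy 2-class dataset with roughly a 5% minority class.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=42
)
print("original class counts:", Counter(y))

# SMOTE interpolates between minority samples and their k nearest
# minority-class neighbors to create synthetic minority examples.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("resampled class counts:", Counter(y_res))
```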
OBJECTIVE
In this study, we propose a new evaluation framework for imbalanced data learning methods and experiment with five data balancing techniques to assess their impact on machine learning algorithm performance.
METHODS
We collected 8 imbalanced real-world healthcare datasets, spanning different domains and imbalance rates. We applied 6 data augmentation methods in conjunction with 11 machine learning techniques to test whether data augmentation improves machine learning performance. Our proposed Evaluation Framework for Imbalanced Data Learning (EFIDL) uses 5-fold cross-validation; we compared results obtained under traditional data augmentation evaluation methods with those obtained under EFIDL.
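The sketch below contrasts a leakage-prone evaluation, in which the whole dataset is augmented before cross-validation, with a leakage-free 5-fold evaluation that augments only the training folds and tests on untouched, still-imbalanced data. This is a minimal sketch of our reading of the two evaluation setups, not the exact EFIDL implementation; the classifier, metric, and SMOTE settings are assumptions chosen for illustration.

```python
# Sketch: two ways of evaluating SMOTE under 5-fold cross-validation.
# Classifier, metric, and SMOTE settings are illustrative assumptions;
# this is not the exact EFIDL implementation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# (1) Leakage-prone: oversample first, then cross-validate. Synthetic points
# derived from test-fold neighbors leak into the training folds, inflating
# the apparent performance.
X_aug, y_aug = SMOTE(random_state=0).fit_resample(X, y)
leaky_scores = []
for train_idx, test_idx in cv.split(X_aug, y_aug):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_aug[train_idx], y_aug[train_idx])
    leaky_scores.append(f1_score(y_aug[test_idx], clf.predict(X_aug[test_idx])))

# (2) Leakage-free: oversample inside each training fold only and evaluate on
# untouched, still-imbalanced test folds.
clean_scores = []
for train_idx, test_idx in cv.split(X, y):
    pipe = Pipeline([
        ("smote", SMOTE(random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X[train_idx], y[train_idx])  # SMOTE applied to the training fold only
    clean_scores.append(f1_score(y[test_idx], pipe.predict(X[test_idx])))

print("leakage-prone mean F1:", np.mean(leaky_scores))
print("leakage-free mean F1: ", np.mean(clean_scores))
```

In the leakage-free setup the sampler lives inside an imbalanced-learn Pipeline, so resampling is fitted only on the training portion of each fold and the test fold retains its original class distribution.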
RESULTS
Traditional data augmentation evaluation methods can give a false impression of improved machine learning performance. Under our proposed evaluation framework, however, data augmentation showed only limited ability to improve results.
CONCLUSIONS
EFIDL is better suited for evaluating the prediction performance of machine learning methods when data are augmented. Using unsuitable evaluation frameworks can lead to false results. Future researchers should consider the evaluation framework we proposed when working with augmented datasets. Our experiments showed that data augmentation does not significantly improve machine learning prediction performance.