BACKGROUND
Data imbalance is a critical issue in big data analysis, particularly for datasets in which the class of interest is rare, such as real-world healthcare data, spam detection corpora, and financial fraud detection datasets. Numerous data balancing methods have been proposed to improve machine learning performance, and prior research claims that the Synthetic Minority Over-sampling Technique (SMOTE) and SMOTE-based data augmentation methods can improve algorithm performance. However, we observed that many online tutorials evaluate these methods on synthesized datasets, which biases the evaluation process and produces falsely positive performance improvements.
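For concreteness, SMOTE-style oversampling synthesizes new minority-class samples by interpolating between existing minority samples and their nearest minority-class neighbors. The sketch below is a minimal illustration using the imbalanced-learn library; the toy dataset and parameters are illustrative assumptions and are not the data or settings used in this study.

```python
# Minimal illustration of SMOTE oversampling (imbalanced-learn).
# The toy dataset and parameters are illustrative only, not those used in this study.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Build a toy 2-class dataset with roughly a 5% minority class.
X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.95, 0.05], random_state=42
)
print("original class counts:", Counter(y))

# SMOTE interpolates between minority samples and their k nearest
# minority-class neighbors to create synthetic minority examples.
X_res, y_res = SMOTE(k_neighbors=5, random_state=42).fit_resample(X, y)
print("resampled class counts:", Counter(y_res))
```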
OBJECTIVE
In this study, we propose a new evaluation framework for imbalanced data learning methods and experiment with five data balancing techniques to assess their impact on machine learning algorithm performance.
METHODS
We collected 8 imbalanced real-world healthcare datasets, spanning different domains and imbalance rates. We applied 6 data augmentation methods in conjunction with 11 machine learning techniques to test whether data augmentation improves machine learning performance. Our proposed Evaluation Framework for Imbalanced Data Learning (EFIDL) uses 5-fold cross-validation; we compared results obtained under traditional data augmentation evaluation methods with those obtained under EFIDL.
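The sketch below contrasts a leakage-prone evaluation, in which the whole dataset is augmented before cross-validation, with a leakage-free 5-fold evaluation that augments only the training folds and tests on untouched, still-imbalanced data. This is a minimal sketch of our reading of the two evaluation setups, not the exact EFIDL implementation; the classifier, metric, and SMOTE settings are assumptions chosen for illustration.

```python
# Sketch: two ways of evaluating SMOTE under 5-fold cross-validation.
# Classifier, metric, and SMOTE settings are illustrative assumptions;
# this is not the exact EFIDL implementation.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(
    n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# (1) Leakage-prone: oversample first, then cross-validate. Synthetic points
# derived from test-fold neighbors leak into the training folds, inflating
# the apparent performance.
X_aug, y_aug = SMOTE(random_state=0).fit_resample(X, y)
leaky_scores = []
for train_idx, test_idx in cv.split(X_aug, y_aug):
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_aug[train_idx], y_aug[train_idx])
    leaky_scores.append(f1_score(y_aug[test_idx], clf.predict(X_aug[test_idx])))

# (2) Leakage-free: oversample inside each training fold only and evaluate on
# untouched, still-imbalanced test folds.
clean_scores = []
for train_idx, test_idx in cv.split(X, y):
    pipe = Pipeline([
        ("smote", SMOTE(random_state=0)),
        ("clf", LogisticRegression(max_iter=1000)),
    ])
    pipe.fit(X[train_idx], y[train_idx])  # SMOTE applied to the training fold only
    clean_scores.append(f1_score(y[test_idx], pipe.predict(X[test_idx])))

print("leakage-prone mean F1:", np.mean(leaky_scores))
print("leakage-free mean F1: ", np.mean(clean_scores))
```

In the leakage-free setup the sampler lives inside an imbalanced-learn Pipeline, so resampling is fitted only on the training portion of each fold and the test fold retains its original class distribution.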
RESULTS
Traditional data augmentation evaluation methods can give a false impression of improved machine learning performance. Under our proposed evaluation framework, however, data augmentation showed only limited ability to improve results.
CONCLUSIONS
EFIDL is better suited for evaluating the prediction performance of machine learning methods when data are augmented. Using unsuitable evaluation frameworks can lead to false results. Future researchers should consider the evaluation framework we proposed when working with augmented datasets. Our experiments showed that data augmentation does not significantly improve machine learning prediction performance.