Implications of Data Leakage in Machine Learning Preprocessing: A Multi-Domain Investigation

Published: 2024-07-12

Author:

Bouke Mohamed Aly¹, Zaid Saleh Ali¹, Abdullah Azizol¹

Affiliation:

1. Universiti Putra Malaysia

Abstract

Data leakage during machine learning (ML) preprocessing is a critical issue in which unintended external information skews the training process, producing artificially high performance metrics and undermining model reliability. This study addresses the insufficient exploration of data leakage across diverse ML domains and highlights the need for comprehensive investigation to ensure robust and dependable ML models in real-world applications. Significant discrepancies in model performance due to data leakage were observed, with notable variations in F1 scores and ROC AUC values for the Breast Cancer dataset. Analysis of the Tic-Tac-Toe Endgame dataset revealed a varying impact on models such as Ridge, SGD, GaussianNB, and MLP, underscoring the profound effect of data leakage. On the German Credit Scoring dataset, models such as DT and GB showed slight improvements in recall and F1 scores without data leakage, indicating reduced overfitting. Additionally, models such as PassiveAggressive, Ridge, SGD, GaussianNB, and Nearest Centroid exhibited shifts in performance metrics, highlighting how intricately models respond to data leakage. The study also reports raw data leakage rates, such as 6.79% for Spambase and 1.99% for Breast Cancer. These findings emphasize the need for meticulous data management and validation to mitigate leakage effects, which is crucial for developing reliable ML models.
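The abstract does not say which preprocessing steps were examined. As a generic illustration only, not the authors' experimental setup, the sketch below shows one common form of preprocessing leakage: fitting standardization statistics on the full dataset before the train/test split. The data are synthetic, and the helper names (`split`, `fit_scaler`, `transform`) are hypothetical.

```python
import random
import statistics

def split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split it into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_scaler(values):
    """Return the (mean, population std) used for standardization."""
    return statistics.fmean(values), statistics.pstdev(values)

def transform(values, mean, std):
    """Standardize values with previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Synthetic single-feature dataset (assumption: illustrative values only).
rng = random.Random(42)
data = [rng.gauss(50.0, 10.0) for _ in range(200)]

# Leaky ordering: statistics are fitted on ALL rows, so the held-out
# test rows influence how the training data is scaled.
leaky_mean, leaky_std = fit_scaler(data)

# Leak-free ordering: split first, then fit on the training rows only.
train, test = split(data)
clean_mean, clean_std = fit_scaler(train)

leaky_train_scaled = transform(train, leaky_mean, leaky_std)
clean_train_scaled = transform(train, clean_mean, clean_std)
# The test rows are scaled with the training statistics, never their own.
clean_test_scaled = transform(test, clean_mean, clean_std)

print(f"mean fitted on all rows:   {leaky_mean:.4f}")
print(f"mean fitted on train only: {clean_mean:.4f}")
```

The two fitted means differ because the leaky version has seen the held-out rows. On a toy feature like this the gap is small. The point of the sketch is the ordering of the steps: split first, then fit the preprocessing on the training portion alone.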

Publisher

Springer Science and Business Media LLC
