
Some combinatorics of data leakage induced by clusters

Published: 2024-04-11 Issue: 7 Volume: 38 Pages: 2815-2828
ISSN: 1436-3240
Container-title: Stochastic Environmental Research and Risk Assessment
Language: en
Short-container-title: Stoch Environ Res Risk Assess

Author:

Guignard Fabian, Ginsbourger David, Levy Häner Lilia, Herrera Juan Manuel

Abstract

Data leakage is a common issue that can lead to misleading generalisation error estimation and incorrect hyperparameter tuning. However, its mechanisms are not always well understood. In this work, we consider the case of clustered data and investigate the distribution of the number of elements in leakage when the data set is uniformly split. For both the validation and test sets, the first and second moments of the number of elements in leakage are derived analytically. Modelling consequences are investigated and exemplified on simulated data. In addition, the case of an actual agronomic feasibility study is presented. We demonstrate how data leakage can distort model performance estimation when an inadequate data splitting strategy is used. We provide an understanding of data leakage in the context of clustered data by quantifying its role in predictive modelling. This sheds light on related challenges that may impact the practice in agronomy and beyond.
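To make the counted quantity concrete, here is a minimal Python sketch. It is not the paper's derivation and rests on assumptions of ours: a single held-out set instead of separate validation and test sets, and a held-out element counted as "in leakage" when at least one other member of its cluster lands in the training set. The cluster sizes, the split size, and the function names `count_leakage` and `expected_leakage` are all hypothetical. The sketch estimates the mean number of leaked elements by simulation and compares it with a direct combinatorial expectation under the same assumed definition.

```python
import random
from math import comb


def count_leakage(cluster_ids, n_train, rng):
    """Split indices uniformly at random into train / held-out and count the
    held-out elements whose cluster also appears in the training set
    (assumed definition of "in leakage")."""
    indices = list(range(len(cluster_ids)))
    rng.shuffle(indices)
    train, held_out = indices[:n_train], indices[n_train:]
    train_clusters = {cluster_ids[i] for i in train}
    return sum(cluster_ids[i] in train_clusters for i in held_out)


def expected_leakage(cluster_sizes, n_train):
    """Expected number of held-out elements in leakage under the assumed
    definition. Requires 0 < n_train < total number of elements."""
    n = sum(cluster_sizes)
    n_held = n - n_train
    total = 0.0
    for s in cluster_sizes:
        # Given one element is held out, probability that all of its
        # s - 1 cluster mates are held out as well (so no leakage).
        p_no_mate_in_train = comb(n_held - 1, s - 1) / comb(n - 1, s - 1)
        # s elements, each held out with probability n_held / n.
        total += s * (n_held / n) * (1.0 - p_no_mate_in_train)
    return total


# Hypothetical toy data: 20 elements in 7 clusters, 14 used for training.
cluster_sizes = [5, 5, 3, 3, 2, 1, 1]
cluster_ids = [c for c, s in enumerate(cluster_sizes) for _ in range(s)]
n_train = 14

rng = random.Random(0)
draws = [count_leakage(cluster_ids, n_train, rng) for _ in range(20000)]
mc_mean = sum(draws) / len(draws)
exact = expected_leakage(cluster_sizes, n_train)

print(f"Monte Carlo mean: {mc_mean:.3f}")
print(f"Combinatorial expectation: {exact:.3f}")
```

Under these assumptions the two numbers should agree closely. Singleton clusters never contribute, since they have no other member that could fall in the training set.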

Funder

University of Bern

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s00477-024-02715-1.pdf
