Author:
Guignard Fabian,Ginsbourger David,Levy Häner Lilia,Herrera Juan Manuel
Abstract
AbstractData leakage is a common issue that can lead to misleading generalisation error estimation and incorrect hyperparameter tuning. However, its mechanisms are not always well understood. In this work, we consider the case of clustered data and investigate the distribution of the number of elements in leakage when the data set is uniformly split. For both the validation and test sets, the first and second moments of the number of elements in leakage are derived analytically. Modelling consequences are investigated and exemplified on simulated data. In addition, the case of an actual agronomic feasibility study is presented. We demonstrate how data leakage can distort model performance estimation when an inadequate data splitting strategy is used. We provide an understanding of data leakage in the context of clustered data by quantifying its role in predictive modelling. This sheds light on related challenges that may impact the practice in agronomy and beyond.
Publisher
Springer Science and Business Media LLC