The effects of data leakage on connectome-based machine learning models

Published: 2023-06-11

Author:

Rosenblatt Matthew, Tejavibulya Link, Jiang Rongtao, Noble Stephanie, Scheinost Dustin

Abstract

Predictive modeling has now become a central technique in neuroimaging to identify complex brain-behavior relationships and test their generalizability to unseen data. However, data leakage, which unintentionally breaches the separation between data used to train and test the model, undermines the validity of predictive models. Previous literature suggests that leakage is generally pervasive in machine learning, but few studies have empirically evaluated the effects of leakage in neuroimaging data. Although leakage is always an incorrect practice, understanding the effects of leakage on neuroimaging predictive models provides insight into the extent to which leakage may affect the literature. Here, we investigated the effects of leakage on machine learning models in two common neuroimaging modalities, functional and structural connectomes. Using over 400 different pipelines spanning four large datasets and three phenotypes, we evaluated five forms of leakage fitting into three broad categories: feature selection, covariate correction, and lack of independence between subjects. As expected, leakage via feature selection and repeated subjects drastically inflated prediction performance. Notably, other forms of leakage had only minor effects (e.g., leaky site correction) or even decreased prediction performance (e.g., leaky covariate regression). In some cases, leakage affected not only prediction performance, but also model coefficients, and thus neurobiological interpretations. Finally, we found that predictive models using small datasets were more sensitive to leakage. Overall, our results illustrate the variable effects of leakage on prediction pipelines and underscore the importance of avoiding data leakage to improve the validity and reproducibility of predictive modeling.
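To make the feature-selection form of leakage concrete, the following is a minimal sketch on synthetic data; it is not the authors' pipeline. It assumes pure-noise "connectome" features, a simple summary-score linear model loosely in the spirit of connectome-based prediction, and hypothetical helper names (`select_features`, `cross_validate`). Because the features carry no true signal, any clearly positive cross-validated correlation in the leaky run would be an artifact of selecting features on the full dataset before splitting.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def select_features(X, y, rows, k):
    """Pick the k features most correlated with y, using only the given rows."""
    ys = [y[i] for i in rows]
    corrs = [(pearson([X[i][j] for i in rows], ys), j) for j in range(len(X[0]))]
    top = sorted(corrs, key=lambda c: abs(c[0]), reverse=True)[:k]
    # Keep each feature index together with the sign of its correlation.
    return [(j, 1.0 if r >= 0 else -1.0) for r, j in top]


def cross_validate(X, y, k_features, n_folds, leaky):
    """Return the correlation between pooled held-out predictions and y.

    leaky=True selects features on ALL subjects (test folds included);
    leaky=False selects features inside each training fold only.
    """
    n = len(y)
    all_rows = list(range(n))
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    preds = [0.0] * n
    for test in folds:
        test_set = set(test)
        train = [i for i in all_rows if i not in test_set]
        feats = select_features(X, y, all_rows if leaky else train, k_features)

        def score(i):
            # Signed sum of the selected features for subject i.
            return sum(sign * X[i][j] for j, sign in feats)

        # Fit a one-variable least-squares line on the training subjects.
        s_train = [score(i) for i in train]
        y_train = [y[i] for i in train]
        ms, my = statistics.fmean(s_train), statistics.fmean(y_train)
        var = sum((s - ms) ** 2 for s in s_train)
        slope = sum((s - ms) * (t - my) for s, t in zip(s_train, y_train)) / var
        for i in test:
            preds[i] = my + slope * (score(i) - ms)
    return pearson(preds, y)


# Synthetic data (an assumption for illustration): 60 subjects, 1000 noise
# features, and a phenotype that is unrelated to every feature.
rng = random.Random(0)
n_subjects, n_features = 60, 1000
X = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_subjects)]
y = [rng.gauss(0, 1) for _ in range(n_subjects)]

leaky_r = cross_validate(X, y, k_features=20, n_folds=5, leaky=True)
clean_r = cross_validate(X, y, k_features=20, n_folds=5, leaky=False)
print(f"leaky feature selection:   r = {leaky_r:.2f}")
print(f"fold-wise feature selection: r = {clean_r:.2f}")
```

The only difference between the two runs is which rows are passed to `select_features`. Keeping every data-dependent step (selection, covariate regression, site correction) inside the training fold is the general pattern this sketch is meant to illustrate.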

Publisher

Cold Spring Harbor Laboratory


Cited by 2 articles.

1. Exposing Data Leakage in Wi-Fi CSI-Based Human Action Recognition: A Critical Analysis;Inventions;2024-08-15

2. Excellence is a habit: Enhancing predictions of language impairment by identifying stable features in clinical perfusion scans;2023-09-15