Confound-leakage: confound removal in machine learning leads to leakage

Author:

Hamdan Sami (1, 2); Love Bradley C (3, 4, 5); von Polier Georg G (1, 6, 7); Weis Susanne (1, 2); Schwender Holger (8); Eickhoff Simon B (1, 2); Patil Kaustubh R (1, 2)

Affiliation:

1. Institute of Neuroscience and Medicine, Brain and Behaviour (INM-7), Forschungszentrum Jülich, 52428 Jülich, Germany

2. Institute of Systems Neuroscience, Medical Faculty, Heinrich-Heine University Düsseldorf, 40225 Düsseldorf, Germany

3. Department of Experimental Psychology, University College London, WC1H 0AP London, UK

4. The Alan Turing Institute, London NW1 2DB, UK

5. European Lab for Learning & Intelligent Systems (ELLIS), WC1E 6BT, London, UK

6. Department of Child and Adolescent Psychiatry, Psychosomatics and Psychotherapy, University Hospital Frankfurt, 60528 Frankfurt, Germany

7. Department of Child and Adolescent Psychiatry, Psychosomatics and Psychotherapy, RWTH Aachen University, 52074 Aachen, Germany

8. Institute of Mathematics, Heinrich-Heine University Düsseldorf, 40225 Düsseldorf, Germany

Abstract

Background: Machine learning (ML) approaches are a crucial component of modern data analysis in many fields, including epidemiology and medicine. Nonlinear ML methods often achieve accurate predictions, for instance, in personalized medicine, as they are capable of modeling complex relationships between features and the target. Problematically, ML models and their predictions can be biased by confounding information present in the features. To remove this spurious signal, researchers often employ featurewise linear confound regression (CR). While this is considered a standard approach for dealing with confounding, possible pitfalls of using CR in ML pipelines are not fully understood.

Results: We provide new evidence that, contrary to general expectations, linear confound regression can increase the risk of confounding when combined with nonlinear ML approaches. Using a simple framework that uses the target as a confound, we show that information leaked via CR can increase null or moderate effects to near-perfect prediction. By shuffling the features, we provide evidence that this increase is indeed due to confound-leakage and not due to the revealing of information. We then demonstrate the danger of confound-leakage in a real-world clinical application where the accuracy of predicting attention-deficit/hyperactivity disorder from speech-derived features is overestimated when depression is used as a confound.

Conclusions: Mishandling or even amplifying confounding effects when building ML models due to confound-leakage, as shown, can lead to untrustworthy, biased, and unfair predictions. Our exposé of the confound-leakage pitfall and the guidelines we provide for dealing with it can help create more robust and trustworthy ML models.
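To make the procedure named in the abstract concrete, the following is a minimal sketch of featurewise linear confound regression in Python with NumPy. It is an illustration under stated assumptions, not the authors' implementation: the function names `fit_confound_regression` and `remove_confound`, the synthetic data, and the fixed train/test split are all hypothetical. The sketch only prepares the data for the target-as-confound setting and for the shuffled-feature control; it does not train a model or reproduce any result from the paper.

```python
import numpy as np

def fit_confound_regression(X_train, c_train):
    """Fit one least-squares model per feature column: x_j ~ a_j + b_j * c."""
    # Design matrix: intercept column plus the single confound column.
    C = np.column_stack([np.ones(len(c_train)), c_train])
    # lstsq solves all feature columns at once; coefs has shape (2, n_features).
    coefs, *_ = np.linalg.lstsq(C, X_train, rcond=None)
    return coefs

def remove_confound(X, c, coefs):
    """Subtract the fitted confound contribution from every feature."""
    C = np.column_stack([np.ones(len(c)), c])
    return X - C @ coefs

# Synthetic data (assumption): features drawn independently of a binary target.
rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n).astype(float)
X = rng.normal(size=(n, 3))

# Framework described in the abstract: the target itself is used as the confound.
confound = y.copy()

# Fit the confound models on training samples only, then apply them to both splits.
train, test = np.arange(0, 150), np.arange(150, n)
coefs = fit_confound_regression(X[train], confound[train])
X_train_cr = remove_confound(X[train], confound[train], coefs)
X_test_cr = remove_confound(X[test], confound[test], coefs)

# Shuffled-feature control: permuting the rows of X breaks any real link between
# features and target before confound regression is applied.
X_shuffled = X[rng.permutation(n)]
```

In this sketch the confound models are fitted on the training split only and then applied to the held-out split, so the residualized training features are linearly uncorrelated with the confound by construction. A nonlinear learner would then be trained on `X_train_cr` and evaluated on `X_test_cr`; that step is left out here.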

Funder

Deutsche Forschungsgemeinschaft

Publisher

Oxford University Press (OUP)

Subject

Computer Science Applications,Health Informatics
