Improving the performance and interpretability on medical datasets using graphical ensemble feature selection

Author:

Battistella Enzo (1), Ghiassian Dina (2), Barabási Albert-László (1, 3, 4)

Affiliation:

1. Network Science Institute, Northeastern University, Boston, MA 02115, United States

2. Scipher Medicine, Waltham, MA 02453, United States

3. Department of Data and Network Science, Central Eastern University, Budapest 1051, Hungary

4. Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, United States

Abstract

Motivation: A major obstacle to using Machine Learning (ML) on medical datasets is the mismatch between a large number of variables and small sample sizes. Multiple feature selection techniques have been proposed to avoid the resulting overfitting, and among them ensemble techniques offer the best selection robustness overall. Yet current methods designed to combine different algorithms generally fail to leverage the dependencies identified by their components. Here, we propose Graphical Ensembling (GE), a graph-theory-based ensemble feature selection technique designed to improve the stability and relevance of the selected features.

Results: Relying on four datasets, we show that GE increases classification performance while selecting fewer features. For example, on rheumatoid arthritis patient stratification, GE outperforms the baseline methods by 9% in Balanced Accuracy while relying on fewer features. Using data on sub-cellular networks, we show that the selected features (proteins) are closer to the known disease genes and that the uncovered biological mechanisms are more diverse. Because GE successfully tackles the complex correlations between biological variables, we anticipate that it will improve the medical applications of ML.

Availability and implementation: https://github.com/ebattistella/auto_machine_learning
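The Results above are reported in Balanced Accuracy, a metric that averages the recall obtained on each class so that a small class weighs as much as a large one. The following is a minimal sketch of that standard definition, not code from the authors' repository; the function name `balanced_accuracy` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Return the mean of per-class recall (standard balanced accuracy)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1

    # Recall of each class, then an unweighted average across classes.
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


# Hypothetical toy data: 8 samples of class 0 and 2 samples of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # 9 of 10 correct
print(balanced_accuracy(y_true, y_pred))  # (8/8 + 1/2) / 2
```

On this toy example plain accuracy is 0.9 while balanced accuracy is 0.75, which shows why the balanced variant is the more conservative choice when classes are unevenly represented, as is common in small medical cohorts.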

Funder

United States Department of Veteran Affairs and Scipher Medicine

Publisher

Oxford University Press (OUP)
