Topological embedding and directional feature importance in ensemble classifiers for multi-class classification

Published: 2024-08-04

Author:

Rocha Liedl Eloisa, Yassin Shabeer Mohamed, Kasapi Melpomeni, Posma Joram M.

Abstract

Cancer is the second leading cause of disease-related death worldwide, and machine learning-based identification of novel biomarkers is crucial for improving early detection and treatment of various cancers. A key challenge in applying machine learning to high-dimensional data is deriving important features in an interpretable manner to provide meaningful insights into the underlying biological mechanisms.

We developed a class-based directional feature importance (CLIFI) metric for decision tree methods and demonstrated its use on The Cancer Genome Atlas proteomics data. We incorporated this metric into four algorithms: Random Forest (RF), LAtent VAriable Stochastic Ensemble of Trees (LAVASET), Gradient Boosted Decision Trees (GBDTs), and a new extension incorporating the LAVA step into GBDTs (LAVABOOST). Both LAVA methods incorporate topological information from protein interactions into the decision function.

In classifying 28 cancers, the models achieved F1-scores of 93% (RF), 92% (LAVASET), 89% (LAVABOOST) and 86% (GBDT), with no method outperforming all others for individual cancer type prediction. The CLIFI metric allowed visualisation of the models' decision-making functions, and the distributions indicated heterogeneity in several proteins (MYH11, ERα, BCL2) for different cancer types (including brain glioma, breast, kidney, thyroid and prostate cancer).

We have developed an integrated, directional feature importance metric for multi-class decision tree-based classification models that facilitates interpretable feature importance assessment. The CLIFI metric can be used in conjunction with incorporating topological information into the decision functions of models to add inductive bias for improved interpretability.

Availability: All code for data curation is available from https://github.com/EloisaRL/TCGA-proteomics-pipeline and the LAVASET (v1.0) package from https://github.com/melkasapi/LAVASET.
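The abstract compares models both by an overall F1-score and by individual cancer type. As a minimal sketch of how such class-wise scores can be computed, the following treats each class one-vs-rest and then averages them without weighting (macro F1). The abstract does not state which averaging the authors used, so the macro average here is an assumption; the function names and the toy labels are hypothetical and are not taken from the paper or from the LAVASET package.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, scoring each label one-vs-rest."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the truth but was missed
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


# Toy labels for illustration only (not data from the paper)
y_true = ["breast", "breast", "kidney", "kidney", "thyroid", "thyroid"]
y_pred = ["breast", "kidney", "kidney", "kidney", "thyroid", "breast"]

for label, score in per_class_f1(y_true, y_pred).items():
    print(f"{label}: {score:.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

Looking at the per-class dictionary rather than only the average is what makes it possible to see that one model can lead overall while another leads on a particular cancer type.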

Publisher

Cold Spring Harbor Laboratory
