Abstract
Cancer is the second leading cause of disease-related death worldwide, and machine learning-based identification of novel biomarkers is crucial for improving early detection and treatment of various cancers. A key challenge in applying machine learning to high-dimensional data is deriving important features in an interpretable manner to provide meaningful insights into the underlying biological mechanisms. We developed a class-based directional feature importance (CLIFI) metric for decision tree methods and demonstrated its use on The Cancer Genome Atlas proteomics data. We incorporated this metric into four algorithms: Random Forest (RF), LAtent VAriable Stochastic Ensemble of Trees (LAVASET), Gradient Boosted Decision Trees (GBDTs), and a new extension that incorporates the LAVA step into GBDTs (LAVABOOST). Both LAVA methods incorporate topological information from protein interactions into the decision function. In classifying 28 cancer types, the models achieved F1-scores of 93% (RF), 92% (LAVASET), 89% (LAVABOOST) and 86% (GBDT), with no method outperforming all others for individual cancer type prediction. The CLIFI metric allowed visualisation of the models' decision-making functions, and the resulting distributions indicated heterogeneity in several proteins (MYH11, ERα, BCL2) across cancer types (including brain glioma, breast, kidney, thyroid and prostate cancer). We have developed an integrated, directional feature importance metric for multi-class decision tree-based classification models that facilitates interpretable feature importance assessment. The CLIFI metric can be used in conjunction with incorporating topological information into the decision functions of models to add inductive bias for improved interpretability. Availability: The data curation code is available from https://github.com/EloisaRL/TCGA-proteomics-pipeline and the LAVASET (v1.0) package from https://github.com/melkasapi/LAVASET.
Publisher
Cold Spring Harbor Laboratory