G-S-M: A Comprehensive Framework for Integrative Feature Selection in Omics Data Analysis and Beyond

Published: 2024-04-01

Author:

Yousef Malik, Allmer Jens, İnal Yasin, Gungor Burcu Bakir

Abstract

The treatment of human diseases is a major research question in many fields related to medicine. It has become clear that patient stratification is essential for patients to receive the best possible treatment, and bio/disease markers are critical to achieving stratification. Markers can come from many different sources, such as genomics, transcriptomics, and proteomics. Establishing markers from such measurements often involves data analysis, machine learning, and feature selection. Traditional feature selection techniques typically estimate the importance or significance of each feature individually by assigning it a score, disregarding inter-feature relationships. In contrast, the G-S-M (Grouping-Scoring-Modeling) approach treats a group of features as a set that is organized based on prior knowledge. This approach takes the interdependence among features into account, providing a more meaningful evaluation of feature relevance and utility. Prior knowledge can encompass compiled information such as microRNA-target interactions and protein-protein interactions. Here we present a new tool called G-S-M that generalizes our previous works, such as maTE, CogNet, and PriPath. The G-S-M tool combines machine learning and prior knowledge to group and score features based on their association with a binary-labeled target, such as control versus disease. This approach is unique in that computational and domain knowledge are utilized concurrently. Embedded feature selection, which repeatedly employs machine learning during the selection process, results in the identification of the most discriminative groups. Furthermore, by combining machine learning with prior domain knowledge, the G-S-M tool allows a more holistic understanding of the underlying mechanisms of a given system, which can lead to new insights and discoveries.
The implementation of the G-S-M workflow is freely available for download from our GitHub repository: https://github.com/malikyousef/The-G-S-M-Grouping-Scoring-Modeling-Approach. With this generalized approach, we aim to make this kind of feature selection available to a broader audience and hope it will be employed in medical practice. One example built on the G-S-M approach is TextNetTopics, which uses Latent Dirichlet Allocation (LDA) to detect topics of words; those topics then serve as groups. In the future, we aim to extend the approach to incorporate multiple lines of evidence for biomarker detection and patient stratification by combining multi-omics data.
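
The grouping, scoring, and modeling steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The nearest-centroid classifier, the random 70/30 splits, the toy data, and all names (`score_group`, `gsm`, `group_A`, `group_B`) are assumptions chosen to keep the example dependency-free. Groups are given as a mapping from a group name (standing in for prior knowledge, such as the targets of one microRNA) to feature indices. Each group is scored by how well its features alone separate the two classes, and the top-ranked groups are pooled for the final model.

```python
import random
from statistics import mean

def centroid(rows):
    """Column-wise mean of a list of equal-length feature vectors."""
    return [mean(col) for col in zip(*rows)]

def nearest_centroid_accuracy(X_train, y_train, X_test, y_test):
    """Fit a nearest-centroid classifier and return its accuracy on the test set."""
    cents = {label: centroid([x for x, y in zip(X_train, y_train) if y == label])
             for label in set(y_train)}

    def predict(x):
        # Assign the label whose centroid is closest in squared Euclidean distance.
        return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

    return mean(1.0 if predict(x) == y else 0.0 for x, y in zip(X_test, y_test))

def score_group(X, y, feature_idx, n_splits=5, seed=0):
    """Scoring: mean accuracy over random 70/30 splits using only one group's features."""
    rng = random.Random(seed)
    sub = [[row[i] for i in feature_idx] for row in X]
    n = len(sub)
    scores = []
    for _ in range(n_splits):
        order = list(range(n))
        rng.shuffle(order)
        cut = int(0.7 * n)
        train, test = order[:cut], order[cut:]
        scores.append(nearest_centroid_accuracy(
            [sub[i] for i in train], [y[i] for i in train],
            [sub[i] for i in test], [y[i] for i in test]))
    return mean(scores)

def gsm(X, y, groups, top_k=1):
    """Rank groups by score, keep the top_k, and pool their features for modeling."""
    ranked = sorted(groups, key=lambda g: score_group(X, y, groups[g]), reverse=True)
    pooled = sorted({i for g in ranked[:top_k] for i in groups[g]})
    return ranked, pooled

# Toy data (hypothetical): features 0 and 1 differ between classes, 2 and 3 are noise.
rng = random.Random(42)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(4.0 * label, 1.0), rng.gauss(4.0 * label, 1.0),
      rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for label in y]

# Grouping: in practice this mapping would come from prior knowledge.
groups = {"group_A": [0, 1], "group_B": [2, 3]}

ranked, pooled = gsm(X, y, groups, top_k=1)
final_score = score_group(X, y, pooled)  # Modeling: evaluate on the pooled features
print(ranked, pooled, final_score)
```

In this sketch the unit of selection is the group rather than the individual feature, which is the property the abstract contrasts with traditional per-feature scoring. The classifier and the evaluation scheme are placeholders and can be swapped without changing that structure.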

Publisher

Cold Spring Harbor Laboratory

References: 17 articles.

1. DAVID: Database for Annotation, Visualization, and Integrated Discovery

2. STRING: a database of predicted functional associations between proteins

3. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks

4. CogNet: classification of gene expression data based on ranked active-subnetwork-oriented KEGG pathway enrichment analysis

5. maTE: discovering expressed interactions between microRNAs and their targets

Cited by 2 articles.

1. TextNetTopics-SFTS-SBTS: TextNetTopics Scoring Approaches Based Sequential Forward and Backward;Lecture Notes in Computer Science;2024

2. SEMANT - Feature Group Selection Utilizing FastText-Based Semantic Word Grouping, Scoring, and Modeling Approach for Text Classification;Lecture Notes in Computer Science;2024