Comparison of Methods for Feature Selection in Clustering of High-Dimensional RNA-Sequencing Data to Identify Cancer Subtypes

Published: 2021-02-24 Volume: 12
ISSN: 1664-8021
Container-title: Frontiers in Genetics
Short-container-title: Front. Genet.

Author:

Källberg David, Vidman Linda, Rydén Patrik

Abstract

Cancer subtype identification is important to facilitate cancer diagnosis and select effective treatments. Clustering of cancer patients based on high-dimensional RNA-sequencing data can be used to detect novel subtypes, but only a subset of the features (e.g., genes) contains information related to the cancer subtype. Therefore, it is reasonable to assume that the clustering should be based on a set of carefully selected features rather than all features. Several feature selection methods have been proposed, but how and when to use these methods are still poorly understood. Thirteen feature selection methods were evaluated on four human cancer data sets, all with known subtypes (gold standards), which were only used for evaluation. The methods were characterized by considering the mean expression and standard deviation (SD) of the selected genes, the overlap with other methods, and their clustering performance, obtained by comparing the clustering result with the gold standard using the adjusted Rand index (ARI). The results were compared to a supervised approach as a positive control and two negative controls in which either a random selection of genes or all genes were included. For all data sets, the best feature selection approach outperformed the negative control, and for two data sets the gain was substantial, with ARI increasing from (−0.01, 0.39) to (0.66, 0.72), respectively. No feature selection method completely outperformed the others, but using the dip-test statistic to select 1000 genes was overall a good choice. The commonly used approach, where genes with the highest SDs are selected, did not perform well in our study.
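
The two building blocks named in the abstract, SD-based gene selection and ARI-based evaluation against a gold standard, can be illustrated with a minimal sketch. This is not the authors' code: the function names `top_sd_genes` and `adjusted_rand_index`, the dict-of-lists data layout, and the toy values are all assumptions made for illustration, and the ARI is written in its standard pair-counting form.

```python
from collections import Counter
from math import comb
from statistics import stdev


def top_sd_genes(expression, k):
    """Return the k gene names with the largest sample SD across patients.

    `expression` maps a gene name to its list of per-patient values
    (a hypothetical layout chosen for this sketch).
    """
    ranked = sorted(expression, key=lambda g: stdev(expression[g]), reverse=True)
    return ranked[:k]


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between a gold-standard and a predicted partition."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0
    # Number of sample pairs that share a contingency cell, a true class,
    # or a predicted cluster, respectively.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / total_pairs  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case (e.g. both partitions are a single cluster).
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy data: three genes measured on three patients (made-up values).
toy_expression = {"A": [1, 1, 1], "B": [0, 5, 10], "C": [1, 2, 3]}
selected = top_sd_genes(toy_expression, 2)

# A clustering that matches the gold standard up to relabeling.
ari_perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```

The ARI is invariant to how cluster labels are named, which is why it suits comparing an unsupervised clustering with known subtypes; values near 0 indicate chance-level agreement and 1 indicates identical partitions.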

Funder

Vetenskapsrådet

Publisher

Frontiers Media SA

Subject

Genetics (clinical), Genetics, Molecular Medicine

References: 40 articles.

1. A comparative study of feature selection and classification methods for gene expression data of glioma; Abusamra; Procedia Comput. Sci., 2013

2. Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection; Ang; IEEE/ACM Trans. Comput. Biol. Bioinform., 2016

3. A comparative performance evaluation of supervised feature selection algorithms on microarray datasets; Arun Kumar; Procedia Comput. Sci., 2017

4. Comprehensive characterization of cancer driver genes and mutations; Bailey; Cell, 2018

5. mixtools: an R package for analyzing finite mixture models; Benaglia; J. Stat. Softw., 2009

Cited by 13 articles.

1. Tumor Subtype Classification Tool for HPV-associated Head and Neck Cancers; 2024-07-10

2. Classification of Long Non-Coding RNAs Between Early and Late Stage of Liver Cancers From Non-coding RNA Profiles Using Machine-Learning Approach; Bioinformatics and Biology Insights; 2024-01

3. Comparison of Feature Selection Methods on Medical Record Data; 2023 International Conference on Modeling & E-Information Research, Artificial Learning and Digital Applications (ICMERALDA); 2023-11-24

4. Improved gene expression diagnosis via cascade entropy-fisher score and ensemble classifiers; Multimedia Tools and Applications; 2023-10-23

5. Comparison of cancer subtype identification methods combined with feature selection methods in omics data analysis; BioData Mining; 2023-07-07