Iterative feature removal yields highly discriminative pathways-Reference-Cited by-同舟云学术

Iterative feature removal yields highly discriminative pathways

Published:2013-11-25 Issue:1 Volume:14 Page:
ISSN:1471-2164
Container-title:BMC Genomics
language:en
Short-container-title:BMC Genomics

Author:

O’Hara Stephen,Wang Kun,Slayden Richard A,Schenkel Alan R,Huber Greg,O’Hern Corey S,Shattuck Mark D,Kirby Michael

Abstract

Abstract Background We introduce Iterative Feature Removal (IFR) as an unbiased approach for selecting features with diagnostic capacity from large data sets. The algorithm is based on recently developed tools in machine learning that are driven by sparse feature selection goals. When applied to genomic data, our method is designed to identify genes that can provide deeper insight into complex interactions while remaining directly connected to diagnostic utility. We contrast this approach with the search for a minimal best set of discriminative genes, which can provide only an incomplete picture of the biological complexity. Results Microarray data sets typically contain far more features (genes) than samples. For this type of data, we demonstrate that there are many equivalently-predictive subsets of genes. We iteratively train a classifier using features identified via a sparse support vector machine. At each iteration, we remove all the features that were previously selected. We found that we could iterate many times before a sustained drop in accuracy occurs, with each iteration removing approximately 30 genes from consideration. The classification accuracy on test data remains essentially flat even as hundreds of top-genes are removed. Our method identifies sets of genes that are highly predictive, even when comprised of genes that individually are not. Through automated and manual analysis of the selected genes, we demonstrate that the selected features expose relevant pathways that other approaches would have missed. Conclusions Our results challenge the paradigm of using feature selection techniques to design parsimonious classifiers from microarray and similar high-dimensional, small-sample-size data sets. The fact that there are many subsets of genes that work equally well to classify the data provides a strong counter-result to the notion that there is a small number of “top genes” that should be used to build classifiers. In our results, the best classifiers were formed using genes with limited univariate power, thus illustrating that deeper mining of features using multivariate techniques is important.

Publisher

Springer Science and Business Media LLC

Subject

Genetics,Biotechnology

Link

https://link.springer.com/content/pdf/10.1186/1471-2164-14-832.pdf

Reference31 articles.

1. Xing EP, Jordan MI, Karp RM: Feature selection for high-dimensional genomic microarray data. Proc. International Conference on Machine Learning (ICML). 2001, 601-608.

2. Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Mach Learn. 2002, 46 (1-3): 389-422.

3. Yu L, Liu H: Redundancy based feature selection for microarray data. Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2004, 737-737.

4. Yeung KY, Bumgarner RE, Raftery AE: Bayesian model averaging: development of an improved multi-class, gene selection and classification tool for microarray data. Bioinformatics. 2005, 21 (10): 2394-2402. 10.1093/bioinformatics/bti319.

5. Sun Y, Li J: Iterative RELIEF for feature weighting. Proc. International Conference on Machine Learning (ICML). 2006, 913-920.

Cited by 16 articles. 订阅此论文施引文献订阅此论文施引文献，注册后可以免费订阅5篇论文的施引文献，订阅后可以查看论文全部施引文献

1. A Multi-domain Multi-task Approach for Feature Selection from Bulk RNA Datasets;Lecture Notes in Computer Science;2024

2. Feature Selection on Big Data using Masked Sparse Bottleneck Centroid-Encoder;2023 IEEE International Conference on Big Data (BigData);2023-12-15

3. Sparse Linear Centroid-Encoder: A Biomarker Selection tool for High Dimensional Biological Data;2023 IEEE International Conference on Bioinformatics and Biomedicine (BIBM);2023-12-05

4. Nonlinear feature selection using sparsity-promoted centroid-encoder;Neural Computing and Applications;2023-08-22

5. Using machine learning to determine the time of exposure to infection by a respiratory pathogen;Scientific Reports;2023-04-01