Iterative feature removal yields highly discriminative pathways

Authors

O’Hara Stephen, Wang Kun, Slayden Richard A, Schenkel Alan R, Huber Greg, O’Hern Corey S, Shattuck Mark D, Kirby Michael

Abstract

Background: We introduce Iterative Feature Removal (IFR) as an unbiased approach for selecting features with diagnostic capacity from large data sets. The algorithm is based on recently developed tools in machine learning that are driven by sparse feature selection goals. When applied to genomic data, our method is designed to identify genes that can provide deeper insight into complex interactions while remaining directly connected to diagnostic utility. We contrast this approach with the search for a minimal best set of discriminative genes, which can provide only an incomplete picture of the biological complexity.

Results: Microarray data sets typically contain far more features (genes) than samples. For this type of data, we demonstrate that there are many equivalently predictive subsets of genes. We iteratively train a classifier using features identified via a sparse support vector machine. At each iteration, we remove all the features that were previously selected. We found that we could iterate many times before a sustained drop in accuracy occurred, with each iteration removing approximately 30 genes from consideration. The classification accuracy on test data remains essentially flat even as hundreds of top genes are removed. Our method identifies sets of genes that are highly predictive, even when composed of genes that individually are not. Through automated and manual analysis of the selected genes, we demonstrate that the selected features expose relevant pathways that other approaches would have missed.

Conclusions: Our results challenge the paradigm of using feature selection techniques to design parsimonious classifiers from microarray and similar high-dimensional, small-sample-size data sets. The fact that there are many subsets of genes that work equally well to classify the data provides a strong counter-result to the notion that there is a small number of “top genes” that should be used to build classifiers. In our results, the best classifiers were formed using genes with limited univariate power, thus illustrating that deeper mining of features using multivariate techniques is important.
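
The loop described in the abstract (train a sparse classifier, record the features it selects, remove them, repeat) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the sparse support vector machine is approximated by a hinge loss with an L1 penalty fitted by proximal subgradient descent, the data are synthetic, and every name and parameter (`train_sparse_linear`, `iterative_feature_removal`, `lam`, `lr`, `min_acc`, the single-threshold stopping rule) is an assumption made for the example.

```python
import random

def train_sparse_linear(X, y, feats, lam=0.2, lr=0.01, epochs=200):
    """Hinge loss + L1 penalty, fitted by proximal subgradient descent.
    A stand-in for a sparse SVM solver (assumption, not the paper's solver)."""
    w = {j: 0.0 for j in feats}
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad = {j: 0.0 for j in feats}
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(w[j] * xi[j] for j in feats) + b
            if yi * score < 1.0:  # margin violated: hinge subgradient is active
                for j in feats:
                    grad[j] -= yi * xi[j] / n
                grad_b -= yi / n
        for j in feats:
            v = w[j] - lr * grad[j]
            # Soft-thresholding (proximal step for L1) sets small weights to zero.
            shrunk = max(abs(v) - lr * lam, 0.0)
            w[j] = shrunk if v > 0 else -shrunk
        b -= lr * grad_b
    return w, b

def accuracy(X, y, w, b):
    """Fraction of samples whose predicted sign matches the +1/-1 label."""
    correct = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xi[j] for j, wj in w.items()) + b
        correct += (1 if score >= 0 else -1) == yi
    return correct / len(X)

def iterative_feature_removal(X_tr, y_tr, X_te, y_te, n_features,
                              max_iters=8, min_acc=0.7):
    """Repeatedly fit a sparse model and remove the features it selected.
    Stopping at the first accuracy below min_acc is a simplification."""
    remaining = list(range(n_features))
    history = []  # one (selected feature indices, test accuracy) pair per iteration
    for _ in range(max_iters):
        w, b = train_sparse_linear(X_tr, y_tr, remaining)
        selected = [j for j in remaining if w[j] != 0.0]
        if not selected:
            break
        acc = accuracy(X_te, y_te, w, b)
        history.append((selected, acc))
        if acc < min_acc:
            break
        chosen = set(selected)
        remaining = [j for j in remaining if j not in chosen]
    return history

def make_data(n, n_features, n_informative, rng):
    """Synthetic two-class data with many redundant informative features."""
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        X.append([rng.gauss(label if j < n_informative else 0.0, 1.0)
                  for j in range(n_features)])
        y.append(label)
    return X, y

rng = random.Random(0)
X_train, y_train = make_data(30, 40, 20, rng)
X_test, y_test = make_data(30, 40, 20, rng)
history = iterative_feature_removal(X_train, y_train, X_test, y_test, n_features=40)
for k, (selected, acc) in enumerate(history, 1):
    print(f"iteration {k}: removed {len(selected)} features, test accuracy {acc:.2f}")
```

Because each iteration trains only on the features that survived earlier rounds, the selected sets are disjoint by construction, which is what allows them to be compared as separate, potentially equally predictive gene subsets. On real microarray data one would presumably use a dedicated sparse SVM solver and a stopping rule based on a sustained drop in accuracy rather than a single threshold.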

Publisher

Springer Science and Business Media LLC

Subject

Genetics, Biotechnology


