Author:
O’Hara Stephen, Wang Kun, Slayden Richard A, Schenkel Alan R, Huber Greg, O’Hern Corey S, Shattuck Mark D, Kirby Michael
Abstract
Background
We introduce Iterative Feature Removal (IFR) as an unbiased approach for selecting features with diagnostic capacity from large data sets. The algorithm is based on recently developed tools in machine learning that are driven by sparse feature selection goals. When applied to genomic data, our method is designed to identify genes that can provide deeper insight into complex interactions while remaining directly connected to diagnostic utility. We contrast this approach with the search for a minimal best set of discriminative genes, which can provide only an incomplete picture of the biological complexity.
Results
Microarray data sets typically contain far more features (genes) than samples. For this type of data, we demonstrate that there are many equivalently predictive subsets of genes. We iteratively train a classifier using features identified via a sparse support vector machine. At each iteration, we remove all the features that were selected in previous iterations, so the classifier must be trained on the remaining features only. We found that we could iterate many times before a sustained drop in accuracy occurred, with each iteration removing approximately 30 genes from consideration. The classification accuracy on test data remained essentially flat even as hundreds of top genes were removed.
Our method identifies sets of genes that are highly predictive, even when composed of genes that individually are not. Through automated and manual analysis of the selected genes, we demonstrate that the selected features expose relevant pathways that other approaches would have missed.
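The iterative procedure described in the Results can be illustrated with a minimal, dependency-free Python sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the sparse classifier is an L1-penalized squared-hinge linear model fitted by proximal gradient descent (standing in for the paper's sparse support vector machine), the data are synthetic with redundant informative features, and the names `fit_sparse_linear`, `iterative_feature_removal`, and `make_data`, along with all parameter values, are hypothetical.

```python
import random


def soft_threshold(v, t):
    """Shrink v toward zero by t (the proximal step for an L1 penalty)."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def fit_sparse_linear(X, y, active, lam=0.5, lr=0.02, epochs=150):
    """L1-penalized squared-hinge linear classifier restricted to `active` features.

    Assumed stand-in for a sparse SVM; lam, lr and epochs are illustrative values.
    """
    w = {j: 0.0 for j in active}
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad = {j: 0.0 for j in active}
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(w[j] * xi[j] for j in active) + b)
            if margin < 1.0:  # only margin violators contribute to the loss
                c = -2.0 * (1.0 - margin) * yi / n
                for j in active:
                    grad[j] += c * xi[j]
                grad_b += c
        for j in active:
            # Gradient step followed by soft-thresholding drives weights to exactly zero.
            w[j] = soft_threshold(w[j] - lr * grad[j], lr * lam)
        b -= lr * grad_b
    return w, b


def accuracy(w, b, X, y):
    """Fraction of samples whose predicted sign matches the label."""
    hits = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xi[j] for j, wj in w.items()) + b
        hits += (score >= 0) == (yi > 0)
    return hits / len(X)


def iterative_feature_removal(X_tr, y_tr, X_te, y_te, max_iters=5):
    """Repeatedly fit a sparse model, record it, and remove the features it selected."""
    active = set(range(len(X_tr[0])))
    history = []
    for _ in range(max_iters):
        if not active:
            break
        w, b = fit_sparse_linear(X_tr, y_tr, sorted(active))
        selected = {j for j, wj in w.items() if wj != 0.0}
        if not selected:
            break  # the sparse model found nothing useful among the remaining features
        history.append((selected, accuracy(w, b, X_te, y_te)))
        active -= selected  # these features are excluded from all later iterations
    return history


def make_data(n, n_informative=12, n_noise=12, seed=0):
    """Synthetic two-class data: many weakly informative features plus pure noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = 1 if i % 2 == 0 else -1
        row = [0.8 * label + rng.gauss(0, 1) for _ in range(n_informative)]
        row += [rng.gauss(0, 1) for _ in range(n_noise)]
        X.append(row)
        y.append(label)
    return X, y


X_tr, y_tr = make_data(30, seed=0)
X_te, y_te = make_data(30, seed=1)
history = iterative_feature_removal(X_tr, y_tr, X_te, y_te)
for k, (selected, acc) in enumerate(history, 1):
    print(f"iteration {k}: {len(selected)} features selected, test accuracy {acc:.2f}")
```

Each entry in `history` pairs the feature set chosen in one iteration with the test accuracy of the model built on it. Plotting accuracy against iteration number is one way to look for the sustained drop the abstract refers to, although the behavior on this toy data should not be read as reproducing the paper's results.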
Conclusions
Our results challenge the paradigm of using feature selection techniques to design parsimonious classifiers from microarray and similar high-dimensional, small-sample-size data sets. The fact that there are many subsets of genes that work equally well to classify the data provides a strong counter-result to the notion that there is a small number of “top genes” that should be used to build classifiers. In our results, the best classifiers were formed using genes with limited univariate power, thus illustrating that deeper mining of features using multivariate techniques is important.
Publisher
Springer Science and Business Media LLC
Cited by
16 articles.