Characterization of the Effectiveness of Reporting Lists of Small Feature Sets Relative to the Accuracy of the Prior Biological Knowledge

Published:2010-01 Issue: Volume:9 Page:CIN.S4020
ISSN:1176-9351
Container-title:Cancer Informatics
language:en
Short-container-title:Cancer Inform

Author:

Zhao Chen¹,Bittner Michael L.²,Chapkin Robert S.³,Dougherty Edward R.¹²⁴

Affiliation:

1. Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA.

2. Computational Biology Division, Translational Genomics Research Institute, Phoenix, AZ, 85004, USA.

3. Center for Environmental and Rural Health, Texas A&M University, College Station, TX, 77843, USA.

4. Department of Bioinformatics and Computational Biology, University of Texas M.D. Anderson Cancer Center, Houston, TX, 77030, USA.

Abstract

When confronted with a small sample, feature-selection algorithms often fail to find good feature sets, a problem exacerbated for high-dimensional data and large feature sets. The problem is compounded by the fact that, if one obtains a feature set with a low error estimate, the estimate is unreliable because training-data-based error estimators typically perform poorly on small samples, exhibiting optimistic bias or high variance. One way around the problem is to limit the number of features being considered, restrict feature sets to sizes such that all feature sets can be examined by exhaustive search, and report a list of the best-performing feature sets. If the list is short, then it greatly restricts the possible feature sets to be considered as candidates; however, one can expect the lowest error estimates obtained to be optimistically biased, so that there may not be a close-to-optimal feature set on the list. This paper provides a power analysis of this methodology; in particular, it examines the kind of results one should expect to obtain relative to the length of the list and the number of discriminating features among those considered. Two measures are employed. The first is the probability that there is at least one feature set on the list whose true classification error is within some given tolerance of the best feature set, and the second is the expected number of feature sets on the list whose true errors are within the given tolerance of the best feature set. These values are plotted as functions of the list length to generate power curves. The results show that, if the number of discriminating features is not too small (that is, the prior biological knowledge is not too poor), then one should expect, with high probability, to find good feature sets. Availability: companion website at http://gsp.tamu.edu/Publications/supplementary/zhao09a/
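The two measures described in the abstract can be illustrated with a small Monte Carlo sketch. This is not the authors' code or experimental design: the Gaussian model with independent unit-variance features, the nearest-mean classifier, the resubstitution error estimator, the definition of the "best" feature set as the lowest true error among the designed classifiers in each trial, and all names and parameters (`one_trial`, `power_curves`, `n_disc`, `delta`, `tol`) are assumptions chosen only to keep the example short and self-contained.

```python
import itertools
import math
import random


def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def one_trial(rng, n_features, n_disc, delta, n_per_class, set_size):
    """Draw one small sample and return (estimated, true) error per feature set.

    Assumed model: class 0 ~ N(0, I), class 1 ~ N(mu1, I), where only the
    first n_disc features are discriminating (mean shift delta).
    """
    mu1 = [delta if i < n_disc else 0.0 for i in range(n_features)]
    x0 = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
          for _ in range(n_per_class)]
    x1 = [[rng.gauss(mu1[i], 1.0) for i in range(n_features)]
          for _ in range(n_per_class)]
    m0 = [sum(col) / n_per_class for col in zip(*x0)]
    m1 = [sum(col) / n_per_class for col in zip(*x1)]

    results = []
    # Exhaustive search over all feature sets of the given size.
    for s in itertools.combinations(range(n_features), set_size):
        # Nearest-mean classifier: predict class 1 when w.x > c.
        w = [m1[i] - m0[i] for i in s]
        c = sum(wi * (m0[i] + m1[i]) / 2.0 for wi, i in zip(w, s))
        norm = math.sqrt(sum(wi * wi for wi in w))

        def score(x):
            return sum(wi * x[i] for wi, i in zip(w, s))

        # Resubstitution estimate (training-data-based, optimistically biased).
        wrong = sum(score(x) > c for x in x0) + sum(score(x) <= c for x in x1)
        est = wrong / (2 * n_per_class)

        # True error, available in closed form under the assumed model.
        shift = sum(wi * mu1[i] for wi, i in zip(w, s))
        true = 0.5 * phi(-c / norm) + 0.5 * phi((c - shift) / norm)
        results.append((est, true))
    return results


def power_curves(list_lengths, tol, trials, seed=0, **model):
    """Return {L: (P(at least one good set on list), E[# good sets on list])}."""
    rng = random.Random(seed)
    hits = {L: 0 for L in list_lengths}
    counts = {L: 0 for L in list_lengths}
    for _ in range(trials):
        results = one_trial(rng, **model)
        best = min(t for _, t in results)
        rng.shuffle(results)                 # break estimate ties at random
        ranked = sorted(results, key=lambda r: r[0])
        for L in list_lengths:
            good = sum(t <= best + tol for _, t in ranked[:L])
            hits[L] += good > 0
            counts[L] += good
    return {L: (hits[L] / trials, counts[L] / trials) for L in list_lengths}


if __name__ == "__main__":
    curves = power_curves([1, 5, 10, 28], tol=0.02, trials=200,
                          n_features=8, n_disc=3, delta=1.0,
                          n_per_class=10, set_size=2)
    for L, (prob, expected) in curves.items():
        print(L, round(prob, 3), round(expected, 3))
```

Plotting the two returned values against the list length gives curves of the kind the abstract calls power curves. Lowering `n_disc` in this sketch stands in for poorer prior knowledge about which features discriminate.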

Publisher

SAGE Publications

Subject

Cancer Research,Oncology

Link

http://journals.sagepub.com/doi/pdf/10.4137/CIN.S4020

Cited by 5 articles.

1. Machine Learning Uses Chemo-Transcriptomic Profiles to Stratify Antimalarial Compounds With Similar Mode of Action;Frontiers in Cellular and Infection Microbiology;2021-06-29

2. Assessing the Multivariate Relationship between the Human Infant Intestinal Exfoliated Cell Transcriptome (Exfoliome) and Microbiome in Response to Diet;Microorganisms;2020-12-18

3. The Model-Based Study of the Effectiveness of Reporting Lists of Small Feature Sets Using RNA-Seq Data;Cancer Informatics;2017-01-01

4. Model-based study of the Effectiveness of Reporting Lists of Small Feature Sets using RNA-Seq Data;Proceedings of the 7th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics;2016-10-02

5. References;Epistemology of the Cell;2011-10-17