Bias in error estimation when using cross-validation for model selection

Published:2006-02-23 Issue:1 Volume:7 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en

Author:

Varma Sudhir, Simon Richard

Abstract

Background: Cross-validation (CV) is an effective method for estimating the prediction error of a classifier. Some recent articles have proposed methods for optimizing classifiers by choosing classifier parameter values that minimize the CV error estimate. We have evaluated the validity of using the CV error estimate of the optimized classifier as an estimate of the true error expected on independent data.

Results: We used CV to optimize the classification parameters for two kinds of classifiers: Shrunken Centroids and Support Vector Machines (SVM). Random training datasets were created, with no difference in the distribution of the features between the two classes. Using these "null" datasets, we selected classifier parameter values that minimized the CV error estimate. 10-fold CV was used for Shrunken Centroids, while leave-one-out CV (LOOCV) was used for the SVM. Independent test data were created to estimate the true error. With "null" and "non-null" (with differential expression between the classes) data, we also tested a nested CV procedure, in which an inner CV loop tunes the parameters while an outer CV loop computes an estimate of the error.

The CV error estimate for the classifier with the optimal parameters was found to be a substantially biased estimate of the true error that the classifier would incur on independent data. Even though there is no real difference between the two classes for the "null" datasets, the CV error estimate for the Shrunken Centroid classifier with the optimal parameters was less than 30% on 18.5% of simulated training datasets. For the SVM with optimal parameters, the estimated error rate was less than 30% on 38% of "null" datasets. Performance of the optimized classifiers on the independent test set was no better than chance.

The nested CV procedure reduces the bias considerably and gives an estimate of the error that is very close to that obtained on the independent test set for both Shrunken Centroids and SVM classifiers, for "null" and "non-null" data distributions.

Conclusion: We show that using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error. Proper use of CV for estimating the true error of a classifier developed using a well-defined algorithm requires that all steps of the algorithm, including classifier parameter tuning, be repeated in each CV loop. A nested CV procedure provides an almost unbiased estimate of the true error.
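
The nested procedure described in the abstract can be sketched in a few lines of standard-library Python. This is an illustrative sketch only, not the authors' code: the classifier is a plain nearest-centroid rule on the top-k features (a simplified stand-in for Shrunken Centroids), and the function names, the parameter grid, the dataset size, and the random seed are all assumptions made for the example.

```python
import random

def make_null_data(n, p, rng):
    """Noise-only features with balanced random labels ("null" data)."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [0] * (n // 2) + [1] * (n - n // 2)
    rng.shuffle(y)
    return X, y

def fit_predict(X_tr, y_tr, X_te, k):
    """Nearest-centroid rule on the k features whose class means differ most.
    Feature ranking uses the training samples only."""
    p = len(X_tr[0])
    rows0 = [x for x, t in zip(X_tr, y_tr) if t == 0]
    rows1 = [x for x, t in zip(X_tr, y_tr) if t == 1]
    c0 = [sum(r[j] for r in rows0) / len(rows0) for j in range(p)]
    c1 = [sum(r[j] for r in rows1) / len(rows1) for j in range(p)]
    top = sorted(range(p), key=lambda j: -abs(c0[j] - c1[j]))[:k]
    preds = []
    for x in X_te:
        d0 = sum((x[j] - c0[j]) ** 2 for j in top)
        d1 = sum((x[j] - c1[j]) ** 2 for j in top)
        preds.append(0 if d0 <= d1 else 1)
    return preds

def folds(n, n_folds):
    """Split indices 0..n-1 into n_folds interleaved test sets."""
    return [list(range(f, n, n_folds)) for f in range(n_folds)]

def cv_error(X, y, k, n_folds):
    """Ordinary k-fold CV error for one fixed parameter value k."""
    wrong = 0
    for test in folds(len(y), n_folds):
        held = set(test)
        train = [i for i in range(len(y)) if i not in held]
        preds = fit_predict([X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in test], k)
        wrong += sum(pred != y[i] for pred, i in zip(preds, test))
    return wrong / len(y)

def tune(X, y, grid, n_folds):
    """Return (best k, its CV error): the grid value that minimizes CV error."""
    scored = [(k, cv_error(X, y, k, n_folds)) for k in grid]
    return min(scored, key=lambda pair: pair[1])

def nested_cv_error(X, y, grid, n_folds):
    """Outer loop estimates error; tuning is redone inside every outer fold."""
    wrong = 0
    for test in folds(len(y), n_folds):
        held = set(test)
        train = [i for i in range(len(y)) if i not in held]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        best_k, _ = tune(X_tr, y_tr, grid, n_folds)  # inner CV: training part only
        preds = fit_predict(X_tr, y_tr, [X[i] for i in test], best_k)
        wrong += sum(pred != y[i] for pred, i in zip(preds, test))
    return wrong / len(y)

rng = random.Random(0)  # arbitrary seed, for reproducibility only
X, y = make_null_data(n=40, p=50, rng=rng)
grid = [1, 2, 5, 10, 20]  # hypothetical grid for the number of features
best_k, tuned_err = tune(X, y, grid, n_folds=5)
nested_err = nested_cv_error(X, y, grid, n_folds=5)
print(f"CV error at tuned k={best_k}: {tuned_err:.2f}")
print(f"Nested CV error:            {nested_err:.2f}")
```

The first number is the minimum of several noisy CV estimates, so on noise-only data it tends to fall below the 50% chance level. The nested estimate never scores a held-out sample with a parameter that was chosen using that sample. A single small run can deviate in either direction, so the meaningful comparison is an average over many simulated datasets, as in the study.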

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/1471-2105-7-91.pdf

References: 12 articles.

1. Duda RO, Hart PE, Stork DG: Pattern classification. John Wiley and Sons Inc 2001, Ch.9: 483–486.

2. Simon R, Radmacher MD, Dobbin K, McShane LM: Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J Natl Cancer Inst 2003, 95: 14–18.

3. Ambroise C, McLachlan GJ: Selection bias in gene extraction on the basis of microarray gene-expression data. PNAS 2002, 99(10): 6562–6566. 10.1073/pnas.102102699

4. Reunanen J: Overfitting in making comparisons between variable selection methods. J Machine Learning Research 2003, 3: 1371–1382. 10.1162/153244303322753715

5. Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. PNAS 2002, 99(10): 6567–6572. 10.1073/pnas.082099299

Cited by 1206 articles.

1. Machine learning framework to extract the biomarker potential of plasma IgG N-glycans towards disease risk stratification;Computational and Structural Biotechnology Journal;2024-12

2. Discovery of alkaline laccases from basidiomycete fungi through machine learning-based approach;Biotechnology for Biofuels and Bioproducts;2024-09-11

3. EEG-based Signatures of Schizophrenia, Depression, and Aberrant Aging: A Supervised Machine Learning Investigation;Schizophrenia Bulletin;2024-09-09

4. A primer and practical recommendations for AI and machine learning terminology in medicine and behavioral sciences (Preprint);2024-09-03

5. Prediction of inhibitor development in previously untreated and minimally treated children with severe and moderately severe hemophilia A using a machine-learning network;Journal of Thrombosis and Haemostasis;2024-09