Affiliation:
1. Institute for Animal Breeding and Genetics, University of Veterinary Medicine Hannover, D-30559 Hannover, Germany
Abstract
Outliers in the training or test set used to fit and evaluate a classifier on transcriptomics data can considerably change the estimated performance of the model. As a result, the reported accuracy is either too pessimistic or too optimistic, and the estimated model performance cannot be reproduced on independent data. It is then also doubtful whether the classifier qualifies for clinical use. We estimate classifier performance in simulated gene expression data with artificial outliers and in two real-world datasets. As a new approach, we use two outlier detection methods within a bootstrap procedure to estimate the outlier probability for each sample, and we evaluate classifiers before and after outlier removal by means of cross-validation. We found that removing outliers changed the classification performance notably and, for the most part, improved the classification results. Because there are various, sometimes unclear reasons for a sample to be an outlier, we strongly advocate always reporting the performance of a transcriptomics classifier with and without outliers in the training and test data. This provides a more complete picture of a classifier’s performance and prevents reporting models that later turn out not to be applicable for clinical diagnosis.
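The bootstrap step described in the abstract can be illustrated with a minimal sketch. The abstract does not name the two outlier detection methods, so the detector below (`mad_detector`, a robust z-score on each sample's distance to the coordinate-wise median) is a hypothetical stand-in, and the function names, the threshold of 3.5, and the number of bootstrap replicates are assumptions chosen for illustration only. The sketch estimates, for each sample, the fraction of bootstrap replicates containing that sample in which it was flagged.

```python
import random
import statistics


def mad_detector(samples, threshold=3.5):
    """Hypothetical stand-in detector (not the paper's method): flag samples
    whose distance to the coordinate-wise median has an extreme robust z-score."""
    n_features = len(samples[0])
    center = [statistics.median(s[j] for s in samples) for j in range(n_features)]
    dists = [
        sum((s[j] - center[j]) ** 2 for j in range(n_features)) ** 0.5
        for s in samples
    ]
    med = statistics.median(dists)
    mad = statistics.median(abs(d - med) for d in dists)
    if mad == 0:
        # Degenerate replicate: no spread to compare against, flag nothing.
        return [False] * len(samples)
    return [0.6745 * (d - med) / mad > threshold for d in dists]


def bootstrap_outlier_probability(samples, detector=mad_detector, n_boot=200, seed=0):
    """Estimate per-sample outlier probability as
    (# replicates where flagged) / (# replicates where drawn)."""
    rng = random.Random(seed)
    n = len(samples)
    drawn = [0] * n
    flagged = [0] * n
    for _ in range(n_boot):
        # Resample the sample indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        flags = detector([samples[i] for i in idx])
        # Count each original sample at most once per replicate.
        seen = {}
        for i, is_outlier in zip(idx, flags):
            seen[i] = is_outlier
        for i, is_outlier in seen.items():
            drawn[i] += 1
            flagged[i] += int(is_outlier)
    return [flagged[i] / drawn[i] if drawn[i] else float("nan") for i in range(n)]


# Toy data: 20 inliers on a small grid plus one artificial outlier.
inliers = [((i % 5) * 0.5 - 1.0, (i // 5) * 0.5 - 0.75) for i in range(20)]
data = inliers + [(10.0, 10.0)]
probs = bootstrap_outlier_probability(data)
```

Samples whose estimated probability exceeds a chosen cutoff could then be removed, and the classifier evaluated by cross-validation both with and without them, as the abstract recommends.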
Funder
Deutsche Forschungsgemeinschaft
Subject
Genetics (clinical), Genetics