Robust principal component analysis for accurate outlier sample detection in RNA-Seq data

Published:2020-06-29 Issue:1 Volume:21 Page:
ISSN:1471-2105
Container-title:BMC Bioinformatics
language:en
Short-container-title:BMC Bioinformatics

Author:

Chen Xiaoying, Zhang Bo, Wang Ting, Bonni Azad, Zhao Guoyan

Abstract

Background: High-throughput RNA sequencing is a powerful approach to study gene expression. Because data acquisition involves complex multi-step protocols, a sample may deviate extremely from other samples in the same treatment group, owing to either technical variation or true biological differences. The high dimensionality of the data, combined with few biological replicates, makes it challenging to detect such samples accurately, and the issue is not well studied in the current literature. Robust statistics is a family of theories and techniques that aim to detect outliers by first fitting the majority of the data and then flagging data points that deviate from that fit. Robust statistics has been widely used in multivariate data analysis for outlier detection in chemometrics and engineering. Here we apply robust statistics to RNA-seq data analysis.

Results: We report the use of two robust principal component analysis (rPCA) methods, PcaHubert and PcaGrid, to detect outlier samples in multiple simulated and real biological RNA-seq data sets with positive control outlier samples. PcaGrid achieved 100% sensitivity and 100% specificity in all the tests using positive control outliers with varying degrees of divergence. We applied the rPCA methods and classical principal component analysis (cPCA) to an RNA-seq data set profiling gene expression of the external granule layer in the cerebellum of control and conditional SnoN knockout mice. Both rPCA methods detected the same two outlier samples, but cPCA failed to detect any. We performed differentially expressed gene detection before and after outlier removal, as well as with and without batch effect modeling. We validated gene expression changes using quantitative reverse transcription PCR and used the result as a reference to compare the performance of eight different data analysis strategies. Removing outliers without batch effect modeling performed the best in terms of detecting biologically relevant differentially expressed genes.

Conclusions: rPCA implemented in the PcaGrid function is an accurate and objective method to detect outlier samples. It is well suited for high-dimensional data with small sample sizes, such as RNA-seq data. Outlier removal can significantly improve the performance of differential gene detection and downstream functional analysis.
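
The general idea described in the abstract (fit the majority of the data first, then flag points that deviate from that fit) can be illustrated with a minimal Python sketch. This is not the PcaHubert or PcaGrid method evaluated in the paper; it swaps in a much simpler per-gene median/MAD fit, and the simulated data, the function names, and the cutoff of 3 are assumptions made purely for illustration.

```python
import random
import statistics


def mad(values):
    """Median absolute deviation, scaled to match the SD under normality."""
    values = list(values)
    med = statistics.median(values)
    return 1.4826 * statistics.median([abs(v - med) for v in values])


def flag_outlier_samples(matrix, cutoff=3.0):
    """matrix: list of samples, each a list of per-gene expression values.

    Returns (flagged sample indices, per-sample deviation scores).
    Hypothetical helper for illustration only, not an rPCA implementation.
    """
    n_genes = len(matrix[0])
    # Step 1: robustly fit the majority of samples, gene by gene.
    centers, scales = [], []
    for g in range(n_genes):
        column = [row[g] for row in matrix]
        centers.append(statistics.median(column))
        scales.append(mad(column))
    # Step 2: score each sample by its median absolute robust z-score.
    scores = []
    for row in matrix:
        z = [abs(x - c) / s for x, c, s in zip(row, centers, scales) if s > 0]
        scores.append(statistics.median(z))
    # Step 3: flag samples whose score lies far above the bulk of the scores.
    threshold = statistics.median(scores) + cutoff * mad(scores)
    flagged = [i for i, s in enumerate(scores) if s > threshold]
    return flagged, scores


# Simulated data (an assumption): 8 samples x 500 genes, one noisy sample.
rng = random.Random(0)
n_samples, n_genes, outlier_index = 8, 500, 5
baseline = [rng.gauss(8.0, 2.0) for _ in range(n_genes)]
matrix = []
for i in range(n_samples):
    noise_sd = 3.0 if i == outlier_index else 0.5
    matrix.append([b + rng.gauss(0.0, noise_sd) for b in baseline])

flagged, scores = flag_outlier_samples(matrix)
print("flagged samples:", flagged)
```

Because the median and MAD are barely affected by a single aberrant sample, the fit in step 1 reflects the majority, which is what lets the deviating sample stand out in step 2. With only a handful of samples the cutoff in step 3 is itself estimated noisily, so this sketch should not be read as having the sensitivity or specificity reported for PcaGrid.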

Funder

National Institutes of Health

National Institute on Drug Abuse

Goldman Sachs Group

National Human Genome Research Institute

National Institute of Environmental Health Sciences

Publisher

Springer Science and Business Media LLC

Subject

Applied Mathematics, Computer Science Applications, Molecular Biology, Biochemistry, Structural Biology

Link

https://link.springer.com/content/pdf/10.1186/s12859-020-03608-0.pdf

References: 46 articles.

1. Moore DS, McCabe GP. Introduction to the practice of statistics. 3rd ed. New York: W. H. Freeman; 1999.

2. Rousseeuw PJ, Hubert M. Anomaly detection by robust statistics. WIREs: Data Mining Knowl Discovery. 2018;8(2):1–1.

3. Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5(7):621–8.

4. Conesa A, Madrigal P, Tarazona S, Gomez-Cabrero D, Cervera A, McPherson A, et al. A survey of best practices for RNA-seq data analysis. Genome Biol. 2016;17:13.

5. Norton SS, Vaquero-Garcia J, Lahens NF, Grant GR, Barash Y. Outlier detection for improved differential splicing quantification from RNA-Seq experiments with replicates. Bioinformatics. 2018;34(9):1488–97.

Cited by 53 articles.

1. Sex-specific effects of early life stress exposure on memory performance and medial prefrontal cortex transcriptomic pattern of adolescent mice;2024-09-02

2. Artificial intelligence in metabolomics: a current review;TrAC Trends in Analytical Chemistry;2024-09

3. Insights into the differential proteome landscape of a newly isolated Paramecium multimicronucleatum in response to cadmium stress;Journal of Proteomics;2024-05

4. Glis2 is an early effector of polycystin signaling and a target for therapy in polycystic kidney disease;Nature Communications;2024-05-01

5. An Application of Robust Principal Component Analysis Methods for Anomaly Detection;Turkish Journal of Science and Technology;2024-03-28