Author:
Chen Shuai, Li Ziqi, Liu Long, Wen Yalu
Abstract
While high-dimensional biological data have provided unprecedented resources for the identification of biomarkers, consensus is still lacking on how best to analyze them. The recently developed Gaussian mirror (GM) and Model-X (MX) knockoff-based methods have closely related model assumptions, which makes them appealing for the detection of new biomarkers. However, there are no guidelines for their practical use. In this research, we systematically compared the performance of MX-based and GM methods, evaluating the impacts of the distribution of explanatory variables, their relatedness, and the signal-to-noise ratio. MX with knockoffs generated using the second-order approximation (MX-SO) performs best among the MX-based methods. MX-SO and GM have similar levels of power and computational speed under most of the simulations, but GM is more robust in controlling the false discovery rate (FDR). In particular, MX-SO controls the FDR well only when correlations among explanatory variables are weak and the sample size is at least moderate. In contrast, GM achieves the desired FDR as long as explanatory variables are not highly correlated. We further used GM and MX-based methods to detect biomarkers associated with an Alzheimer’s disease-related PET-imaging trait and with Parkinson’s disease-related T-tau in cerebrospinal fluid. We found that MX-based and GM methods are both powerful for the analysis of big biological data. Although the genes selected by the different MX-based methods are more similar to one another than to those selected by the GM method, both MX-based and GM methods identify well-known disease-associated genes for each disease. While MX-based methods can have slightly higher power than the GM method, they are less robust, especially for data with small sample sizes, unknown distributions, and high correlations.
Funder
National Natural Science Foundation of China
Early Career Research Excellence Award from the University of Auckland, the Marsden Fund from Royal Society of New Zealand
Publisher
Springer Science and Business Media LLC