Separating and reintegrating latent variables to improve classification of genomic data

Published: 2022-01-30
ISSN: 1465-4644
Container-title: Biostatistics
Language: en

Author:

Payne Nora Yujia¹, Gagnon-Bartsch Johann A¹

Affiliation:

1. Department of Statistics, University of Michigan, 1085 S. University Ave., Ann Arbor, MI 48109, USA

Abstract

Genomic data sets contain the effects of various unobserved biological variables in addition to the variable of primary interest. These latent variables often affect a large number of features (e.g., genes), giving rise to dense latent variation. This latent variation presents both challenges and opportunities for classification. While some of these latent variables may be partially correlated with the phenotype of interest and thus helpful, others may be uncorrelated and merely contribute additional noise. Moreover, whether potentially helpful or not, these latent variables may obscure weaker effects that impact only a small number of features but more directly capture the signal of primary interest. To address these challenges, we propose the cross-residualization classifier (CRC). Through an adjustment and ensemble procedure, the CRC estimates and residualizes out the latent variation, trains a classifier on the residuals, and then reintegrates the latent variation in a final ensemble classifier. Thus, the latent variables are accounted for without discarding any potentially predictive information. We apply the method to simulated data and a variety of genomic data sets from multiple platforms. In general, we find that the CRC performs well relative to existing classifiers and sometimes offers substantial gains.
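
The separate-then-reintegrate idea in the abstract can be illustrated with a toy sketch. This is not the authors' CRC procedure: the abstract does not give the algorithm, and the cross-residualization step in particular is omitted here. Everything below is an assumption made for illustration: a single latent factor estimated by power iteration, one-dimensional LDA rules as the base classifiers, an ensemble formed by summing their log-odds, and the hypothetical helper names `top_direction`, `lda_1d`, `fit`, and `simulate`.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def top_direction(rows, iters=50):
    """Leading principal direction of centered rows, by power iteration."""
    p = len(rows[0])
    v = [p ** -0.5] * p
    for _ in range(iters):
        scores = [dot(r, v) for r in rows]
        w = [sum(s * r[j] for s, r in zip(scores, rows)) for j in range(p)]
        norm = dot(w, w) ** 0.5
        v = [x / norm for x in w]
    return v

def lda_1d(scores, y):
    """One-dimensional LDA (equal priors): returns a log-odds function for class 1."""
    s0 = [s for s, c in zip(scores, y) if c == 0]
    s1 = [s for s, c in zip(scores, y) if c == 1]
    m0, m1 = sum(s0) / len(s0), sum(s1) / len(s1)
    var = (sum((s - m0) ** 2 for s in s0)
           + sum((s - m1) ** 2 for s in s1)) / (len(scores) - 2)
    return lambda s: (m1 - m0) / var * (s - (m0 + m1) / 2)

def fit(X, y):
    """Toy separate-and-reintegrate classifier (assumed design, not the CRC)."""
    n, p = len(X), len(X[0])
    mean = [sum(r[j] for r in X) / n for j in range(p)]
    v = top_direction([[a - m for a, m in zip(r, mean)] for r in X])

    def split(row):
        # Return (latent score, residual after removing the latent direction).
        c = [a - m for a, m in zip(row, mean)]
        s = dot(c, v)
        return s, [a - s * b for a, b in zip(c, v)]

    parts = [split(r) for r in X]
    latent = [s for s, _ in parts]
    resid = [e for _, e in parts]

    # Residual classifier: project residuals onto the class-centroid difference.
    r0 = [e for e, c in zip(resid, y) if c == 0]
    r1 = [e for e, c in zip(resid, y) if c == 1]
    d = [sum(e[j] for e in r1) / len(r1) - sum(e[j] for e in r0) / len(r0)
         for j in range(p)]
    resid_rule = lda_1d([dot(e, d) for e in resid], y)
    latent_rule = lda_1d(latent, y)  # classifier on the latent variation itself

    def predict(row):
        s, e = split(row)
        # Ensemble: add the log-odds from the latent and residual classifiers.
        return 1 if latent_rule(s) + resid_rule(dot(e, d)) > 0 else 0

    return predict

def simulate(n, p, rng):
    """Simulated data: one dense latent factor plus a sparse class signal."""
    X, y = [], []
    for _ in range(n):
        c = rng.randint(0, 1)
        z = rng.gauss(0.5 * c, 1.0)  # latent variable, weakly tied to the class
        row = [3.0 * z + rng.gauss(0.0, 1.0) for _ in range(p)]
        row[0] += 2.0 * c            # sparse signal on a single feature
        X.append(row)
        y.append(c)
    return X, y

rng = random.Random(0)
X_train, y_train = simulate(300, 10, rng)
X_test, y_test = simulate(300, 10, rng)
predict = fit(X_train, y_train)
accuracy = sum(predict(r) == c for r, c in zip(X_test, y_test)) / len(y_test)
print(f"test accuracy: {accuracy:.2f}")
```

In this sketch the residual rule is fit and evaluated on the same training residuals, which can make its log-odds overconfident when there are many features. The sketch does not attempt to correct for this.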

Funder

National Science Foundation Graduate Research Fellowship

National Science Foundation RTG

Publisher

Oxford University Press (OUP)

Subject

Statistics, Probability and Uncertainty; General Medicine; Statistics and Probability

Link

https://academic.oup.com/biostatistics/advance-article-pdf/doi/10.1093/biostatistics/kxab046/42334197/kxab046.pdf

Reference: 37 articles.

1. Some theory for Fisher’s linear discriminant function, ‘naive Bayes’, and some alternatives when there are many more variables than observations; Bickel; Bernoulli, 2004

2. Air pollution and gene-specific methylation in the Normative Aging Study: Association, effect modification, and mediation analysis; Bind; Epigenetics, 2014

3. An expanded view of complex traits: From polygenic to omnigenic; Boyle; Cell, 2017

4. Selecting the number of principal components: Estimation of the true rank of a noisy matrix; Choi; Annals of Statistics, 2017

5. Estimating sufficient reductions of the predictors in abundant high-dimensional regressions; Cook; Annals of Statistics, 2012