Author:
Baker Brennan H.,Sathyanarayana Sheela,Szpiro Adam A.,MacDonald James W.,Paquette Alison G.
Abstract
Abstract
Missing covariate data is a common problem that has not been addressed in observational studies of gene expression. Here, we present a multiple imputation method that accommodates high dimensional gene expression data by incorporating principal component analysis of the transcriptome into the multiple imputation prediction models to avoid bias. Simulation studies using three datasets show that this method outperforms complete case and single imputation analyses at uncovering true positive differentially expressed genes, limiting false discovery rates, and minimizing bias. This method is easily implemented via an R Bioconductor package, RNAseqCovarImpute that integrates with the limma-voom pipeline for differential expression analysis.
Publisher
Springer Science and Business Media LLC
Reference55 articles.
1. van Buuren S. Flexible Imputation of Missing Data. Second Edition (2nd ed.). Chapman and Hall/CRC; 2018. https://doi.org/10.1201/9780429492259.
2. Rubin DB. Multiple Imputation for Nonresponse in Surveys. New York: Wiley; 2004.
3. Heymans M, Eekhout I. Applied missing data analysis with SPSS and (R) Studio. Heymans and Eekhout: Amsterdam, The Netherlands: 20 Available online: https://bookdown/org/mwheymans/bookmi/. 2019. Accessed 23 May 2020.
4. Goodwin S, McPherson JD, McCombie WR. Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet. 2016;17(6):333–51.
5. LeWinn KZ, Karr CJ, Hazlehurst M, Carroll K, Loftus C, Nguyen R, et al. Cohort profile: the ECHO prenatal and early childhood pathways to health consortium (ECHO-PATHWAYS). BMJ Open. 2022;12(10):e064288.