Fast matrix completion in epigenetic methylation studies with informative covariates

Published: 2024-06-07
ISSN:1465-4644
Container-title:Biostatistics
language:en

Author:

Ribaud Mélina¹, Labbe Aurélie¹, Fouda Khaled¹, Oualkacha Karim²

Affiliation:

1. Department of Decision Science, HEC Montreal, 3000 chemin de la Cote Ste Catherine, Montréal, QC H3T 2A7, Canada

2. Department of Mathematics, Université du Québec à Montreal, 201 Ave Président-Kennedy, Montreal (QC) H2X 3Y7, Canada

Abstract

DNA methylation is an important epigenetic mark that modulates gene expression through the inhibition of transcriptional proteins binding to DNA. As in many other omics experiments, missing values are a major issue, and appropriate imputation techniques are needed to avoid an unnecessary reduction in sample size and to make the best use of the information collected. We consider the case where relatively few samples are processed via an expensive high-density whole genome bisulfite sequencing (WGBS) strategy and a larger number of samples is processed using more affordable low-density, array-based technologies. In such cases, one can impute the low-coverage (array-based) methylation data using the high-density information provided by the WGBS samples. In this paper, we propose an efficient Linear Model of Coregionalisation with informative Covariates (LMCC) to predict missing values based on observed values and covariates. Our model assumes that at each site, the methylation vector of all samples is linked to a set of fixed factors (covariates) and a set of latent factors. Furthermore, we exploit the functional nature of the data and the spatial correlation across sites by assuming Gaussian processes on the fixed and latent coefficient vectors, respectively. Our simulations show that the use of covariates can significantly improve the accuracy of imputed values, especially in cases where missing data contain some relevant information about the explanatory variable. We also show that our proposed model is particularly efficient when the number of columns is much greater than the number of rows, which is usually the case in methylation data analysis. Finally, we apply and compare our proposed method with alternative approaches on two real methylation datasets, showing how covariates such as cell type, tissue type or age can enhance the accuracy of imputed values.
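To make the model structure described in the abstract concrete, the following is a minimal simulation sketch, not the authors' LMCC estimator. It assumes a Gaussian response on an unbounded scale, a squared-exponential kernel over site positions, one hypothetical binary covariate, and arbitrary effect sizes; all function names and parameter values are illustrative assumptions. Data are generated as a fixed-factor term plus a latent-factor term, each with coefficients that vary smoothly across sites. Missing entries are then imputed with a deliberately naive per-site least-squares fit, with and without the covariate, to illustrate why an informative covariate can help.

```python
import numpy as np


def rbf_kernel(sites, length_scale):
    """Squared-exponential covariance between site positions (assumed kernel)."""
    diff = sites[:, None] - sites[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def simulate(n_samples=30, n_sites=200, n_latent=2, seed=0):
    """Draw Y = X @ beta + A @ Z + noise, with beta and Z smooth across sites."""
    rng = np.random.default_rng(seed)
    sites = np.linspace(0.0, 1.0, n_sites)
    # Jitter on the diagonal keeps the Cholesky factorisation stable.
    chol = np.linalg.cholesky(rbf_kernel(sites, 0.05) + 1e-6 * np.eye(n_sites))
    # Covariates: intercept plus one hypothetical binary factor (e.g. cell type).
    X = np.column_stack([np.ones(n_samples), rng.integers(0, 2, n_samples)])
    # Fixed-effect coefficients: one Gaussian-process draw per covariate, shape (q, p).
    beta = (chol @ rng.standard_normal((n_sites, X.shape[1]))).T
    # Latent part: sample loadings A (n, d) and latent processes Z (d, p).
    A = rng.standard_normal((n_samples, n_latent))
    Z = (chol @ rng.standard_normal((n_sites, n_latent))).T
    noise = 0.1 * rng.standard_normal((n_samples, n_sites))
    Y = X @ (2.0 * beta) + 0.5 * (A @ Z) + noise
    return X, Y


def impute(X, Y, mask, use_covariates):
    """Fill masked entries site by site using least squares on observed rows.

    With use_covariates=False only the intercept is used, which is
    equivalent to column-mean imputation.
    """
    filled = np.where(mask, np.nan, Y)
    design = X if use_covariates else X[:, :1]
    for j in range(Y.shape[1]):
        miss = mask[:, j]
        if not miss.any() or miss.all():
            continue  # nothing to impute, or nothing to fit on
        coef, *_ = np.linalg.lstsq(design[~miss], Y[~miss, j], rcond=None)
        filled[miss, j] = design[miss] @ coef
    return filled


def rmse(truth, filled, mask):
    """Root mean squared error over the masked (imputed) entries."""
    return float(np.sqrt(np.nanmean((truth[mask] - filled[mask]) ** 2)))


X, Y = simulate()
mask = np.random.default_rng(1).random(Y.shape) < 0.3  # 30% missing at random
rmse_mean = rmse(Y, impute(X, Y, mask, use_covariates=False), mask)
rmse_cov = rmse(Y, impute(X, Y, mask, use_covariates=True), mask)
print(f"column-mean RMSE: {rmse_mean:.3f}, covariate RMSE: {rmse_cov:.3f}")
```

This baseline ignores both the latent factors and the correlation between neighbouring sites, which are the parts the abstract says the proposed model exploits, so it should be read only as an illustration of the data-generating structure and of the covariate effect.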

Funder

Canadian Statistical Institute

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/biostatistics/advance-article-pdf/doi/10.1093/biostatistics/kxae016/58914405/kxae016.pdf

References (25 articles)

1. Accounting for population stratification in DNA methylation studies; Barfield; Genet Epidemiol, 2014

2. A Bayesian tensor decomposition approach for spatiotemporal traffic data imputation; Chen; Transport Res C Emerg Technol, 2019

3. Methylation data imputation performances under different representations and missingness patterns; Di Lena; BMC Bioinformatics, 2020

4. Goblet cell and mucin gene abnormalities in asthma; Fahy; Chest, 2002

5. Gaussian orthogonal latent factor processes for large incomplete matrices of correlated data; Gu; Bayesian Anal, 2022