Fast matrix completion in epigenetic methylation studies with informative covariates

Author:

Ribaud Mélina (1), Labbe Aurélie (1), Fouda Khaled (1), Oualkacha Karim (2)

Affiliation:

1. Department of Decision Science, HEC Montreal, 3000 chemin de la Cote Ste Catherine, Montréal, QC H3T 2A7, Canada

2. Department of Mathematics, Université du Québec à Montreal, 201 Ave Président-Kennedy, Montreal, QC H2X 3Y7, Canada

Abstract

DNA methylation is an important epigenetic mark that modulates gene expression by inhibiting the binding of transcriptional proteins to DNA. As in many other omics experiments, missing values are a common problem, and appropriate imputation techniques help avoid an unnecessary reduction in sample size and make the best use of the information collected. We consider the case where relatively few samples are processed via an expensive high-density whole genome bisulfite sequencing (WGBS) strategy and a larger number of samples are processed using more affordable low-density, array-based technologies. In such cases, one can impute the low-coverage (array-based) methylation data using the high-density information provided by the WGBS samples. In this paper, we propose an efficient Linear Model of Coregionalisation with informative Covariates (LMCC) to predict missing values from observed values and covariates. Our model assumes that, at each site, the methylation vector of all samples is linked to a set of fixed factors (covariates) and a set of latent factors. Furthermore, we exploit the functional nature of the data and the spatial correlation across sites by assuming Gaussian processes on the fixed and latent coefficient vectors, respectively. Our simulations show that the use of covariates can significantly improve the accuracy of imputed values, especially in cases where the missing data contain relevant information about the explanatory variable. We also show that our proposed model is particularly efficient when the number of columns is much greater than the number of rows, which is usually the case in methylation data analysis. Finally, we apply our proposed method to two real methylation datasets and compare it with alternative approaches, showing how covariates such as cell type, tissue type or age can enhance the accuracy of imputed values.
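The model structure described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' LMCC estimator: it only draws data from an assumed form Y = XB + AZ + noise, where the rows of B (fixed-effect coefficients) and Z (latent factors) vary smoothly across sites under a squared-exponential Gaussian process, and then fills missing entries with a simple swapped-in technique (alternating least-squares regression on the covariates and a truncated SVD of the residual). All function names, the kernel choice, and the parameter values are assumptions made for illustration.

```python
import numpy as np

def se_kernel(t, length_scale=0.1):
    """Squared-exponential covariance over site positions t (assumed kernel)."""
    diff = t[:, None] - t[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def simulate(n=30, p=200, q=2, d=3, noise=0.1, seed=0):
    """Draw Y (n samples x p sites) = X B + A Z + noise, with rows of B and Z smooth across sites."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, p)
    K = se_kernel(t) + 1e-6 * np.eye(p)      # jitter keeps the Cholesky factorisation stable
    L = np.linalg.cholesky(K)
    X = rng.normal(size=(n, q))              # observed covariates (fixed factors)
    A = rng.normal(size=(n, d))              # unobserved loadings on the latent factors
    B = (L @ rng.normal(size=(p, q))).T      # q x p coefficient processes
    Z = (L @ rng.normal(size=(p, d))).T      # d x p latent factor processes
    Y = X @ B + A @ Z + noise * rng.normal(size=(n, p))
    return Y, X

def impute(Y_obs, mask, X=None, rank=3, n_iter=50):
    """Fill entries where mask is False by alternating a regression on X with a rank-`rank` SVD fit."""
    observed_sum = np.where(mask, Y_obs, 0.0).sum(axis=0)
    col_means = observed_sum / np.maximum(mask.sum(axis=0), 1)
    filled = np.where(mask, Y_obs, col_means[None, :])   # start from column means
    for _ in range(n_iter):
        if X is not None:
            B_hat, *_ = np.linalg.lstsq(X, filled, rcond=None)
            fixed = X @ B_hat
        else:
            fixed = np.zeros_like(filled)
        U, s, Vt = np.linalg.svd(filled - fixed, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Observed entries are kept as-is; only missing entries are updated.
        filled = np.where(mask, Y_obs, fixed + low_rank)
    return filled

if __name__ == "__main__":
    Y, X = simulate()
    rng = np.random.default_rng(1)
    mask = rng.random(Y.shape) > 0.3          # True where the entry is observed
    Y_obs = np.where(mask, Y, np.nan)
    for label, covariates in [("with covariates", X), ("without covariates", None)]:
        Y_hat = impute(Y_obs, mask, X=covariates)
        rmse = np.sqrt(np.mean((Y_hat[~mask] - Y[~mask]) ** 2))
        print(f"RMSE {label}: {rmse:.3f}")
```

Running the script prints the held-out RMSE with and without the covariate regression step. Because both the data and the imputation scheme are simplified stand-ins, the numbers say nothing about the performance reported in the paper.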

Funder

Canadian Statistical Institute

Publisher

Oxford University Press (OUP)
