Efficient Analysis of Annotation Colocalization Accounting for Genomic Contexts

Author:

Gafurov Askar, Vinař Tomáš, Medvedev Paul, Brejová Broňa

Abstract

An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes, conserved elements, and epigenetic modifications. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing two random unrelated annotations. Previous approaches to this problem remain too slow or inaccurate.

To incorporate more background information into such analyses and avoid biased results, we propose a new null model based on a Markov chain which differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or sequencing gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistics and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models.

We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. The use of genomic contexts to correct for GC-bias also resulted in the reversal of some previously published findings.

Availability: The software is freely available at https://github.com/fmfi-compbio/mcdp2 under the MIT licence. All data for reproducibility are available at https://github.com/fmfi-compbio/mcdp2-reproducibility
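As a rough illustration of the idea of computing exact moments and then applying a normal approximation, the following Python sketch handles the simplest imaginable case. It is not the authors' algorithm and does not use the mcdp2 software: it assumes a fixed reference annotation given as a list of base positions, a query annotation drawn from a stationary two-state Markov chain without genomic contexts, and the number of shared bases as the test statistic. The function names and the parameters `p01` and `p10` are hypothetical, and the variance is computed by a naive quadratic-time sum rather than the linear-time method the abstract describes.

```python
import math


def overlap_moments(ref_positions, p01, p10):
    """Mean and variance of the number of reference bases covered by a query
    drawn from a stationary two-state Markov chain (0 = outside an interval,
    1 = inside). p01 and p10 are the hypothetical 0->1 and 1->0 transition
    probabilities of the null model."""
    pi1 = p01 / (p01 + p10)   # stationary probability of state 1
    lam = 1.0 - p01 - p10     # second eigenvalue of the transition matrix
    mean = len(ref_positions) * pi1
    var = 0.0
    # Naive O(n^2) sum of pairwise covariances:
    # Cov(X_i, X_j) = pi1 * (1 - pi1) * lam^|i - j| for a stationary chain.
    for i in ref_positions:
        for j in ref_positions:
            var += pi1 * (1.0 - pi1) * lam ** abs(i - j)
    return mean, var


def normal_pvalues(observed, mean, variance):
    """Return (z, p_enrichment, p_depletion) using a normal approximation."""
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (observed - mean) / math.sqrt(variance)
    # Upper tail P(Z >= z): small values suggest enrichment.
    p_enrichment = 0.5 * math.erfc(z / math.sqrt(2))
    # Lower tail P(Z <= z): small values suggest depletion.
    p_depletion = 0.5 * math.erfc(-z / math.sqrt(2))
    return z, p_enrichment, p_depletion


if __name__ == "__main__":
    # Toy reference annotation: two intervals on a short sequence.
    reference = list(range(100, 150)) + list(range(300, 340))
    mean, var = overlap_moments(reference, p01=0.01, p10=0.04)
    z, p_enr, p_dep = normal_pvalues(observed=35, mean=mean, variance=var)
    print(f"mean={mean:.2f} var={var:.2f} z={z:.2f} "
          f"p_enrichment={p_enr:.3g} p_depletion={p_dep:.3g}")
```

The one-sided tails correspond to the enrichment and depletion questions posed in the abstract. A context-dependent model would replace the single pair of transition probabilities with context-specific ones, which this sketch does not attempt.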

Publisher

Cold Spring Harbor Laboratory

References (22 articles)

1. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 1997.

2. Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis.

3. GenomeRunner web server: regulatory similarity and differences define the functional impact of SNP sets. Bioinformatics, 2016.

4. Richard Durbin, Sean R. Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.

5. Markov chains improve the significance computation of overlapping genome annotations. Bioinformatics, 2022.
