Abstract
An annotation is a set of genomic intervals sharing a particular function or property. Examples include genes, conserved elements, and epigenetic modifications. A common task is to compare two annotations to determine if one is enriched or depleted in the regions covered by the other. We study the problem of assigning statistical significance to such a comparison based on a null model representing two random unrelated annotations. Previous approaches to this problem remain too slow or inaccurate.

To incorporate more background information into such analyses and avoid biased results, we propose a new null model based on a Markov chain which differentiates among several genomic contexts. These contexts can capture various confounding factors, such as GC content or sequencing gaps. We then develop a new algorithm for estimating p-values by computing the exact expectation and variance of the test statistics and then estimating the p-value using a normal approximation. Compared to the previous algorithm by Gafurov et al., the new algorithm provides three advances: (1) the running time is improved from quadratic to linear or quasi-linear, (2) the algorithm can handle two different test statistics, and (3) the algorithm can handle both simple and context-dependent Markov chain null models.

We demonstrate the efficiency and accuracy of our algorithm on synthetic and real data sets, including the recent human telomere-to-telomere assembly. In particular, our algorithm computed p-values for 450 pairs of human genome annotations using 24 threads in under three hours. The use of genomic contexts to correct for GC-bias also resulted in the reversal of some previously published findings.

Availability: The software is freely available at https://github.com/fmfi-compbio/mcdp2 under the MIT licence. All data for reproducibility are available at https://github.com/fmfi-compbio/mcdp2-reproducibility
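The final step described in the abstract, turning an exact expectation and variance into a p-value through a normal approximation, can be illustrated with a minimal sketch. This is not the mcdp2 implementation: the function names `count_overlapping` and `normal_approx_p_value` are hypothetical, the overlap count is only one plausible test statistic (the abstract does not name the two it supports), and the null-model mean and variance are taken as given inputs rather than derived from a Markov chain.

```python
import math


def count_overlapping(query, reference):
    """Count query intervals [start, end) that overlap any reference interval.

    Naive quadratic scan, for illustration only; it is an assumed example
    statistic, not necessarily one of those used in the paper.
    """
    count = 0
    for q_start, q_end in query:
        if any(r_start < q_end and q_start < r_end
               for r_start, r_end in reference):
            count += 1
    return count


def normal_approx_p_value(observed, mean, variance, alternative="enrichment"):
    """Return (z, p) for an observed statistic under a normal approximation.

    `mean` and `variance` are assumed to be the expectation and variance of
    the statistic under the null model, supplied by the caller.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    z = (observed - mean) / math.sqrt(variance)
    if alternative == "enrichment":
        # Upper tail: P(Z >= z)
        p = 0.5 * math.erfc(z / math.sqrt(2))
    elif alternative == "depletion":
        # Lower tail: P(Z <= z)
        p = 0.5 * math.erfc(-z / math.sqrt(2))
    else:
        raise ValueError("alternative must be 'enrichment' or 'depletion'")
    return z, p


if __name__ == "__main__":
    query = [(0, 5), (10, 15), (20, 25)]
    reference = [(3, 8), (15, 18)]
    observed = count_overlapping(query, reference)
    # The mean and variance below are made-up placeholder values.
    z, p = normal_approx_p_value(observed, mean=0.4, variance=0.3)
    print(f"observed={observed} z={z:.3f} p={p:.4f}")
```

Using `math.erfc` for each tail directly, instead of computing one tail and subtracting from 1, avoids losing precision when the p-value is very small, which is the regime of interest for strongly enriched or depleted annotations.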
Publisher
Cold Spring Harbor Laboratory
References (22 articles)
1. Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 1997.
2. Mind the gaps: overlooking inaccessible regions confounds statistical testing in genome analysis.
3. GenomeRunner web server: regulatory similarity and differences define the functional impact of SNP sets. Bioinformatics, 2016.
4. Richard Durbin, Sean R Eddy, Anders Krogh, and Graeme Mitchison. Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press, 1998.
5. Markov chains improve the significance computation of overlapping genome annotations. Bioinformatics, 2022.