Efficient Computation of Sequence Mappability
Published:2022-02-02
Issue:5
Volume:84
Page:1418-1440
ISSN:0178-4617
Container-title:Algorithmica
language:en
Short-container-title:Algorithmica
Author:
Charalampopoulos Panagiotis, Iliopoulos Costas S., Kociumaka Tomasz, Pissis Solon P., Radoszewski Jakub, Straszyński Juliusz
Abstract
Sequence mappability is an important task in genome resequencing. In the (k, m)-mappability problem, for a given sequence T of length n, the goal is to compute a table whose ith entry is the number of indices $j \ne i$ such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristics computing a rough approximation of the result or on the case of $k=1$. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that, for $k=O(1)$, works in $O(n)$ space and, with high probability, in $O(n \cdot \min\{m^k, \log^k n\})$ time. Our algorithm requires a careful adaptation of the k-errata trees of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. Our technique can also be applied to solve the all-pairs Hamming distance problem introduced by Crochemore et al. [WABI 2017]. We further develop $O(n^2)$-time algorithms to compute all (k, m)-mappability tables for a fixed m and all $k \in \{0, \ldots, m\}$ or a fixed k and all $m \in \{k, \ldots, n\}$. Finally, we show that, for $k, m = \Theta(\log n)$, the (k, m)-mappability problem cannot be solved in strongly subquadratic time unless the Strong Exponential Time Hypothesis fails. This is an improved and extended version of a paper presented at SPIRE 2018.
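To make the problem definition concrete, the following is a minimal Python sketch, not the authors' algorithm. The function names `mappability_naive` and `mappability_all_k` are hypothetical, and positions are 0-based. The first function applies the definition directly in O(n^2 m) time. The second illustrates one simple way to get all tables for a fixed m and every k in quadratic time, by sliding a length-m window along each diagonal of the pair matrix; it is offered only as an illustration of the quadratic bound and may differ from the construction used in the paper.

```python
def mappability_naive(T, m, k):
    """(k, m)-mappability table of T, computed straight from the definition."""
    positions = len(T) - m + 1          # number of length-m substrings
    table = [0] * max(positions, 0)
    for i in range(positions):
        for j in range(i + 1, positions):
            # Hamming distance between T[i:i+m] and T[j:j+m]
            mismatches = sum(1 for a, b in zip(T[i:i + m], T[j:j + m]) if a != b)
            if mismatches <= k:
                # The relation is symmetric, so the pair counts for both entries.
                table[i] += 1
                table[j] += 1
    return table


def mappability_all_k(T, m):
    """Return tables[k][i] for every k in 0..m (illustrative sliding-window sketch)."""
    positions = len(T) - m + 1
    if positions <= 0:
        return [[] for _ in range(m + 1)]
    # tables[d][i] first holds the number of j != i at Hamming distance exactly d.
    tables = [[0] * positions for _ in range(m + 1)]
    for shift in range(1, positions):
        # Distance for the pair (0, shift), computed from scratch.
        d = sum(1 for t in range(m) if T[t] != T[t + shift])
        tables[d][0] += 1
        tables[d][shift] += 1
        # Slide the window along the diagonal: pairs (i, i + shift).
        for i in range(1, positions - shift):
            d -= T[i - 1] != T[i - 1 + shift]          # character pair leaving the window
            d += T[i + m - 1] != T[i + m - 1 + shift]  # character pair entering the window
            tables[d][i] += 1
            tables[d][i + shift] += 1
    # Prefix sums over d turn "exactly d mismatches" into "at most d mismatches".
    for d in range(1, m + 1):
        for i in range(positions):
            tables[d][i] += tables[d - 1][i]
    return tables
```

For example, with T = "abab" and m = 2 the length-2 substrings are "ab", "ba", "ab", so `mappability_naive("abab", 2, 0)` gives `[1, 0, 1]`: only the two occurrences of "ab" match each other exactly.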
Funder
Fundacja na rzecz Nauki Polskiej, Horizon 2020, Israel Science Foundation, National Science Foundation, Alfred P. Sloan Foundation, Narodowe Centrum Nauki
Publisher
Springer Science and Business Media LLC
Subject
Applied Mathematics, Computer Science Applications, General Computer Science
Reference (35 articles)
1. Alamro, H., Ayad, L.A.K., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P.: Longest common prefixes with $k$-mismatches and applications. In: Tjoa, A.M., Bellatreche, L., Biffl, S., van Leeuwen, J., Wiedermann, J. (eds.) 44th International Conference on Current Trends in Theory and Practice of Computer Science, SOFSEM 2018, LNCS, vol. 10706, pp. 636–649. Springer (2018). https://doi.org/10.1007/978-3-319-73117-9_45
2. Alzamel, M., Charalampopoulos, P., Iliopoulos, C.S., Kociumaka, T., Pissis, S.P., Radoszewski, J., Straszyński, J.: Efficient computation of sequence mappability. In: Gagie, T., Moffat, A., Navarro, G., Cuadros-Vargas, E. (eds.) 25th International Symposium on String Processing and Information Retrieval, SPIRE 2018, LNCS, vol. 11147, pp. 12–26. Springer (2018). https://doi.org/10.1007/978-3-030-00479-8_2
3. Alzamel, M., Charalampopoulos, P., Iliopoulos, C.S., Pissis, S.P., Radoszewski, J., Sung, W.: Faster algorithms for 1-mappability of a sequence. Theor. Comput. Sci. 812, 2–12 (2020). https://doi.org/10.1016/j.tcs.2019.04.026
4. Amir, A., Boneh, I., Kondratovsky, E.: The k-mappability problem revisited. In: Gawrychowski, P., Starikovskaya, T. (eds.) 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021, LIPIcs, vol. 191, pp. 5:1–5:20. Schloss Dagstuhl–Leibniz-Zentrum für Informatik (2021). https://doi.org/10.4230/LIPIcs.CPM.2021.5
5. Antoniou, P., Daykin, J.W., Iliopoulos, C.S., Kourie, D., Mouchard, L., Pissis, S.P.: Mapping uniquely occurring short sequences derived from high throughput technologies to a reference genome. In: 9th International Conference on Information Technology and Applications in Biomedicine, ITAB 2009, pp. 1–4. IEEE (2009). https://doi.org/10.1109/itab.2009.5394394