Abstract
AbstractA word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y1, … , yk over an alphabet Σ, we are asked to compute the set $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$
M
{
y
1
,
…
,
y
k
}
ℓ
of minimal absent words of length at most ℓ of the collection {y1, … , yk}. The set $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$
M
{
y
1
,
…
,
y
k
}
ℓ
contains all the words x such that x is absent from all the words of the collection while there exist i,j, such that the maximal proper suffix of x is a factor of yi and the maximal proper prefix of x is a factor of yj. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set $\mathrm {M}^{\ell }_{y}$
M
y
ℓ
of minimal absent words of a word y is equal to $\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$
M
{
y
1
,
…
,
y
k
}
ℓ
for any decomposition of y into a collection of words y1, … , yk such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the plenty available $\mathcal {O}(n)$
O
(
n
)
-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when $\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| =o(n)$
∥
M
{
y
1
,
…
,
y
N
}
ℓ
∥
=
o
(
n
)
, for all N ∈ [1,k], where ∥S∥ denotes the sum of the lengths of words in set S. For instance, in the human genome, n ≈ 3 × 109 but $\| \mathrm {M}^{12}_{\{y_1,\ldots ,y_k\}}\| \approx 10^{6}$
∥
M
{
y
1
,
…
,
y
k
}
12
∥
≈
1
0
6
. We consider a constant-sized alphabet for stating our results. We show that all$\mathrm {M}^{\ell }_{y_{1}},\ldots ,\mathrm {M}^{\ell }_{\{y_1,\ldots ,y_k\}}$
M
y
1
ℓ
,
…
,
M
{
y
1
,
…
,
y
k
}
ℓ
can be computed in $\mathcal {O}(kn+{\sum }^{k}_{N=1}\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| )$
O
(
k
n
+
∑
N
=
1
k
∥
M
{
y
1
,
…
,
y
N
}
ℓ
∥
)
total time using $\mathcal {O}(\textsc {MaxIn}+\textsc {MaxOut})$
O
(
MaxIn
+
MaxOut
)
space, where MaxIn is the length of the longest word in {y1, … , yk} and $\textsc {MaxOut}=\max \limits \{\| \mathrm {M}^{\ell }_{\{y_1,\ldots ,y_N\}}\| :N\in [1,k]\}$
MaxOut
=
max
{
∥
M
{
y
1
,
…
,
y
N
}
ℓ
∥
:
N
∈
[
1
,
k
]
}
. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution.
Funder
Ministero dell’Istruzione, dell’Università e della Ricerca
Publisher
Springer Science and Business Media LLC
Subject
Computational Theory and Mathematics,Theoretical Computer Science