Affiliation:
1. Università di Pisa, Pisa, Italy
2. Università del Piemonte Orientale, Alessandria, Italy
3. University of Helsinki, Finland
4. University of Chile, Santiago, Chile
Abstract
Given a sequence $S = s_1 s_2 \ldots s_n$ of integers smaller than $r = O(\mathrm{polylog}(n))$, we show how $S$ can be represented using $nH_0(S) + o(n)$ bits, so that we can know any $s_q$, as well as answer rank and select queries on $S$, in constant time. $H_0(S)$ is the zero-order empirical entropy of $S$ and $nH_0(S)$ provides an information-theoretic lower bound to the bit storage of any sequence $S$ via a fixed encoding of its symbols. This extends previous results on binary sequences, and improves previous results on general sequences where those queries are answered in $O(\log r)$ time. For larger $r$, we can still represent $S$ in $nH_0(S) + o(n \log r)$ bits and answer queries in $O(\log r / \log\log n)$ time.
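As an illustration of the quantities and queries named above, the following minimal Python sketch computes $H_0(S)$ and answers access, rank and select by direct scanning. It assumes the usual definitions (rank counts occurrences of a symbol in a prefix, select returns the position of the $j$th occurrence). The function names are hypothetical, and the code shows only what the queries mean: it is not the compressed representation of the article, and it runs in linear rather than constant time.

```python
import math
from collections import Counter


def zero_order_entropy(seq):
    """Zero-order empirical entropy H_0(S), in bits per symbol."""
    n = len(seq)
    counts = Counter(seq)
    return sum((c / n) * math.log2(n / c) for c in counts.values())


def access(seq, q):
    """Return s_q, using 1-based positions as in the abstract."""
    return seq[q - 1]


def rank(seq, c, i):
    """Number of occurrences of symbol c among s_1 ... s_i (naive scan)."""
    return seq[:i].count(c)


def select(seq, c, j):
    """1-based position of the j-th occurrence of c, or None if absent."""
    seen = 0
    for pos, s in enumerate(seq, start=1):
        if s == c:
            seen += 1
            if seen == j:
                return pos
    return None


S = [0, 1, 0, 2, 0, 1, 0, 0]
n_h0_bits = len(S) * zero_order_entropy(S)  # the n*H_0(S) term of the space bound
```

For this example sequence, `n_h0_bits` is about 10.4 bits, compared with 16 bits for a plain encoding that spends 2 bits on each of the 8 symbols.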
Another contribution of this article is to show how to combine our compressed representation of integer sequences with a compression boosting technique to design compressed full-text indexes that scale well with the size of the input alphabet $\Sigma$. Specifically, we design a variant of the FM-index that indexes a string $T[1,n]$ within $nH_k(T) + o(n)$ bits of storage, where $H_k(T)$ is the $k$th-order empirical entropy of $T$. This space bound holds simultaneously for all $k \le \alpha \log_{|\Sigma|} n$, constant $0 < \alpha < 1$, and $|\Sigma| = O(\mathrm{polylog}(n))$. This index counts the occurrences of an arbitrary pattern $P[1,p]$ as a substring of $T$ in $O(p)$ time; it locates each pattern occurrence in $O(\log^{1+\varepsilon} n)$ time for any constant $0 < \varepsilon < 1$; and reports a text substring of length $\ell$ in $O(\ell + \log^{1+\varepsilon} n)$ time.
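To make the role of rank in pattern counting concrete, here is a minimal Python sketch of the standard FM-index backward search over the Burrows-Wheeler transform of a text. It is an assumption-laden illustration and not the index designed in the article: the transform is built by sorting all rotations, rank is a naive prefix scan, and the helper names `bwt` and `count_occurrences` are hypothetical. The loop performs two rank queries per pattern symbol, which is why a constant-time rank structure gives $O(p)$ counting.

```python
def bwt(text, end="\x00"):
    """Burrows-Wheeler transform via sorted rotations (naive construction).

    Assumes `end` does not occur in `text` and sorts before every symbol.
    """
    t = text + end
    rotations = sorted(t[i:] + t[:i] for i in range(len(t)))
    return "".join(rot[-1] for rot in rotations)


def count_occurrences(bwt_str, pattern):
    """Count occurrences of `pattern` in the text by backward search."""
    # C[c] = number of symbols in the text that are strictly smaller than c
    C = {}
    total = 0
    for c in sorted(set(bwt_str)):
        C[c] = total
        total += bwt_str.count(c)

    sp, ep = 0, len(bwt_str)  # half-open range [sp, ep) of matching rows
    for c in reversed(pattern):
        if c not in C:
            return 0
        # Two rank queries per symbol; here each one is a linear scan.
        sp = C[c] + bwt_str[:sp].count(c)
        ep = C[c] + bwt_str[:ep].count(c)
        if sp >= ep:
            return 0
    return ep - sp


L = bwt("mississippi")
ssi_count = count_occurrences(L, "ssi")
```

Replacing the two `count` calls with constant-time rank queries on the transformed text is the step where a sequence representation like the one described in the first paragraph would plug in.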
Compared to all previous works, our index is the first that removes the alphabet-size dependence from all query times; in particular, counting time is linear in the pattern length. Still, our index uses essentially the same space as the $k$th-order entropy of the text $T$, which is the best space obtained in previous work. We can also handle larger alphabets of size $|\Sigma| = O(n^{\beta})$, for any $0 < \beta < 1$, by paying $o(n \log|\Sigma|)$ extra space and multiplying all query times by $O(\log|\Sigma| / \log\log n)$.
Publisher
Association for Computing Machinery (ACM)
Subject
Mathematics (miscellaneous)
Cited by
247 articles.