Affiliation:
1. University of Pisa, Italy
2. ISTI-CNR, Pisa, Italy
Abstract
The Permuterm index [Garfield 1976] is a time-efficient and elegant solution to the string dictionary problem in which pattern queries may include one wild-card symbol (the Tolerant Retrieval problem). Unfortunately, the Permuterm index is space inefficient because it quadruples the dictionary size. In this article we propose the Compressed Permuterm Index, which solves the Tolerant Retrieval problem in time proportional to the length of the searched pattern and in space close to the k-th order empirical entropy of the indexed dictionary. We also design a dynamic version of this index that efficiently supports the insertion and deletion of individual strings in the dictionary.
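To make the wild-card query concrete, the following is a minimal sketch of the classic, uncompressed Permuterm idea, not the compressed index proposed in the article. It assumes that `$` does not occur in any term and that a query contains at most one `*`. The function names `build_permuterm`, `rotate_query` and `query` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left


def build_permuterm(dictionary):
    """Return a sorted list of (rotation, term) pairs.

    Each term is terminated with '$' (assumed absent from the terms)
    and every cyclic rotation of the terminated term is stored.
    """
    entries = []
    for term in dictionary:
        augmented = term + "$"
        for i in range(len(augmented)):
            entries.append((augmented[i:] + augmented[:i], term))
    entries.sort()
    return entries


def rotate_query(pattern):
    """Map a pattern with at most one '*' to (search key, is_prefix_search).

    alpha*beta becomes the prefix key beta + '$' + alpha.
    A pattern without '*' becomes the exact key pattern + '$'.
    """
    if pattern.count("*") > 1:
        raise ValueError("this sketch supports at most one wild-card")
    if "*" not in pattern:
        return pattern + "$", False
    alpha, beta = pattern.split("*")
    return beta + "$" + alpha, True


def query(entries, pattern):
    """Return the sorted list of terms matching the pattern."""
    key, is_prefix = rotate_query(pattern)
    # All rotations that start with `key` are contiguous in sorted order.
    i = bisect_left(entries, (key, ""))
    matches = set()
    while i < len(entries) and entries[i][0].startswith(key):
        if is_prefix or entries[i][0] == key:
            matches.add(entries[i][1])
        i += 1
    return sorted(matches)


words = ["hello", "help", "yellow", "shell"]
index = build_permuterm(words)
print(query(index, "hel*"))  # prefix query
print(query(index, "*llo"))  # suffix query
print(query(index, "h*o"))   # prefix-suffix query
```

Storing every rotation explicitly is what makes the plain Permuterm index large: a term of length n contributes n + 1 rotations. That overhead is the space inefficiency the abstract refers to.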
The result is based on a simple variant of the Burrows-Wheeler Transform, defined on a dictionary of strings of variable length, that makes it possible to solve the Tolerant Retrieval problem efficiently via known (dynamic) compressed indexes [Navarro and Mäkinen 2007]. We complement our theoretical study with a significant set of experiments showing that the Compressed Permuterm Index supports fast queries within a space occupancy close to that achievable by compressing the string dictionary with gzip or bzip. This improves on known approaches based on Front-Coding [Witten et al. 1999] by more than 50% in absolute space occupancy while still guaranteeing comparable query time.
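For readers unfamiliar with the Burrows-Wheeler Transform, the sketch below shows the textbook transform on a single string, computed naively by sorting all rotations. It is not the dictionary variant defined in the article, whose details are not given in this abstract. The sentinel `$` is assumed not to occur in the input, and the names `bwt` and `inverse_bwt` are hypothetical.

```python
def bwt(text, sentinel="$"):
    """Textbook BWT: last column of the sorted rotations of text + sentinel."""
    s = text + sentinel  # sentinel assumed absent from text
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last, sentinel="$"):
    """Invert the BWT by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The original string is the row that ends with the sentinel.
    for row in table:
        if row.endswith(sentinel):
            return row[:-1]
    raise ValueError("sentinel not found in transformed string")


transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

Because the transform is built from sorted rotations, it relates naturally to the rotation-based Permuterm index. Practical compressed indexes avoid materializing the rotations, which this naive version does for clarity only.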
Publisher
Association for Computing Machinery (ACM)
Subject
Mathematics (miscellaneous)
Cited by
39 articles.