Word-based self-indexes for natural language text

Published:2012-02 Issue:1 Volume:30 Page:1-34
ISSN:1046-8188
Container-title:ACM Transactions on Information Systems
language:en
Short-container-title:ACM Trans. Inf. Syst.

Author:

Fariña Antonio¹,Brisaboa Nieves R.¹,Navarro Gonzalo²,Claude Francisco³,Places Ángeles S.¹,Rodríguez Eduardo¹

Affiliation:

1. University of A Coruña, Spain

2. University of Chile, Chile

3. University of Waterloo, Canada

Abstract

The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression of arbitrary passages, but efficient word and phrase searches. Searches are orders of magnitude faster than those over inverted indexes when looking for phrases, and still faster on single-word searches when little space is available. Our new indexes are particularly fast at counting the occurrences of words or phrases. This is useful for computing relevance of words or phrases. We adapt self-indexes that succeeded in indexing arbitrary strings within compressed space to deal with large alphabets. Natural language texts are then regarded as sequences of words, not characters, to achieve word-based self-indexes. We design an architecture that separates the searchable sequence from its presentation aspects. This permits applying case folding, stemming, removing stopwords, etc. as is usual on inverted indexes.
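The idea of treating text as a sequence of words, and of counting phrase occurrences directly on an index, can be illustrated with a minimal sketch. It is not the article's method: a plain, uncompressed suffix array over word identifiers stands in for the compressed self-indexes the abstract describes. The function names (`tokenize`, `build_word_suffix_array`, `count_phrase`) and the simple case-folding tokenizer are assumptions made for illustration only.

```python
import re


def tokenize(text):
    """Case-fold and split into word tokens (a hypothetical normalization step)."""
    return re.findall(r"\w+", text.lower())


def build_word_suffix_array(text):
    """Map each word to an integer id and sort all word-level suffixes."""
    words = tokenize(text)
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    ids = [vocab[w] for w in words]
    # Naive construction for clarity; real indexes use far more efficient builders.
    sa = sorted(range(len(ids)), key=lambda i: ids[i:])
    return vocab, ids, sa


def _bound(ids, sa, pattern, upper):
    """Binary search for the first suffix whose prefix is >= (or >) the pattern."""
    lo, hi = 0, len(sa)
    m = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = ids[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count_phrase(vocab, ids, sa, phrase):
    """Count occurrences of a word or phrase without scanning the text."""
    words = tokenize(phrase)
    if not words or any(w not in vocab for w in words):
        return 0
    pattern = [vocab[w] for w in words]
    # All suffixes starting with the phrase form one contiguous range in sa.
    return _bound(ids, sa, pattern, True) - _bound(ids, sa, pattern, False)


vocab, ids, sa = build_word_suffix_array(
    "To be or not to be that is the question to be"
)
print(count_phrase(vocab, ids, sa, "to be"))
print(count_phrase(vocab, ids, sa, "not to be"))
```

In this sketch a phrase is counted with two binary searches, so the cost depends on the phrase length and the logarithm of the text length rather than on how often the phrase occurs, and no per-word list intersection is needed. The space savings reported in the abstract come from compressing such a structure, which this sketch does not attempt.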

Funder

Ministerio de Ciencia e Innovación

Fondo Nacional de Desarrollo Científico y Tecnológico

Xunta de Galicia

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications; General Business, Management and Accounting; Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/2094072.2094073

Cited by 38 articles.

1. TRGST: An enhanced generalized suffix tree for topological relations between paths;Information Systems;2024-11

2. On Entropy and Source Encoding of Written Language: A South Slavic Example;2023 31st Telecommunications Forum (TELFOR);2023-11-21

3. Compressed and queryable self-indexes for RDF archives;Knowledge and Information Systems;2023-08-29

4. Space/time-efficient RDF stores based on circular suffix sorting;The Journal of Supercomputing;2022-10-25

5. Join optimization for inverted index technique on relational database management systems;Expert Systems with Applications;2022-07