Handling Massive N-Gram Datasets Efficiently

Authors:

Giulio Ermanno Pibiri 1, Rossano Venturini 1

Affiliation:

1. University of Pisa and ISTI-CNR, Pisa, Italy

Abstract

Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and associated satellite values without compromising their retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source. Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation. Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to the state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such a context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not occupy less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor but also retains an advantage of up to 65% in absolute space. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity.
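The context-based remapping described above can be illustrated with a minimal sketch (not the authors' actual implementation): for each context of k preceding words, every distinct following word is re-encoded as its rank among the successors of that context. Because natural-language contexts have few distinct successors, the resulting integers are small and highly compressible. All names below are illustrative.

```python
from collections import defaultdict

def remap_ngrams(ngrams, k=1):
    """Re-encode the last word of each n-gram as its rank among the
    distinct words observed after the same length-k context.
    `ngrams` is an iterable of tuples of words (or word ids)."""
    successors = defaultdict(dict)  # context -> {word: small integer rank}
    remapped = []
    for gram in ngrams:
        context, word = gram[-k - 1:-1], gram[-1]
        ranks = successors[context]
        if word not in ranks:
            ranks[word] = len(ranks)  # next free small integer for this context
        remapped.append(gram[:-1] + (ranks[word],))
    return remapped, successors

grams = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("the", "cat")]
codes, _ = remap_ngrams(grams, k=1)
# codes: [("the", 0), ("the", 1), ("a", 0), ("the", 0)]
```

Note how "cat" receives code 0 after both "the" and "a": the code space is local to each context, so its maximum value is bounded by the number of distinct successors rather than by the vocabulary size.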
Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times on the total runtime of the previous approach.

Funder

European Union's Horizon 2020 research and innovation program under the Information and Communication Technologies program

PEGASO project

BIGDATAGRAPES project

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications, General Business, Management and Accounting, Information Systems

Cited by 21 articles.

1. Improving the training and application of Chinese language models for low-order grammar;International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024);2024-06-13

2. On weighted k-mer dictionaries;Algorithms for Molecular Biology;2023-06-17

3. Locality-preserving minimal perfect hashing of k-mers;Bioinformatics;2023-06-01

4. Deep learning from physicochemical information of concrete with an artificial language for property prediction and reaction discovery;Resources, Conservation and Recycling;2023-03

5. Engineering faster double-array Aho-Corasick automata;Software: Practice and Experience;2023-02-21
