Handling Massive N-Gram Datasets Efficiently

Authors:

Giulio Ermanno Pibiri 1, Rossano Venturini 1

Affiliation:

1. University of Pisa and ISTI-CNR, Pisa, Italy

Abstract

Two fundamental problems concern the handling of large n-gram language models: indexing, that is, compressing the n-grams and associated satellite values without compromising their retrieval speed, and estimation, that is, computing the probability distribution of the n-grams extracted from a large textual source. Performing these two tasks efficiently is vital for several applications in the fields of Information Retrieval, Natural Language Processing, and Machine Learning, such as auto-completion in search engines and machine translation. Regarding the problem of indexing, we describe compressed, exact, and lossless data structures that simultaneously achieve high space reductions and no time degradation with respect to the state-of-the-art solutions and related software packages. In particular, we present a compressed trie data structure in which each word of an n-gram following a context of fixed length k, that is, its preceding k words, is encoded as an integer whose value is proportional to the number of words that follow such a context. Since the number of words following a given context is typically very small in natural languages, we lower the space of representation to compression levels that were never achieved before, allowing the indexing of billions of strings. Despite the significant savings in space, our technique introduces a negligible penalty at query time. Specifically, the most space-efficient competitors in the literature, which are both quantized and lossy, do not occupy less space than our trie data structure and are up to 5 times slower. Conversely, our trie is as fast as the fastest competitor but also retains an advantage of up to 65% in absolute space. Regarding the problem of estimation, we present a novel algorithm for estimating modified Kneser-Ney language models, which have emerged as the de-facto choice for language modeling in both academia and industry thanks to their relatively low perplexity.
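The context-based remapping described above can be illustrated with a minimal sketch (not the authors' actual implementation): for each context of k preceding words, every distinct following word is re-encoded as its rank among the successors of that context. Because natural-language contexts have few distinct successors, the resulting integers are small and highly compressible. All names below are illustrative.

```python
from collections import defaultdict

def remap_ngrams(ngrams, k=1):
    """Re-encode the last word of each n-gram as its rank among the
    distinct words observed after the same length-k context.
    `ngrams` is an iterable of tuples of words (or word ids)."""
    successors = defaultdict(dict)  # context -> {word: small integer rank}
    remapped = []
    for gram in ngrams:
        context, word = gram[-k - 1:-1], gram[-1]
        ranks = successors[context]
        if word not in ranks:
            ranks[word] = len(ranks)  # next free small integer for this context
        remapped.append(gram[:-1] + (ranks[word],))
    return remapped, successors

grams = [("the", "cat"), ("the", "dog"), ("a", "cat"), ("the", "cat")]
codes, _ = remap_ngrams(grams, k=1)
# codes: [("the", 0), ("the", 1), ("a", 0), ("the", 0)]
```

Note how "cat" receives code 0 after both "the" and "a": the code space is local to each context, so its maximum value is bounded by the number of distinct successors rather than by the vocabulary size.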
Estimating such models from large textual sources poses the challenge of devising algorithms that make a parsimonious use of the disk. The state-of-the-art algorithm uses three sorting steps in external memory: we show an improved construction that requires only one sorting step by exploiting the properties of the extracted n-gram strings. With an extensive experimental analysis performed on billions of n-grams, we show an average improvement of 4.5 times on the total runtime of the previous approach.

Funder

European Union's Horizon 2020 research and innovation program under the Information and Communication Technologies program

PEGASO project

BIGDATAGRAPES project

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications, General Business, Management and Accounting, Information Systems

Cited by 21 articles.

1. Improving the training and application of Chinese language models for low-order grammar;International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2024);2024-06-13

2. On weighted k-mer dictionaries;Algorithms for Molecular Biology;2023-06-17

3. Locality-preserving minimal perfect hashing of k-mers;Bioinformatics;2023-06-01

4. Deep learning from physicochemical information of concrete with an artificial language for property prediction and reaction discovery;Resources, Conservation and Recycling;2023-03

5. Engineering faster double-array Aho-Corasick automata;Software: Practice and Experience;2023-02-21
