Affiliations:
1. Language Technology Lab, DTAL, University of Cambridge
2. Johns Hopkins University
3. Faculty of Industrial Engineering and Management, Technion, Israel Institute of Technology
Abstract
Neural architectures are prominent in the construction of language models (LMs). However, word-level prediction is typically agnostic to subword-level information (characters and character sequences) and operates over a closed vocabulary consisting of a limited word set. While subword-aware models boost performance across a variety of NLP tasks, prior work has not evaluated the ability of these models to assist next-word prediction in language modeling. Such subword-informed models should be particularly effective for morphologically rich languages (MRLs), which exhibit high type-to-token ratios. In this work, we present a large-scale LM study on 50 typologically diverse languages covering a wide variety of morphological systems, and we offer new subword-aware LM benchmarks to the community. The main technical contribution of our work is a novel method for injecting subword-level information into semantic word vectors, integrated into neural LM training to facilitate word-level prediction. We conduct experiments in an LM setting where the number of infrequent words is large, and we demonstrate strong perplexity gains across our 50 languages, especially for MRLs. Our code and data sets are publicly available.
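To make the general idea of subword-informed word vectors concrete, here is a minimal, illustrative sketch; it is not the paper's actual method, only one common way to inject character-level information into a word embedding (a standard word embedding summed with a character-CNN encoding of the word's spelling). All names and dimensions here (CharAwareEmbedding, word_dim, char_dim) are hypothetical choices for the example.

```python
import torch
import torch.nn as nn

class CharAwareEmbedding(nn.Module):
    """Word vector = word embedding + char-CNN summary of the word's spelling."""
    def __init__(self, vocab_size, char_vocab_size, word_dim=128, char_dim=32):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim)
        self.char_emb = nn.Embedding(char_vocab_size, char_dim, padding_idx=0)
        # 1-D convolution over each word's character sequence
        self.conv = nn.Conv1d(char_dim, word_dim, kernel_size=3, padding=1)

    def forward(self, word_ids, char_ids):
        # word_ids: (batch,)   char_ids: (batch, max_word_len)
        w = self.word_emb(word_ids)                      # (batch, word_dim)
        c = self.char_emb(char_ids).transpose(1, 2)      # (batch, char_dim, len)
        c = torch.relu(self.conv(c)).max(dim=2).values   # (batch, word_dim)
        return w + c  # subword-informed word vector

# Usage: embed a word (word id 7) from its character ids.
emb = CharAwareEmbedding(vocab_size=10000, char_vocab_size=100)
vec = emb(torch.tensor([7]), torch.tensor([[3, 1, 20, 19]]))
print(vec.shape)  # torch.Size([1, 128])
```

Because the character encoder shares parameters across the whole vocabulary, rare and unseen inflected forms, which are common in MRLs with high type-to-token ratios, still receive informative vectors from their spelling alone.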
Cited by: 13 articles.