An intelligent extension of the training set for the Persian n-gram language model: an enrichment algorithm
Published: 2023
Issue: 61
Page: 191-211
Container-title: Onomázein: Revista de lingüística, filología y traducción
Short-container-title: Onomázein
Author:
Motavallian, Rezvan
Komeily, Masoud
Abstract
In this article, we introduce an automatic mechanism to intelligently extend the training set and thereby improve an n-gram language model of Persian. Given Persian's free word order, our enrichment algorithm diversifies the n-gram combinations in the baseline training data through dependency reordering, adding permissible sentences and filtering out ungrammatical ones with a hybrid empirical (heuristic) and linguistic approach. Experiments on a baseline training set (taken from a standard Persian corpus) and the resulting enriched training set show a decline in average relative perplexity (between 34% and 73%) for informal/spoken versus formal/written Persian test data.
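The intuition behind the enrichment effect can be illustrated with a minimal add-alpha smoothed bigram model. This is a sketch under stated assumptions, not the paper's method: the function name `bigram_perplexity`, the smoothing scheme, and the toy Persian-like sentences are all hypothetical, and the paper's actual reordering and filtering rules are not reproduced here. The sketch only shows the mechanism the abstract relies on: adding a grammatical reordered variant of a training sentence lowers perplexity on test data that uses that word order.

```python
import math
from collections import Counter

def bigram_perplexity(train_sents, test_sents, alpha=1.0):
    """Perplexity of an add-alpha smoothed bigram model (illustrative sketch)."""
    # Vocabulary includes sentence-boundary markers.
    vocab = {w for s in train_sents for w in s} | {"<s>", "</s>"}
    V = len(vocab)
    unigrams, bigrams = Counter(), Counter()
    for s in train_sents:
        toks = ["<s>"] + s + ["</s>"]
        unigrams.update(toks[:-1])               # history counts
        bigrams.update(zip(toks[:-1], toks[1:])) # bigram counts
    log_prob, n = 0.0, 0
    for s in test_sents:
        toks = ["<s>"] + s + ["</s>"]
        for w1, w2 in zip(toks[:-1], toks[1:]):
            # Add-alpha smoothed conditional probability P(w2 | w1).
            p = (bigrams[(w1, w2)] + alpha) / (unigrams[w1] + alpha * V)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)

# Hypothetical toy data: a baseline sentence and a reordered variant
# (standing in for a dependency-reordered, grammatical Persian sentence).
baseline = [["ali", "ketab", "ra", "xarid"]]
reordered = [["ketab", "ra", "ali", "xarid"]]
test = [["ketab", "ra", "ali", "xarid"]]

ppl_base = bigram_perplexity(baseline, test)
ppl_enriched = bigram_perplexity(baseline + reordered, test)
# The enriched training set yields lower perplexity on the reordered test data.
```

Every bigram of the reordered test sentence is unseen under the baseline model, so its probability mass comes only from smoothing; after enrichment those bigrams are observed, and perplexity drops, which is the mechanism the reported 34%-73% relative perplexity reduction rests on.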
Publisher
Pontificia Universidad Catolica de Chile
Subject
General Medicine, General Earth and Planetary Sciences, General Environmental Science, Ocean Engineering