Weighted finite-state transducers for normalization of historical texts

Published:2019-03 Issue:2 Volume:25 Page:307-321
ISSN:1351-3249
Container-title:Natural Language Engineering
language:en
Short-container-title:Nat. Lang. Eng.

Author:

Etxeberria Izaskun, Alegria Iñaki, Uria Larraitz

Abstract

This paper presents a study of methods for the normalization of historical texts. The aim of these methods is to learn relations between historical and contemporary word forms. We have compiled training and test corpora for different languages and scenarios, and we have tried to interpret the results in relation to the features of the corpora and languages. Our proposed method, based on weighted finite-state transducers, is compared to previously published ones. Our method learns to map phonological changes using a noisy channel model; it is a simple solution that can use a limited amount of supervision in order to achieve adequate performance. The compiled corpora are ready to be used by other researchers in order to compare results. Concerning the amount of supervision for the task, we investigate how the size of the training corpus affects the results and identify some interesting factors to anticipate the difficulty of the task.

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software

References: 35 articles.

1. OpenFst: A General and Efficient Weighted Finite-State Transducer Library

2. Modernising historical Slovene words

Cited by 4 articles.

1. Normalization of a Historic Western Ukrainian orthographic system Zhelekhivka in the Ukrainian Language Reference Corpus (GRAC);2023 IEEE 18th International Conference on Computer Science and Information Technologies (CSIT);2023-10-19

2. The first annotated corpus of historical Basque;Digital Scholarship in the Humanities;2021-10-19

3. Spelling Normalisation of Basque Historical Texts;PROCES LENG NAT;2019

4. How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts;Natural Language Engineering;2019-09