Author:
Afanasev Ilia, Lyashevskaya Olga, Rebrikov Stefan, Shishkina Yana, Trofimov Igor, Vlasova Natalia
Abstract
The need to develop tools for historical and regional language varieties is becoming more urgent in natural language processing. In this paper, we present two candidate systems for lemmatising historical East Slavic lects (Late Old East Slavic and Middle Russian) as well as modern regional East Slavic lects (Belogornoje and Megra): a BERT-based end-to-end pipeline with language-specific heuristics and a sequence-to-sequence BART-based encoder-decoder. To evaluate their predictions, we use accuracy and string similarity measures such as Levenshtein distance. The BERT-based model is better suited to the regional data, achieving 85% accuracy there but only 74% on the historical data. The BART-based model reaches 92.6% accuracy on the historical data, yet only 80% on the regional data. We provide an error analysis and discuss ways to enhance the models, such as dictionary lookup and a spellchecker.
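The two kinds of evaluation measure named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names (`levenshtein`, `accuracy`, `mean_levenshtein`) and the toy lemma lists are assumptions made for illustration, and the paper may normalise or aggregate distances differently.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def _check(predicted: Sequence[str], gold: Sequence[str]) -> None:
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and of equal length")


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Share of tokens whose predicted lemma matches the gold lemma exactly."""
    _check(predicted, gold)
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


def mean_levenshtein(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Average edit distance between predicted and gold lemmas."""
    _check(predicted, gold)
    return sum(levenshtein(p, g) for p, g in zip(predicted, gold)) / len(gold)


# Hypothetical toy data, not taken from the paper's corpora.
gold_lemmas = ["город", "вода", "идти"]
predicted_lemmas = ["город", "воды", "идти"]

print(accuracy(predicted_lemmas, gold_lemmas))
print(mean_levenshtein(predicted_lemmas, gold_lemmas))
```

Accuracy treats a lemma that is one character off the same as a completely wrong one, whereas the edit distance shows how close a near miss is, which is presumably why the two are reported together.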
Subject
Linguistics and Language, Language and Linguistics