How to tag non-standard language: Normalisation versus domain adaptation for Slovene historical and user-generated texts

Published: 2019-09 Issue: 5 Volume: 25 Pages: 651-674
ISSN: 1351-3249
Container-title: Natural Language Engineering
Language: en
Short-container-title: Nat. Lang. Eng.

Author:

Zupan Katja, Ljubešić Nikola, Erjavec Tomaž

Abstract

Part-of-speech (PoS) tagging of non-standard language with models developed for standard language is known to suffer from a significant decrease in accuracy. Two methods are typically used to improve it: word normalisation, which decreases the out-of-vocabulary rate of the PoS tagger, and domain adaptation, where the tagger is made aware of the non-standard language variation, either through supervision via non-standard data being added to the tagger’s training set, or via distributional information calculated from raw texts. This paper investigates the two approaches, normalisation and domain adaptation, on carefully constructed data sets encompassing historical and user-generated Slovene texts, focusing in particular on the amount of labour necessary to produce the manually annotated data sets for each approach and comparing the resulting PoS accuracy. We give quantitative as well as qualitative analyses of the tagger performance in various settings, showing that on our data set closed- and open-class words exhibit significantly different behaviours, and that even small inconsistencies in the PoS tags in the data have an impact on the accuracy. We also show that, to improve tagging accuracy, it is best to concentrate on obtaining manually annotated normalisation training data for short annotation campaigns, while manually producing in-domain training sets for PoS tagging is better when a more substantial annotation campaign can be undertaken. Finally, unsupervised adaptation via Brown clustering is similarly useful regardless of the size of the training data available, but improvements tend to be bigger when adaptation is performed via in-domain tagging data.
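
The abstract's claim that word normalisation decreases the tagger's out-of-vocabulary (OOV) rate can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names `oov_rate` and `normalise`, the toy vocabulary, and the three colloquial-to-standard mappings are assumptions made up for illustration, and real normalisers are learned from annotated data rather than looked up in a fixed table.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of tokens that are absent from the vocabulary."""
    if not tokens:
        return 0.0
    unseen = sum(1 for tok in tokens if tok.lower() not in vocabulary)
    return unseen / len(tokens)


def normalise(tokens, mapping):
    """Replace each token by its standard form when the mapping knows one."""
    return [mapping.get(tok.lower(), tok) for tok in tokens]


# Hypothetical vocabulary of a tagger trained on standard text (toy data).
vocabulary = {"jaz", "tudi", "zdaj", "grem", "domov"}

# Illustrative non-standard input and a hand-written normalisation table.
raw = ["jst", "tud", "zdej", "grem", "domov"]
mapping = {"jst": "jaz", "tud": "tudi", "zdej": "zdaj"}

normalised = normalise(raw, mapping)
print(oov_rate(raw, vocabulary))         # OOV rate before normalisation
print(oov_rate(normalised, vocabulary))  # OOV rate after normalisation
```

In this toy case three of the five raw tokens are unknown to the tagger, and all of them become known after normalisation. Domain adaptation takes the opposite route: it leaves the text as it is and extends what the tagger knows.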

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software


Cited by 2 articles.

1. A Large-scale Non-standard English Database and Transformer-based Translation System;2023 IEEE 22nd International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom);2023-11-01

2. Natural language processing for similar languages, varieties, and dialects: A survey;Natural Language Engineering;2020-11