Low-Resource Active Learning of Morphological Segmentation
Published:2016-03-13
Volume:4
Page:47-72
ISSN:2000-1533
Container-title:Northern European Journal of Language Technology
Short-container-title:NEJLT
Author:
Grönroos Stig-Arne,Hiovain Katri,Smit Peter,Rauhala Ilona,Jokinen Kristiina,Kurimo Mikko,Virpioja Sami
Abstract
Many Uralic languages have a rich morphological structure, but lack the morphological analysis tools needed for efficient language processing. While creating a high-quality morphological analyzer requires a significant amount of expert labor, data-driven approaches may provide sufficient quality for many applications. We study how to create a statistical model for morphological segmentation with a large unannotated corpus and a small amount of annotated word forms selected using an active learning approach. We apply the procedure to two Finno-Ugric languages: Finnish and North Sámi. The semi-supervised Morfessor FlatCat method is used for statistical learning. For Finnish, we set up a simulated scenario to test various active learning query strategies. The best performance is provided by a coverage-based strategy on word-initial and word-final substrings. For North Sámi, we collect a set of human-annotated data. With 300 words annotated with our active learning setup, we see a relative improvement in morph boundary F1-score of 19% compared to unsupervised learning and 7.8% compared to random selection.
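The abstract reports results as relative improvements in morph boundary F1-score. As an illustration only, the following minimal Python sketch shows one common way such a score can be computed: each segmentation is reduced to its set of internal boundary positions, and precision and recall are micro-averaged over all words. The function names, the micro-averaging choice, and the example words are assumptions made here and are not taken from the article, whose exact evaluation protocol may differ.

```python
def boundaries(segmentation):
    """Return the set of internal boundary positions for a list of morphs."""
    positions = set()
    offset = 0
    # The end of the last morph is the word end, not an internal boundary.
    for morph in segmentation[:-1]:
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_f1(predicted, reference):
    """Micro-averaged boundary precision, recall and F1 (an assumed averaging scheme)."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        p, r = boundaries(pred), boundaries(ref)
        tp += len(p & r)   # boundaries found in both
        fp += len(p - r)   # predicted but not in the reference
        fn += len(r - p)   # in the reference but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def relative_improvement(new_score, baseline_score):
    """Relative change of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score


# Hypothetical toy data: each word is a list of morphs.
reference = [["talo", "ssa"], ["kirja", "sto", "i", "ssa"]]
predicted = [["talo", "ssa"], ["kirjasto", "issa"]]
precision, recall, f1 = boundary_f1(predicted, reference)
```

Under this reading, a relative improvement of 19% means that the new F1-score divided by the baseline F1-score is 1.19; it is not a difference of 19 percentage points.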
Publisher
Linköping University Electronic Press
Subject
General Materials Science