Morphological Tagging and Lemmatization in the Albanian Language

Author:

Mati Diellza Nagavci¹, Hamiti Mentor¹, Mollakuqe Elissa²

Affiliation:

1. Faculty of Contemporary Sciences and Technologies, South East European University, Tetovo, Republic of North Macedonia

2. Faculty of Information Sciences and Computer Engineering, University Ss. Cyril and Methodius, Skopje, Republic of North Macedonia

Abstract

Part-of-speech tagging is an important element of natural language processing. Fine-grained word-class annotations enrich the word forms in a text and can be used in downstream processes such as dependency parsing. Linguists and lexicographers also benefit greatly from the improved search options that tagged data offers. Natural language processing research is becoming increasingly popular and important as unsupervised learning methods are developed. Some aspects of the Albanian language make the creation of a part-of-speech tag set challenging. This research discusses those linguistic phenomena and proposes a part-of-speech tag set that can adequately represent them. The corpus contains more than 250,000 tokens, each annotated with a medium-sized tag set, and the syntagmatic aspects of Albanian are adequately represented. The paper also presents morphologically and part-of-speech tagged corpora for Albanian, as well as a lemmatizer and a neural morphological tagger trained on these corpora. On the held-out evaluation set, the model achieves 93.65% accuracy on part-of-speech tagging, 85.31% on morphological tagging, and 88.95% on lemmatization. Furthermore, the TF-IDF technique is used to weight terms, and the resulting scores highlight words that carry additional information in the Albanian corpus.
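The TF-IDF weighting mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which TF-IDF variant the authors used, so the version below is an assumption: term frequency is the relative count within a document, and inverse document frequency is the unsmoothed log(N / df). The function name `tf_idf` and the three toy documents are hypothetical and are not taken from the paper's corpus.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: score} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


# Toy, already-tokenized documents (illustrative only).
docs = [
    ["libri", "dhe", "shkolla"],
    ["libri", "dhe", "qyteti"],
    ["dhe", "mali"],
]
weights = tf_idf(docs)
# Highest-scoring term per document.
top_terms = [max(w, key=w.get) for w in weights]
```

In this sketch a token that occurs in every document (here "dhe") receives a weight of zero, while tokens confined to a single document score highest, which is the sense in which the scores highlight words that carry additional information.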

Publisher

Walter de Gruyter GmbH

