Affiliation:
1. University of Tsukuba, Tsukuba, Japan
Abstract
This paper proposes a method for Named-Entity Recognition (NER) in a low-resource language, Tigrinya, using a pre-trained language model. Tigrinya is a morphologically rich language, yet it remains one of the most underrepresented languages in the field of NLP, mainly due to the limited amount of annotated data available. To address this problem, we present the first publicly available NER datasets for Tigrinya, comprising two manually annotated versions, V1 and V2, which contain 69,309 and 40,627 tokens, respectively; the annotations follow the CoNLL 2003 Beginning, Inside, and Outside (BIO) tagging schema. Specifically, we develop a new pre-trained language model for Tigrinya based on RoBERTa, which we refer to as TigRoBERTa. The model is then fine-tuned on the downstream NER and POS tagging tasks with limited data. Finally, we further enhance performance by applying semi-supervised self-training using unlabeled data. The experimental results show that the method achieves an 84% F1-score for NER and 92% accuracy for POS tagging, which is better than or comparable to the baseline method based on a CNN-BiLSTM-CRF.
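For illustration, the sketch below shows how a RoBERTa-style encoder can be fine-tuned for token classification with BIO tags using the Hugging Face transformers library, the general approach the abstract describes. The model name "tigroberta-base", the entity label set, and the example sentence are placeholders, not the paper's actual checkpoint or data.

```python
# A minimal sketch of fine-tuning a RoBERTa-style encoder for BIO-tagged NER.
# "tigroberta-base" is a placeholder model name; labels and the sentence are
# illustrative only, not taken from the paper's datasets.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("tigroberta-base")  # placeholder
model = AutoModelForTokenClassification.from_pretrained(
    "tigroberta-base", num_labels=len(labels)
)

# One BIO-annotated sentence, already split into words (CoNLL-style).
words = ["Abraham", "visited", "Asmara", "."]
word_labels = ["B-PER", "O", "B-LOC", "O"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")

# Align word-level BIO tags to subword tokens: label only the first subword
# of each word; mask special tokens and continuation subwords with -100.
aligned, prev = [], None
for wid in enc.word_ids(batch_index=0):
    if wid is None or wid == prev:
        aligned.append(-100)
    else:
        aligned.append(label2id[word_labels[wid]])
    prev = wid

loss = model(**enc, labels=torch.tensor([aligned])).loss
loss.backward()  # an optimizer step would follow in a full training loop
```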
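The self-training step can be summarized schematically as follows: the fine-tuned model pseudo-labels unlabeled sentences, only high-confidence predictions are kept, and the model is retrained on the enlarged set. The confidence threshold and the train / predict_with_confidence helpers below are hypothetical stand-ins; the paper's actual selection criterion is not specified here.

```python
# A schematic sketch of semi-supervised self-training for sequence labeling.
# train() and predict_with_confidence() are hypothetical helpers, and the
# threshold is an assumed value, shown only to convey the loop's structure.
CONFIDENCE_THRESHOLD = 0.95  # assumed; the paper's criterion may differ

def self_train(model, labeled_data, unlabeled_data, rounds=3):
    for _ in range(rounds):
        model = train(model, labeled_data)  # supervised fine-tuning
        pseudo_labeled = []
        for sentence in unlabeled_data:
            tags, confidence = predict_with_confidence(model, sentence)
            if confidence >= CONFIDENCE_THRESHOLD:
                pseudo_labeled.append((sentence, tags))
        # Augment the training set with confident pseudo-labels and repeat.
        labeled_data = labeled_data + pseudo_labeled
    return model
```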
Publisher
Association for Computing Machinery (ACM)
Subject
Industrial and Manufacturing Engineering
Cited by
3 articles.
1. Optimizing Named Entity Recognition for Improving Logical Formulae Abstraction from Technical Requirements Documents. 2023 10th International Conference on Dependable Systems and Their Applications (DSA), 2023-08-10.
2. Long Text Classification Using Pre-trained Language Model for a Low-Resource Language. 2023 6th International Conference on Information and Computer Technologies (ICICT), 2023-03.
3. Self-Attention-based Data Augmentation Method for Text Classification. Proceedings of the 2023 15th International Conference on Machine Learning and Computing, 2023-02-17.