Abstract
Part-of-speech (POS) tagging, though often considered preliminary to other Natural Language Processing (NLP) tasks, is crucial to account for, especially in a low-resource language like Khasi, which lacks any formal corpus. Because POS tagging is context sensitive, the task is challenging. In this paper, we investigate a deep learning approach to the POS tagging problem in Khasi. A deep learning model, Robustly Optimized BERT Pretraining Approach (RoBERTa), is first pretrained on a language modelling task. We then create RoBERTa for POS (RoPOS) tagging, a model that performs POS tagging by fine-tuning the pretrained RoBERTa and leveraging its embeddings for the downstream task. The existing tagset designed specifically for the Khasi language is employed for this work, and the corresponding tagged dataset serves as our base corpus. We further propose additional tags for this tagset to meet the requirements of the language and have increased the size of the existing Khasi POS corpus. Other machine learning and deep learning models have also been trained and tested on the same task, and a comparative analysis of the various models is presented. Two different setups have been used for the RoPOS model, and the best testing accuracy achieved is 92 per cent. Comparative analysis indicates that RoPOS outperforms the other models when used for inference on texts outside the domain of the POS-tagged training dataset.
Publisher
Cambridge University Press (CUP)
References (28 articles)
1. A Hybrid POS Tagger for Khasi, an Under Resourced Language.
2. Liu, Y., Ott, M., Goyal, N., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv, abs/1907.11692.
3. Part-of-Speech Tagging of Odia Language Using Statistical and Deep Learning Based Approaches.
4. Bulusu (2019). Research on machine learning techniques for POS tagging in NLP. International Journal of Recent Technology and Engineering.
5. Bidirectional Grid Long Short-Term Memory (BiGridLSTM): A Method to Address Context-Sensitivity and Vanishing Gradient.