Affiliation:
1. Graduate School of Information Science and Technology, Department of Creative Informatics, The University of Tokyo, Tokyo 113-8654, Japan
2. Department of Computer Science & Information Technology, The University of Lahore, Lahore 54590, Pakistan
Abstract
Lemmatization aims at returning the root form of a word. The lemmatizer is envisioned as a vital instrument that can assist in many Natural Language Processing (NLP) tasks. These tasks include Information Retrieval, Word Sense Disambiguation, Machine Translation, Text Reuse, and Plagiarism Detection. Previous studies in the literature have focused on developing lemmatizers using rule-based approaches for English and other highly-resourced languages. However, there have been no thorough efforts for the development of a lemmatizer for most South Asian languages, specifically Urdu. Urdu is a morphologically rich language with many inflectional and derivational forms. This makes the development of an efficient Urdu lemmatizer a challenging task. A standardized lemmatizer would contribute towards establishing much-needed methodological resources for this low-resourced language, which are required to boost the performance of many Urdu NLP applications. This paper presents a lemmatization system for the Urdu language, based on a novel dictionary lookup approach. The contributions made through this research are the following: (1) the development of a large benchmark corpus for the Urdu language, (2) the exploration of the relationship between parts of speech tags and the lemmatizer, and (3) the development of standard approaches for an Urdu lemmatizer. Furthermore, we experimented with the impact of Part of Speech (PoS) on our proposed dictionary lookup approach. The empirical results showed that we achieved the best accuracy score of 76.44% through the proposed dictionary lookup approach.
Subject
Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science
Reference27 articles.
1. Toutanova, K., and Cherry, C. (2009, January 2–7). A global model for joint lemmatization and part-of-speech prediction. Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Singapore.
2. Bonatti, R., de Paula, A.G., Lamarca, V.S., and Cozman, F.G. (2016, January 12–13). Effect of part-of-speech and lemmatization filtering in email classification for automatic reply. Proceedings of the Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA.
3. Morphologically rich Urdu grammar parsing using Earley algorithm;Abbas;Nat. Lang. Eng.,2016
4. A survey on Urdu and Urdu like language stemmers and stemming techniques;Jabbar;Artif. Intell. Rev.,2018
5. Riaz, K. (2008, January 30). Concept search in Urdu. Proceedings of the 2nd PhD Workshop on Information and Knowledge Management, Napa Valley, CA, USA.
Cited by
2 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献