Affiliation:
1. Punjab University College of Information Technology, Lahore, Pakistan
Abstract
Named Entity Recognition (NER) plays a pivotal role in various natural language processing tasks, such as machine translation and automatic question-answering systems. Recognizing the importance of NER, a plethora of NER techniques for Western and Asian languages have been developed. However, despite having over 490 million Urdu language speakers worldwide, NER resources for Urdu are either non-existent or inadequate. To fill this gap, this article makes four key contributions. First, we have developed the largest Urdu NER corpus, which contains 926,776 tokens and 99,718 carefully annotated NEs. The developed corpus has at least doubled the number of manually tagged NEs as compared to any of the existing Urdu NER corpora. Second, we have generated six new word embeddings using three different techniques, fastText, Word2vec, and Glove, on two corpora of Urdu text. These are the only publicly available embeddings for the Urdu language, besides the recently released Urdu word embeddings by Facebook. Third, we have pioneered in the application of deep learning techniques, NN and RNN, for Urdu named entity recognition. Finally, we have performed 10-folds of 32 different experiments using the combinations of a traditional supervised learning and deep learning techniques, seven types of word embeddings, and two different Urdu NER datasets. Based on the analysis of the results, several valuable insights are provided about the effectiveness of deep learning techniques, the impact of word embeddings, and variations of datasets.
Publisher
Association for Computing Machinery (ACM)
Cited by
38 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献