Author:
Yadava Yogendra P.,Hardie Andrew,Lohani Ram Raj,Regmi Bhim N.,Gurung Srishtee,Gurung Amar,McEnery Tony,Allwood Jens,Hall Pat
Abstract
In this paper, we describe the construction of the 14-million-word Nepali National Corpus (NNC). This corpus includes both spoken and written data, the latter incorporating a Nepali match for FLOB and a broader collection of text. Additional resources within the NNC include parallel data (English–Nepali and Nepali–English) and a speech corpus. The NNC is encoded as Unicode text and marked up in CES-compatible XML. The whole corpus is also annotated with part-of-speech tags. We describe the process of devising a tagset and retraining tagger software for the Nepali language, for which there were no existing corpus resources. Finally, we explore some present and future applications of the corpus, including lexicography, NLP, and grammatical research.
Publisher
Edinburgh University Press
Subject
Linguistics and Language,Language and Linguistics
Cited by
15 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献