Building natural language processing tools for Runyakitara

Author:

Katushemererwe Fridah 1, Caines Andrew 2, Buttery Paula 3

Affiliation:

1. Department of Linguistics, Makerere University, Kampala, Uganda

2. Department of Theoretical & Applied Linguistics, University of Cambridge, Cambridge, UK

3. Computer Laboratory, University of Cambridge, Cambridge, UK

Abstract

This paper describes an endeavour to build natural language processing (NLP) tools for Runyakitara, a group of four closely related Bantu languages spoken in western Uganda. In contrast with major world languages such as English, for which corpora are comparatively abundant and NLP tools are well developed, computational linguistic resources for Runyakitara are in short supply. We therefore first need to collect corpora for these languages before we can proceed to design a spell-checker, a grammar-checker and applications for computer-assisted language learning (CALL). We explain how we are collecting primary data for a new Runya Corpus of speech and writing, outline the design of a morphological analyser, and discuss how we can use these new resources to build NLP tools. We are initially working with Runyankore–Rukiga, a closely related pair of Runyakitara languages, and we frame our project in the context of NLP for low-resource languages, as well as CALL for the preservation of endangered languages. We put our project forward as a test case for the revitalization of endangered languages through education and technology.

Funder

The research that resulted in this article was funded by the Cambridge-Africa Programme for Research Excellence (CAPREx) and the ALBORADA Fund, UK.

Publisher

Walter de Gruyter GmbH

Subject

Linguistics and Language, Language and Linguistics

