Building Text and Speech Benchmark Datasets and Models for Low‐Resourced East African Languages: Experiences and Lessons

Published:2024-03-26 Issue:2 Volume:5
ISSN:2689-5595
Container-title:Applied AI Letters
language:en
Short-container-title:Applied AI Letters

Author:

Nakatumba‐Nabende Joyce¹, Babirye Claire², Nabende Peter³, Tusubira Jeremy Francis², Mukiibi Jonathan², Wairagala Eric Peter², Mutebi Chodrine², Bateesa Tobius Saul², Nahabwe Alvin², Tusiime Hewitt², Katumba Andrew⁴

Affiliation:

1. Department of Computer Science, Makerere University, Kampala, Uganda

2. Makerere Artificial Intelligence Lab, Makerere University, Kampala, Uganda

3. Department of Information Systems, Makerere University, Kampala, Uganda

4. Department of Electrical and Computer Engineering, Makerere University, Kampala, Uganda

Abstract

Africa has over 2000 languages; however, those languages are not well represented in the existing natural language processing ecosystem. African languages lack essential digital resources to effectively engage in advancing language technologies. There is a need to generate high‐quality natural language processing resources for low‐resourced African languages. Obtaining high‐quality speech and text data is expensive and tedious because it can involve manual sourcing and verification of data sources. This paper discusses the process taken to curate and annotate text and speech datasets for five East African languages: Luganda, Runyankore‐Rukiga, Acholi, Lumasaba, and Swahili. We also present results obtained from baseline models for machine translation, topic modeling and classification, sentiment classification, and automatic speech recognition tasks. Finally, we discuss the experiences, challenges, and lessons learned in creating the text and speech datasets.

Publisher

Wiley

Link

https://onlinelibrary.wiley.com/doi/pdf/10.1002/ail2.92

Reference: 59 articles.

1. “Without English There Is No Future”: The Case of Language Attitudes and Ideologies in Uganda

2. L. Martinus and J. Z. Abbott, “A Focus on Neural Machine Translation for African Languages,” 2019, arXiv Preprint arXiv:1906.05685.

3. V. Marivate, T. Sefara, V. Chabalala, et al., “Investigating an Approach for Low Resource Language Dataset Creation, Curation and Classification: Setswana and Sepedi,” 2020, arXiv Preprint arXiv:2003.04986.

4. MasakhaNER: Named Entity Recognition for African Languages

Cited by 1 article.

1. Developing and Deploying End‐to‐End Machine Learning Systems for Social Impact: A Rubric and Practical Artificial Intelligence Case Studies From African Contexts;Applied AI Letters;2024-08-27