Authors:
Adewole Lawrence B, Adetunmbi Adebayo O, Alese Boniface K, Oluwadare Samuel A
Abstract
Recent methodologies in machine translation depend on the availability of large language corpora. The web, as a repository of text and other multimedia content, is a viable source of such data. However, text cleaning is needed as a pre-processing step, since foreign words are inevitably part of the harvested text. A dictionary lookup approach can be adopted for languages with a comprehensive lexicon, while manual cleaning is applied in other cases. Developing a full-coverage lexicon for the Yorùbá language is cumbersome because new words can be formed through elision, assimilation and contraction. In this paper, the morphology of the Yorùbá language was studied and modelled as a Finite State Machine (FSM) which accepts a word and returns true if the goal state is reached and false otherwise. The FSM model was implemented in Java. A Yorùbá dictionary containing 10,443 distinct words in their base form (i.e. without diacritics) and an English dictionary with 64,150 distinct words were run through the finite state machine. In addition, 58 web pages sourced from the internet were classified by the system. Classification of entries from the Yorùbá dictionary as valid Yorùbá words gave 99.99% accuracy, while classification of entries from the English dictionary as non-Yorùbá words gave 94.07% accuracy. Also, using a threshold of 90% valid Yorùbá words per web page, all 58 web pages were correctly classified. The results show that the approach can be applied reliably to the automatic harvesting of a Yorùbá monolingual corpus from the internet.
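The two-level idea in the abstract (an FSM that accepts or rejects a single word, followed by a page-level decision at a 90% threshold) can be illustrated with a minimal Java sketch. This is not the authors' model: the states, the transition rules and the names `YorubaWordFsm`, `accepts` and `isYorubaPage` are all assumptions made for illustration. The sketch only encodes a simplified view of Yorùbá syllable shape (open syllables, the digraph "gb", a syllabic "n", and no c, q, v, x or z) and assumes the input has already been stripped of diacritics.

```java
import java.util.Locale;

// Hypothetical sketch: a simplified syllable-shape FSM, NOT the paper's actual model.
final class YorubaWordFsm {
    // OPEN is the only accepting (goal) state: the word so far ends in a vowel or "n".
    private enum State { START, OPEN, CONS, G, REJECT }

    private static final String VOWELS = "aeiou";
    // Consonant letters of the Yoruba alphabet without diacritics (assumed set).
    private static final String CONSONANTS = "bdfghjklmnprstwy";

    private static State step(State s, char ch) {
        boolean vowel = VOWELS.indexOf(ch) >= 0;
        boolean consonant = CONSONANTS.indexOf(ch) >= 0;
        switch (s) {
            case START:
            case OPEN:
                if (vowel || ch == 'n') return State.OPEN; // vowel or syllabic/onset "n"
                if (ch == 'g') return State.G;             // may start the digraph "gb"
                return consonant ? State.CONS : State.REJECT;
            case G:
                if (ch == 'b') return State.CONS;          // "gb" acts as one consonant
                return vowel ? State.OPEN : State.REJECT;
            case CONS:
                return vowel ? State.OPEN : State.REJECT;  // a consonant needs a vowel
            default:
                return State.REJECT;
        }
    }

    // Returns true if the word drives the machine to the goal state.
    static boolean accepts(String word) {
        State s = State.START;
        for (char ch : word.toLowerCase(Locale.ROOT).toCharArray()) {
            s = step(s, ch);
            if (s == State.REJECT) return false;
        }
        return s == State.OPEN;
    }

    // Page-level decision: fraction of accepted tokens compared with a threshold.
    static boolean isYorubaPage(String text, double threshold) {
        int total = 0;
        int valid = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) continue; // split can yield a leading empty token
            total++;
            if (accepts(token)) valid++;
        }
        return total > 0 && (double) valid / total >= threshold;
    }

    public static void main(String[] args) {
        System.out.println(accepts("gbogbo"));
        System.out.println(accepts("school"));
        System.out.println(isYorubaPage("ile mi ni", 0.90));
    }
}
```

A toy machine like this accepts some English words that happen to fit the syllable pattern (for example "banana"), which is in line with the abstract reporting lower accuracy on the English dictionary than on the Yorùbá one. A real system would need the full morphological model described in the paper.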
Publisher
Faculty of Engineering, Federal University Oye-Ekiti
Cited by
4 articles.
1. Bilingual Neural Machine Translation From English To Yoruba Using A Transformer Model;International Journal of Innovative Science and Research Technology (IJISRT);2024-07-26
2. A State-of-the-Art Review of Nigerian Languages Natural Language Processing Research;Research Anthology on Applied Linguistics and Language Practices;2022-04-01
3. A State-of-the-Art Review of Nigerian Languages Natural Language Processing Research;Advances in IT Standards and Standardization Research;2021
4. Automatic Vowel Elision Resolution in Yorùbá Language;Conference of the South African Institute of Computer Scientists and Information Technologists 2020;2020-09-11