Low resource Twi-English parallel corpus for machine translation in multiple domains (Twi-2-ENG)

Published:2024-07-05 Issue:1 Volume:27
ISSN:2948-2992
Container-title:Discover Computing
language:en
Short-container-title:Discov Computing

Author:

Agyei Emmanuel, Zhang Xiaoling, Bannerman Stephen, Quaye Ama Bonuah, Yussi Sophyani Banaamwini, Agbesi Victor Kwaku

Abstract

Although Ghana does not have one unique language for its citizens, the Twi dialect stands a chance of fulfilling this purpose. Twi is among the low-resourced language categories, yet it is widely spoken beyond Ghana, in countries such as the Ivory Coast, Benin, and Nigeria. However, it continues to be seen as the perfect resource for Twi Machine Translation (MT) of ISO 639-3. The issue with the Twi-English parallel corpus is evident at the multiple-domain dataset level, partly due to the complex design structure and the scarcity of a digital Twi lexicon. This study introduced Twi-2-ENG, a large-scale, multiple-domain Twi-to-English parallel corpus, together with a digital Twi dictionary and a lexicon version of Twi. It also drew on the Ghanaian Parliamentary Hansards, crowdsourcing, and digital Ghanaian news portals to gather the English sentences. Our crawled news portals accumulated 5,765 parallel corpus sentences, the Twi New Testament Bible, and social media platforms. The data-gathering method, covering translation, compilation, tokenization, and the final alignment of the Twi-English parallel sentences, as well as the technology employed in compiling and hosting the corpus, is discussed. The results reveal that contributions from qualified linguistic professionals and Twi translation specialists across the media spectrum, academia, and well-wishers add considerable volume to the Twi-2-ENG parallel corpus. Finally, all the sentences were curated with the help of a corpus manager, Sketch Engine, linguists, and professional translators to align and tokenize all texts, allowing professional Twi linguists to evaluate the corpus.

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s10791-024-09451-8.pdf
