Low resource Twi-English parallel corpus for machine translation in multiple domains (Twi-2-ENG)

Authors:

Agyei Emmanuel, Zhang Xiaoling, Bannerman Stephen, Quaye Ama Bonuah, Yussi Sophyani Banaamwini, Agbesi Victor Kwaku

Abstract

Although Ghana does not have one unique language for its citizens, the Twi dialect stands a chance of fulfilling this purpose. Twi is among the low-resourced languages, yet it is widely spoken in Ghana and beyond, in countries such as the Ivory Coast, Benin, Nigeria, and other places. However, it continues to be seen as the perfect resource for Twi Machine Translation (MT) of ISO 639-3. The issue with the Twi-English parallel corpus is evident at the multiple-domain dataset level, partly due to the complex design structure and scarcity of the digital Twi lexicon. This study introduced Twi-2-ENG, a large-scale, multiple-domain Twi-to-English parallel corpus, together with a digital Twi dictionary and a lexicon version of Twi. It also employed the Ghanaian Parliamentary Hansards, crowdsourcing, and digital Ghanaian news portals to crawl all the English sentences. Our crawled news portals accumulated 5,765 parallel corpus sentences, alongside the Twi New Testament Bible and social media platforms. The data-gathering method, the means of translation, compilation, tokenization, and final alignment of the Twi-English parallel sentences, and the technology employed in compiling and hosting the corpus are duly discussed. The results reveal that manual work by qualified linguistic professionals and Twi translation specialists from across the media spectrum, academia, and well-wishers adds considerable volume to the Twi-2-ENG parallel corpus. Finally, all the sentences were curated with the help of a corpus manager, Sketch Engine, linguists, and professional translators to align and tokenize all texts, allowing professional Twi linguists to evaluate the corpus.

Publisher

Springer Science and Business Media LLC
