Pseudotext Injection and Advance Filtering of Low-Resource Corpus for Neural Machine Translation

Authors:

Adjeisah Michael¹, Liu Guohua¹, Nyabuga Douglas Omwenga¹, Nortey Richard Nuetey², Song Jinling³

Affiliations:

1. School of Computer Science and Technology, Donghua University, Shanghai, China

2. School of Information Science and Technology, Donghua University, Shanghai, China

3. School of Mathematics and Information Technology, Hebei Normal University of Science & Technology, Qinhuangdao, Hebei, China

Abstract

Scaling natural language processing (NLP) to low-resource languages to improve machine translation (MT) performance remains challenging. This research contributes to the domain through low-resource English-Twi translation based on filtered synthetic parallel corpora. It is often difficult to determine what a good-quality corpus looks like in low-resource conditions, particularly where the target-side corpus is the only available sample text of the language pair. To improve MT performance for such low-resource language pairs, we propose expanding the training data by injecting a synthetic parallel corpus obtained by translating a monolingual corpus from the target language via bootstrapping with different parameter settings. Furthermore, we perform unsupervised measurements on each sentence pair using squared Mahalanobis distances, a filtering technique that predicts sentence parallelism. Additionally, we make extensive use of three different sentence-level similarity metrics after round-trip translation. Experimental results on varying amounts of available parallel corpus demonstrate that injecting a pseudoparallel corpus and extensive filtering with sentence-level similarity metrics significantly improve the original out-of-the-box MT systems for low-resource language pairs. Compared with existing improvements on the same original framework under the same structure, our approach achieves substantial gains in BLEU and TER scores.
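The squared-Mahalanobis filtering described in the abstract can be illustrated with a minimal sketch. The idea is to score each sentence pair with a small feature vector, then keep the pairs whose squared Mahalanobis distance from the bulk of the data is below a cutoff. The feature set, `keep_fraction` parameter, and `filter_pairs` helper below are illustrative assumptions, not the authors' exact method:

```python
import numpy as np

def squared_mahalanobis(features: np.ndarray) -> np.ndarray:
    """Squared Mahalanobis distance of each row from the sample mean.

    `features` is an (n_pairs, n_features) matrix of per-sentence-pair
    scores (e.g. length ratio, lexical overlap) — an illustrative
    feature set, not necessarily the paper's choice.
    """
    mu = features.mean(axis=0)
    cov = np.atleast_2d(np.cov(features, rowvar=False))
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse for numerical stability
    diff = features - mu
    # d2[i] = diff[i] @ cov_inv @ diff[i]
    return np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

def filter_pairs(pairs, features, keep_fraction=0.8):
    """Keep the sentence pairs closest to the bulk of the distribution;
    `keep_fraction` is a hypothetical tuning parameter."""
    d2 = squared_mahalanobis(np.asarray(features, dtype=float))
    cutoff = np.quantile(d2, keep_fraction)
    return [p for p, d in zip(pairs, d2) if d <= cutoff]
```

Pairs that score as outliers (e.g. badly misaligned or machine-garbled translations) receive large distances and are dropped, which matches the abstract's use of the distance as an unsupervised predictor of sentence parallelism.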

Funder

Development of Shanghai Industrial Internet

Publisher

Hindawi Limited

Subject

General Mathematics, General Medicine, General Neuroscience, General Computer Science

References (37 articles)

1. Benchmarking neural machine translation for southern African languages;L. Martinus

2. Six challenges for neural machine translation;P. Koehn

3. Exploiting source-side monolingual data in neural machine translation;J. Zhang

4. Improving neural machine translation models with monolingual data;R. Sennrich

5. Measuring sentence parallelism using Mahalanobis distances: the NRC unsupervised submissions to the WMT18 parallel corpus filtering shared task;P. Littell

Cited by 7 articles.

1. Joint multimodal sentiment analysis based on information relevance;Information Processing & Management;2023-03

2. Towards data augmentation in graph neural network: An overview and evaluation;Computer Science Review;2023-02

3. Low-Resource Neural Machine Translation: A Systematic Literature Review;IEEE Access;2023

4. Design of Interactive English-Chinese Machine Translation System Based on Mobile Internet;2022 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC);2022-04-14

5. Synchronously Improving Multi-user English Translation Ability by Using AI;International Journal on Artificial Intelligence Tools;2021-11-20
