The neural machine translation models for the low-resource Kazakh–English language pair

Published: 2023-02-08 Volume: 9 Page: e1224
ISSN: 2376-5992
Container-title: PeerJ Computer Science
Language: en

Author:

Karyukin Vladislav¹, Rakhimova Diana¹,², Karibayeva Aidana¹, Turganbayeva Aliya¹, Turarbek Asem¹

Affiliation:

1. Department of Information Systems, Al-Farabi Kazakh National University, Almaty, Kazakhstan

2. Institute of Information and Computational Technologies, Almaty, Kazakhstan

Abstract

The development of machine translation was driven by people’s need to communicate globally by automatically translating words, sentences, and texts from one language into another. Neural machine translation has become one of the most significant approaches in recent years. It requires large parallel corpora, which are not available for low-resource languages such as Kazakh, and this makes it difficult for neural machine translation models to achieve high performance. This article explores existing methods for dealing with low-resource languages by artificially increasing the size of the corpora and improving the performance of Kazakh–English machine translation models: forward translation, backward translation, and transfer learning. The Sequence-to-Sequence (recurrent neural network and bidirectional recurrent neural network) and Transformer neural machine translation architectures, with their features and specifications, are then considered for experiments in training models on parallel corpora. The experimental part focuses on building models for high-quality translation of formal social, political, and scientific texts. Synthetic parallel sentences are generated from existing monolingual Kazakh data using the forward translation approach and combined with parallel corpora parsed from official government websites. The recurrent neural network (RNN), bidirectional recurrent neural network (BRNN), and Transformer models of the OpenNMT framework are trained on the resulting corpus of 380,000 parallel Kazakh–English sentences. The quality of the trained models is evaluated with the BLEU, WER, and TER metrics, and sample translations are also analyzed. The RNN and BRNN models produced more precise translations than the Transformer model. The Byte-Pair Encoding tokenization technique yielded better metric scores and translations than word tokenization. The bidirectional recurrent neural network with Byte-Pair Encoding showed the best performance, with 0.49 BLEU, 0.51 WER, and 0.45 TER.
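To make the reported scores easier to read: for BLEU a higher value is better, while WER and TER are error rates, so lower is better. The following is a minimal sketch of how a sentence-level word error rate is commonly computed, as the word-level edit distance between a hypothesis and a reference divided by the reference length. It is an illustration only, not the evaluation script used in the article; the function name `word_error_rate` and the whitespace tokenization are assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch: tokenization is a plain whitespace split, which is
    an assumption and not necessarily what the article's authors used.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)
```

For example, `word_error_rate("a b c d", "a x c")` counts one substitution and one deletion against four reference words. A corpus-level score is usually obtained by summing the edit distances and the reference lengths over all sentences before dividing, rather than by averaging per-sentence rates.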

Funder

Ministry of Science and Higher Education of the Republic of Kazakhstan

Publisher

PeerJ

Subject

General Computer Science

Link

https://peerj.com/articles/cs-1224.pdf
