Affiliation:
1. Boğaziçi University, Sarıyer, İstanbul, Turkey
Abstract
The success of neural networks in natural language processing has paved the way for neural machine translation (NMT), which has rapidly become the mainstream approach to machine translation. Breakthroughs such as encoder-decoder networks, the attention mechanism, and the Transformer architecture have brought significant improvements in translation performance. However, the need for large amounts of parallel training data and the presence of rare words in translation corpora remain open problems. In this article, we address NMT for the low-resource Turkish-English language pair. We employ state-of-the-art NMT architectures and data augmentation methods that exploit monolingual corpora. We highlight the importance of input representation for the morphologically rich Turkish language and present a comprehensive analysis of linguistically and non-linguistically motivated input segmentation approaches. We demonstrate the effectiveness of morphologically motivated input segmentation for Turkish and show that the Transformer architecture outperforms attentional encoder-decoder models on the Turkish-English pair. Among the data augmentation approaches we employ, back-translation is the most effective, confirming the benefit of additional parallel data for translation quality. Overall, this work presents a comprehensive analysis of NMT architectures with different hyperparameters, data augmentation methods, and input representation techniques, and proposes ways of tackling the low-resource setting of Turkish-English NMT.
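To make the segmentation comparison in the abstract concrete, the sketch below illustrates one non-linguistically motivated input segmentation approach: unsupervised subword segmentation. The abstract does not name a specific tool, so this is a minimal sketch using the SentencePiece library as a stand-in; the file names, vocabulary size, and example sentence are illustrative assumptions, not values from the article.

```python
# Minimal sketch: unsupervised (non-linguistically motivated) subword
# segmentation with SentencePiece. Corpus path, model prefix, and
# vocabulary size are illustrative assumptions, not the article's settings.
import sentencepiece as spm

# Train a BPE subword model on the Turkish side of the training corpus.
spm.SentencePieceTrainer.train(
    input="train.tr",        # hypothetical corpus file, one sentence per line
    model_prefix="tr_bpe",   # hypothetical output name -> tr_bpe.model
    vocab_size=16000,        # illustrative vocabulary size
    model_type="bpe",
)

# Segment a sentence into subword units before feeding it to the NMT model.
sp = spm.SentencePieceProcessor(model_file="tr_bpe.model")
pieces = sp.encode("Kitapları masaya koydum.", out_type=str)
print(pieces)  # e.g. ['▁Kitap', 'ları', '▁masa', 'ya', '▁koy', 'dum', '.']
```

A linguistically motivated alternative would instead split words at morpheme boundaries produced by a morphological analyzer (for Turkish, a tool such as Zemberek), which the article compares against data-driven segmentations of this kind.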
Publisher
Association for Computing Machinery (ACM)