Unsupervised Parallel Sentences of Machine Translation for Asian Language Pairs

Published:2023-03-10 Issue:3 Volume:22 Page:1-14
ISSN:2375-4699
Container-title:ACM Transactions on Asian and Low-Resource Language Information Processing
language:en
Short-container-title:ACM Trans. Asian Low-Resour. Lang. Inf. Process.

Author:

Zhu Shaolin¹, Mi Chenggang², Li Tianqi¹, Yang Yong³, Xu Chun⁴

Affiliation:

1. Zhengzhou University of Light Industry, Zhengzhou, Henan, China

2. Northwestern Polytechnical University, School of Computer Science; Foreign Language and Literature Institute, Xi’an International Studies University, China

3. Xinjiang Normal University, China

4. Xinjiang University of Finance and Economics, China

Abstract

Parallel sentence pairs play a very important role in many natural language processing tasks, especially cross-lingual tasks such as machine translation. So far, many Asian language pairs lack bilingual parallel sentences. Because collecting bilingual parallel data is very time-consuming and difficult, obtaining such data automatically is very important for many low-resource Asian language pairs. While existing methods have shown encouraging results, they either rely heavily on bilingual data or have drawbacks in an unsupervised setting. To address these issues, we propose a new unsupervised similarity calculation and dynamic selection metric to obtain parallel sentence pairs in an unsupervised setting. First, our method maps bilingual word embeddings by postdoc adversarial training, which rotates the source space to match the target without parallel data. Then, we introduce a new cross-domain similarity adaption to obtain parallel sentence pairs. Experimental results on real-world datasets show that our model obtains better accuracy and recall when mining parallel sentence pairs. We also show that the extracted bilingual sentence corpora can significantly improve the performance of neural machine translation.
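
The two steps named in the abstract (rotating the source space onto the target, then scoring candidate pairs with a neighbourhood-adjusted similarity) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the similarity or the selection rule, so the sketch substitutes the well-known CSLS score (cross-domain similarity local scaling) and a mutual-best-match rule with a threshold. The rotation matrix `W` is assumed to be already learned (the adversarial training itself is not shown), and the names `csls_scores`, `mine_pairs`, `k`, and `threshold` are hypothetical.

```python
import numpy as np

def normalize(m):
    # Scale each row to unit length so dot products equal cosine similarities.
    return m / np.linalg.norm(m, axis=1, keepdims=True)

def csls_scores(src, tgt, k=2):
    # CSLS(x, y) = 2*cos(x, y) - r_src(x) - r_tgt(y), where r_src(x) is the mean
    # cosine of x to its k nearest targets and r_tgt(y) the mean cosine of y to
    # its k nearest sources. This penalises "hub" vectors that are close to everything.
    src, tgt = normalize(src), normalize(tgt)
    cos = src @ tgt.T
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # one value per source row
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # one value per target row
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

def mine_pairs(src, tgt, W, k=2, threshold=0.0):
    # Assumption: W is an orthogonal map (a rotation) from source to target space,
    # applied to row vectors as src @ W.
    scores = csls_scores(src @ W, tgt, k)
    fwd = scores.argmax(axis=1)  # best target for each source
    bwd = scores.argmax(axis=0)  # best source for each target
    # Keep a pair only if each side picks the other and the score clears the threshold.
    return [(i, int(j)) for i, j in enumerate(fwd)
            if bwd[j] == i and scores[i, j] > threshold]

# Toy data: the "source" vectors are a shuffled, rotated copy of the targets.
rng = np.random.default_rng(0)
tgt = rng.normal(size=(6, 16))
Q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
perm = rng.permutation(6)
src = tgt[perm] @ Q.T                           # hide the alignment behind a rotation
pairs = mine_pairs(src, tgt, W=Q)
print(pairs)
```

In the toy data the true rotation is known, so passing `W=Q` undoes it exactly; in the unsupervised setting the abstract describes, `W` would instead come from training without parallel data, and the threshold would need tuning on real sentence embeddings.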

Funder

Zhengzhou University of Light Industry Doctoral Research

Xinjiang Autonomous Region University Research Plan

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/3486677

References: 43 articles.

1. Sarath Chandar, Stanislas Lauly, Hugo Larochelle, Mitesh Khapra, Balaraman Ravindran, Vikas C. Raykar, and Amrita Saha. 2014. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems. MIT Press, 1853–1861.

2. An Effective Approach to Unsupervised Machine Translation

3. Margin-based Parallel Corpus Mining with Multilingual Sentence Embeddings

4. Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond

5. Antonio Valerio Miceli-Barone. 2016. Towards cross-lingual distributed representations without parallel text trained with adversarial autoencoders. Retrieved from arXiv:1608.02996.

Cited by 2 articles.

1. Fine Tuning Language Models: A Tale of Two Low-Resource Languages;Data Intelligence;2024-07-01

2. Challenges in Corpus Construction for Thai-English Machine Translation;Intelligent Computing & Optimization;2022-10-21