Unsupervised Parallel Sentences of Machine Translation for Asian Language Pairs

Author:

Zhu Shaolin 1, Mi Chenggang 2, Li Tianqi 1, Yang Yong 3, Xu Chun 4

Affiliation:

1. Zhengzhou University of Light Industry, Zhengzhou, Henan, China

2. Northwestern Polytechnical University, School of Computer Science; Foreign Language and Literature Institute, Xi’an International Studies University, China

3. Xinjiang Normal University, China

4. Xinjiang University of Finance and Economics, China

Abstract

Parallel sentence pairs play an important role in many natural language processing tasks, especially cross-lingual tasks such as machine translation. Many Asian language pairs still lack bilingual parallel sentences. Because collecting bilingual parallel data by hand is time-consuming and difficult, mining such data automatically is especially important for low-resource Asian language pairs. While existing methods have shown encouraging results, they either rely heavily on bilingual data or have drawbacks in an unsupervised setting. To address these issues, we propose a new unsupervised similarity calculation and a dynamic selection metric to obtain parallel sentence pairs in an unsupervised setting. First, our method maps bilingual word embeddings by postdoc adversarial training, which rotates the source space to match the target space without parallel data. Then, we introduce a new cross-domain similarity adaption to obtain parallel sentence pairs. Experimental results on real-world datasets show that our model achieves better accuracy and recall when mining parallel sentence pairs. We also show that the extracted bilingual sentence corpora can significantly improve the performance of neural machine translation.
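The two steps named in the abstract (rotating the source embedding space onto the target space, then scoring candidate pairs with a cross-domain similarity) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the rotation has already been learned and applied, and it substitutes the CSLS score (cross-domain similarity local scaling) known from the cross-lingual embedding literature for the paper's own similarity measure, which the abstract does not specify. The function names `csls_scores` and `mine_pairs`, the neighbourhood size `k`, and the mutual-best selection rule are all hypothetical choices made for illustration.

```python
import numpy as np

def normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def csls_scores(src, tgt, k=2):
    """CSLS-style score matrix (assumed stand-in for the paper's similarity).

    src: (n_src, d) sentence embeddings already mapped into the target space.
    tgt: (n_tgt, d) target-language sentence embeddings.
    """
    cos = normalize(src) @ normalize(tgt).T          # (n_src, n_tgt) cosines
    # Mean cosine of each source sentence to its k nearest target sentences.
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)
    # Mean cosine of each target sentence to its k nearest source sentences.
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)
    # Penalize sentences that are close to everything ("hubs").
    return 2 * cos - r_src[:, None] - r_tgt[None, :]

def mine_pairs(src, tgt, k=2):
    """Keep (source, target) index pairs that are each other's best match."""
    scores = csls_scores(src, tgt, k)
    best_tgt = scores.argmax(axis=1)   # best target for every source
    best_src = scores.argmax(axis=0)   # best source for every target
    return [(i, int(j)) for i, j in enumerate(best_tgt) if best_src[j] == i]
```

Calling `mine_pairs(mapped_src, tgt)` returns a list of index pairs; a fixed score threshold could replace the mutual-best rule, and the dynamic selection metric the paper proposes would take the place of either.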

Funder

Zhengzhou University of Light Industry Doctoral Research

Xinjiang Autonomous Region University Research Plan

National Natural Science Foundation of China

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science


Cited by 2 articles.