A Chinese–Kazakh Translation Method That Combines Data Augmentation and R-Drop Regularization-Reference-Cited by-同舟云学术

A Chinese–Kazakh Translation Method That Combines Data Augmentation and R-Drop Regularization

Published:2023-09-22 Issue:19 Volume:13 Page:10589
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Liu Canglan¹²³^ORCID,Silamu Wushouer¹²³,Li Yanbing¹²³

Affiliation:

1. College of Computer Science and Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China

2. Xinjiang Laboratory of Multi-Language Information Technology, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China

3. Xinjiang Multilingual Information Technology Research Center, Xinjiang University, No. 777 Huarui Street, Urumqi 830017, China

Abstract

Low-resource languages often face the problem of insufficient data, which leads to poor quality in machine translation. One approach to address this issue is data augmentation. Data augmentation involves creating new data by transforming existing data through methods such as flipping, cropping, rotating, and adding noise. Traditionally, pseudo-parallel corpora are generated by randomly replacing words in low-resource language machine translation. However, this method can introduce ambiguity, as the same word may have different meanings in different contexts. This study proposes a new approach for low-resource language machine translation, which involves generating pseudo-parallel corpora by replacing phrases. The performance of this approach is compared with other data augmentation methods, and it is observed that combining it with other data augmentation methods further improves performance. To enhance the robustness of the model, R-Drop regularization is also used. R-Drop is an effective method for improving the quality of machine translation. The proposed method was tested on Chinese–Kazakh (Arabic script) translation tasks, resulting in performance improvements of 4.99 and 7.7 for Chinese-to-Kazakh and Kazakh-to-Chinese translations, respectively. By combining the generation of pseudo-parallel corpora through phrase replacement with the application of R-Drop regularization, there is a significant advancement in machine translation performance for low-resource languages.

Funder

National Natural Science Foundation of China Joint Fund

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science

Link

https://www.mdpi.com/2076-3417/13/19/10589/pdf

Reference52 articles.

1. Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv.

2. Luong, M.T., Pham, H., and Manning, C.D. (2015). Effective approaches to attention-based neural machine translation. arXiv.

3. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., and Polosukhin, I. (2017, January 4–9). Attention is all you need. Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

4. Wu, Y., Schuster, M., Chen, Z., Le, Q.V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., and Macherey, K. (2016). Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv.

5. Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y.N. (2017, January 6–11). Convolutional sequence to sequence learning. Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia.