UPC: An Open Word-Sense Annotated Parallel Corpora for Machine Translation Study

Published:2020-06-04 Issue:11 Volume:10 Page:3904
ISSN:2076-3417
Container-title:Applied Sciences
language:en
Short-container-title:Applied Sciences

Author:

Vu Van-Hai,Nguyen Quang-Phuoc,Shin Joon-Choul,Ock Cheol-Young

Abstract

Machine translation (MT) has recently attracted much research on various advanced techniques (i.e., statistical and deep learning-based methods) and has achieved great results for popular languages. However, research involving low-resource languages such as Korean often suffers from a lack of openly available bilingual language resources. In this research, we built open, extensive parallel corpora for training MT models, named the Ulsan parallel corpora (UPC). Currently, UPC contains two parallel corpora: a Korean-English dataset and a Korean-Vietnamese dataset. The Korean-English dataset has over 969 thousand sentence pairs, and the Korean-Vietnamese dataset has over 412 thousand sentence pairs. Furthermore, the high rate of homographs in Korean causes a word ambiguity problem in MT. To address this problem, we developed a powerful word-sense annotation system, named UTagger, based on a combination of sub-word conditional probability and knowledge-based methods. We applied UTagger to UPC and used these corpora to train both statistical and deep learning-based (neural) MT systems. The experimental results demonstrated that high-quality MT systems, as measured by the Bi-Lingual Evaluation Understudy (BLEU) and Translation Error Rate (TER) scores, can be built using UPC. Both UPC and UTagger are available for free download and use.

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes,Computer Science Applications,Process Chemistry and Technology,General Engineering,Instrumentation,General Materials Science

Link

https://www.mdpi.com/2076-3417/10/11/3904/pdf

References: 52 articles.

Cited by 5 articles.

1. Low-Resource Neural Machine Translation: A Systematic Literature Review;IEEE Access;2023

2. Korean-Vietnamese Neural Machine Translation at Sub-word Level;Advances in Computer Science and Ubiquitous Computing;2023

3. Special Issue on Machine Learning and Natural Language Processing;Applied Sciences;2022-09-05

4. Word Sense Disambiguation Using Clustered Sense Labels;Applied Sciences;2022-02-11

5. Improving the Performance of Vietnamese–Korean Neural Machine Translation with Contextual Embedding;Applied Sciences;2021-11-23