Affiliation
1. Tokyo Metropolitan University, Japan
2. Tokyo Institute of Technology, Japan
Abstract
South and North Korea both use the Korean language. However, Korean natural language processing (NLP) research has mostly focused on the South Korean variety, so existing Korean NLP systems, such as neural machine translation (NMT) systems, cannot properly process North Korean input. Training a model on North Korean data is the most straightforward solution, but the data available for training NMT models are insufficient. To address this, we constructed a parallel corpus for developing a North Korean NMT model from a comparable corpus. We manually aligned parallel sentences to create evaluation data and automatically aligned the remaining sentences to create training data. We trained a North Korean NMT model on our North Korean parallel data and improved its translation quality using South Korean resources such as parallel data and a pre-trained model. In addition, we propose two Korean-specific pre-processing methods, character tokenization and phoneme decomposition, to use the South Korean resources more efficiently. We demonstrate that phoneme decomposition consistently improves North Korean translation accuracy compared with other pre-processing methods.
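The phoneme decomposition mentioned in the abstract splits each precomposed Hangul syllable into its constituent jamo (initial consonant, vowel, optional final consonant), so South and North Korean spellings can share fine-grained sub-character units. The following is a minimal Python sketch of that idea, assuming the standard Unicode Hangul syllable arithmetic; it is an illustration, not the authors' actual pre-processing pipeline.

# Unicode compatibility jamo tables for the three syllable positions.
CHOSEONG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
JUNGSEONG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
JONGSEONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def decompose(text: str) -> str:
    """Replace each precomposed Hangul syllable with its jamo sequence."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:  # within the precomposed syllable block U+AC00..U+D7A3
            lead, rest = divmod(code, 588)   # 588 = 21 vowels * 28 finals
            vowel, tail = divmod(rest, 28)
            out.append(CHOSEONG[lead] + JUNGSEONG[vowel] + JONGSEONG[tail])
        else:                  # pass non-Hangul characters through unchanged
            out.append(ch)
    return "".join(out)

print(decompose("조선말"))  # -> ㅈㅗㅅㅓㄴㅁㅏㄹ

Character tokenization, by contrast, would simply treat each syllable as one token; decomposition goes one level deeper, which is why shared phonemes across the two varieties can transfer more information from South Korean resources.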
Funder
TMU research fund for young scientists and JST
JSPS KAKENHI
Publisher
Association for Computing Machinery (ACM)
Cited by
1 article.
1. Knowledge-Aware Prompt Learning Framework for Korean-Chinese Microblog Sentiment Analysis; ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 2024-04-14