Affiliation:
1. Kyoto University, Japan
2. National Institute of Information and Communications Technology, Japan
Abstract
Sub-word segmentation is an essential pre-processing step for Neural Machine Translation (NMT). Existing work has shown that neural sub-word segmenters outperform Byte-Pair Encoding (BPE); however, they are inefficient, as they require parallel corpora, days to train, and hours to decode. This article introduces SelfSeg, a self-supervised neural sub-word segmentation method that is much faster to train and decode and requires only monolingual dictionaries instead of parallel corpora. SelfSeg takes as input a word in the form of a partially masked character sequence, optimizes the word generation probability, and generates the segmentation with the maximum posterior probability, which is calculated with a dynamic programming algorithm. The training time of SelfSeg depends on word frequencies, and we explore several word-frequency normalization strategies to accelerate the training phase. Additionally, we propose a regularization mechanism that allows the segmenter to generate various segmentations for one word. To show the effectiveness of our approach, we conduct MT experiments in low-, middle-, and high-resource scenarios, comparing the performance of different segmentation methods. The experimental results demonstrate that, on the low-resource ALT dataset, our method improves over BPE and SentencePiece by more than 1.2 BLEU points, and over Dynamic Programming Encoding (DPE) and Vocabulary Learning via Optimal Transport (VOLT) by 1.1 BLEU points, on average. The regularization method achieves improvements of approximately 4.3 BLEU points over BPE and 1.2 BLEU points over BPE-dropout, the regularized version of BPE. We also observe significant improvements on the IWSLT15 Vi→En, WMT16 Ro→En, and WMT15 Fi→En datasets and competitive results on the WMT14 De→En and WMT14 Fr→En datasets. Furthermore, our method is 17.8× faster during training and up to 36.8× faster during decoding in a high-resource scenario compared with DPE. We provide extensive analysis, including why monolingual word-level data is enough to train SelfSeg.
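The decoding step described in the abstract, picking the segmentation with the maximum posterior probability via dynamic programming, can be illustrated with a short sketch. The snippet below is a minimal Viterbi-style illustration rather than the paper's implementation: the hypothetical log_prob(subword) function stands in for the scores that SelfSeg's masked character-sequence model would produce, and max_len and the toy frequency table are assumptions made purely for the example.

import math

def best_segmentation(word, log_prob, max_len=8):
    """Dynamic program: among all ways of splitting `word` into subwords
    of length <= max_len, return the split with the highest summed
    subword log-probability. `log_prob` is a stand-in scorer; SelfSeg
    itself obtains scores from a neural model trained on partially
    masked character sequences."""
    n = len(word)
    best = [-math.inf] * (n + 1)   # best[i]: best score for word[:i]
    best[0] = 0.0
    back = [0] * (n + 1)           # back[i]: start of the last subword ending at i
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + log_prob(word[j:i])
            if score > best[i]:
                best[i], back[i] = score, j
    pieces, i = [], n              # recover the argmax split via backpointers
    while i > 0:
        pieces.append(word[back[i]:i])
        i = back[i]
    return list(reversed(pieces))

# Toy usage with a frequency-based stand-in scorer (assumed counts).
toy_counts = {"un": 50, "segment": 40, "able": 60, "u": 5, "n": 5}
total = sum(toy_counts.values())
score = lambda s: math.log(toy_counts.get(s, 0.5) / total)
print(best_segmentation("unsegmentable", score))  # ['un', 'segment', 'able']

In this toy run the known subwords receive higher log-probabilities than unknown spans, so the dynamic program prefers un / segment / able over character-by-character or whole-word splits; in SelfSeg the same search is driven by the neural model's posterior instead of raw counts.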
Funder
JSPS KAKENHI
Young Scientists
JSPS Research Fellow for Young Scientists
Publisher
Association for Computing Machinery (ACM)