Abstract
In this study, we propose a simple and effective preprocessing method for subword segmentation based on a data compression algorithm. Compression-based subword segmentation has recently attracted significant attention as a preprocessing method for training data in neural machine translation. Among these approaches, BPE/BPE-dropout is one of the fastest and most effective compared with conventional approaches; however, compression-based approaches have the drawback that generating multiple segmentations is difficult because they are deterministic. To overcome this difficulty, we focus on a stochastic string algorithm called locally consistent parsing (LCP), which has been applied to achieve optimal compression. Employing the stochastic parsing mechanism of LCP, we propose LCP-dropout, a multiple subword segmentation method that improves on BPE/BPE-dropout, and we show that it outperforms various baselines, especially when learning from small training data.
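To make the determinism issue concrete, the following minimal sketch shows how a BPE-style segmenter yields a single segmentation per word, and how randomly skipping merges (the idea behind BPE-dropout) yields several. It illustrates the BPE/BPE-dropout baseline only, not LCP-dropout itself. The function name `segment`, the toy merge table `MERGES`, and the parameter names are assumptions made for illustration and are not taken from the paper.

```python
import random

# Hypothetical toy merge table, highest priority first (not from the paper).
MERGES = [("l", "o"), ("lo", "w"), ("e", "r")]

def segment(word, merges, dropout=0.0, rng=None):
    """Segment `word` with BPE merges; skip each candidate merge with probability `dropout`."""
    if rng is None:
        rng = random.Random()
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from single characters
    while len(symbols) > 1:
        # Adjacent pairs that are in the merge table and survive dropout this step.
        candidates = [
            (ranks[(a, b)], i)
            for i, (a, b) in enumerate(zip(symbols, symbols[1:]))
            if (a, b) in ranks and rng.random() >= dropout
        ]
        if not candidates:
            break  # nothing left to merge (or every candidate was dropped)
        _, i = min(candidates)  # apply the highest-priority surviving merge
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols

# Without dropout the result is always the same segmentation.
print(segment("lower", MERGES))  # ['low', 'er']

# With dropout, repeated calls can return different segmentations of the same word.
rng = random.Random(0)
for _ in range(3):
    print(segment("lower", MERGES, dropout=0.5, rng=rng))
```

With `dropout=0.0` the segmenter is fully deterministic, which is the limitation the abstract describes. Any nonzero dropout rate lets one word contribute several different subword sequences to the training data.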
Funder
Japan Society for the Promotion of Science
Subject
Electrical and Electronic Engineering, Computer Networks and Communications, Hardware and Architecture, Signal Processing, Control and Systems Engineering
Cited by
8 articles.
1. Efficient Text Compression Algorithms: Principles, Performance, and Applications;2024 5th International Conference on Intelligent Communication Technologies and Virtual Mobile Networks (ICICV);2024-03-11
2. An Automatic Assessment and Optimization Algorithm for English Translation Software Combining Deep Learning and Natural Language Processing;2024 International Conference on Electrical Drives, Power Electronics & Engineering (EDPEE);2024-02-27
3. Construction and Optimization of English Machine Translation Model Based on Hybrid Intelligent Algorithm;Lecture Notes of the Institute for Computer Sciences, Social Informatics and Telecommunications Engineering;2024
4. Design of Computer Intelligent Proofreading Algorithm for English Translation Based on Markov Model;2023 International Conference on Internet of Things, Robotics and Distributed Computing (ICIRDC);2023-12-29
5. Research on English Translation of Intangible Cultural Heritage in the Age of AIGC;Applied Mathematics and Nonlinear Sciences;2023-12-16