Affiliation:
1. Laboratory of Language Engineering and Computing, Guangdong University of Foreign Studies, 510420 Guangzhou, Guangdong, China
Abstract
Vietnamese tokenization is a challenging, fundamental problem, and algorithms that solve it are used in many natural language processing applications. In this paper, we investigate the Vietnamese tokenization problem and propose a supervised ensemble learning (SEL) framework together with an SEL-based tokenization (SELT) algorithm. Supported by a syllable-syllable frequency index data structure, the SELT algorithm combines multiple weak tokenizers to form a strong tokenizer. Within the SEL framework, we also investigate how to construct a weak tokenizer efficiently. We suggest two prediction methods for selecting a suitable dictionary, and we efficiently implement two weak tokenizers using a simple dictionary-based tokenization algorithm. The experimental results show that the SELT algorithm, integrating our weak tokenizers, achieves state-of-the-art performance on the Vietnamese tokenization task.
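To make the idea of a dictionary-based weak tokenizer concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not say which dictionary-based algorithm is used, so greedy longest matching over whitespace-separated syllables is assumed here. The function name `tokenize`, the `max_len` parameter, and the toy dictionary are all hypothetical.

```python
def tokenize(syllables, dictionary, max_len=4):
    """Greedy longest-match segmentation of a syllable list into words.

    Assumption: a word is a run of 1..max_len syllables joined by spaces,
    and `dictionary` is a set of lowercase multi-syllable words.
    """
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first, shrinking until a dictionary hit.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback word.
            if n == 1 or candidate.lower() in dictionary:
                words.append(candidate)
                i += n
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"học sinh", "sinh học"}
result = tokenize("học sinh học sinh học".split(), toy_dictionary)
print(result)  # ['học sinh', 'học sinh', 'học']
```

On this input the greedy tokenizer commits to "học sinh" twice, although a reading such as "học sinh | học | sinh học" is also possible. Overlap ambiguities of this kind are one reason a single dictionary-based tokenizer is only a weak tokenizer, and they motivate combining several of them.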
Publisher
World Scientific Pub Co Pte Lt
Subject
Artificial Intelligence, Information Systems, Control and Systems Engineering, Software
Cited by
3 articles.