Mostly-unsupervised statistical segmentation of Japanese kanji sequences

Published: 2003-06 Volume: 9 Issue: 2 Pages: 127-149
ISSN: 1351-3249
Container-title: Natural Language Engineering
Language: en
Short-container-title: Nat. Lang. Eng.

Author:

Rie Kubota Ando, Lillian Lee

Abstract

Given the lack of word delimiters in written Japanese, word segmentation is generally considered a crucial first step in processing Japanese texts. Typical Japanese segmentation algorithms rely either on a lexicon and syntactic analysis or on pre-segmented data; but these are labor-intensive, and the lexico-syntactic techniques are vulnerable to the unknown word problem. In contrast, we introduce a novel, more robust statistical method utilizing unsegmented training data. Despite its simplicity, the algorithm yields performance on long kanji sequences comparable to and sometimes surpassing that of state-of-the-art morphological analyzers over a variety of error metrics. The algorithm also outperforms another mostly-unsupervised statistical algorithm previously proposed for Chinese. Additionally, we present a two-level annotation scheme for Japanese to incorporate multiple segmentation granularities, and introduce two novel evaluation metrics, both based on the notion of a compatible bracket, that can account for multiple granularities simultaneously.

Publisher

Cambridge University Press (CUP)

Subject

Artificial Intelligence, Linguistics and Language, Language and Linguistics, Software

Cited by 13 articles.

1. Practical and Robust Chinese Word Segmentation and PoS Tagging;Chinese Language Resources;2023

2. Giving Space to Your Message: Assistive Word Segmentation for the Electronic Typing of Digital Minorities;Designing Interactive Systems Conference 2021;2021-06-28

3. Thai Words Segmentation Using an Unsupervised Learning Technique;Recent Advances in Information and Communication Technology 2020;2020

4. A Query Suggestion Workflow for Life Science IR-Systems;Journal of Integrative Bioinformatics;2014

5. Splitting Katakana Noun Compounds by Paraphrasing and Back-transliteration;Journal of Natural Language Processing;2014