Affiliation:
1. Microsoft Research (Asia), Beijing, China
2. Microsoft Research (Redmond), Washington
Abstract
This article presents a unified approach to Chinese statistical language modeling (SLM). Applying SLM techniques such as trigram language models to Chinese is challenging because (1) there is no standard definition of words in Chinese; (2) word boundaries are not marked by spaces; and (3) there is a dearth of training data. Our unified approach automatically and consistently gathers a high-quality training data set from the Web, creates a high-quality lexicon, segments the training data using this lexicon, and compresses the language model, all by using the maximum likelihood principle, which is consistent with trigram model training. We show that each of these methods leads to improvements over standard SLM techniques, and that the combined method yields the best Pinyin conversion result reported.
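To make the maximum-likelihood segmentation step concrete, the following is a minimal sketch of segmenting unspaced Chinese text with a lexicon by dynamic programming, so that the chosen word boundaries maximize the model probability. The lexicon entries, their probabilities, and the unigram scoring are illustrative assumptions only; the paper's actual system builds its lexicon automatically and scores segmentations with trigram models.

```python
import math

# Hypothetical lexicon: word -> probability (illustrative values only;
# the paper derives its lexicon and probabilities from Web training data).
LEXICON = {
    "北京": 0.02,
    "大学": 0.015,
    "北京大学": 0.008,
    "生": 0.001,
    "学生": 0.01,
}
UNKNOWN_LOG_PROB = math.log(1e-8)  # fallback score for single characters not in the lexicon


def ml_segment(sentence: str, max_word_len: int = 4):
    """Segment an unspaced sentence by maximizing total log-probability
    (Viterbi-style dynamic programming over candidate word boundaries)."""
    n = len(sentence)
    best = [float("-inf")] * (n + 1)   # best[i] = best log-prob of sentence[:i]
    best[0] = 0.0
    back = [0] * (n + 1)               # back[i] = start index of the last word ending at i
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = sentence[j:i]
            if word in LEXICON:
                log_p = math.log(LEXICON[word])
            elif len(word) == 1:
                log_p = UNKNOWN_LOG_PROB
            else:
                continue
            if best[j] + log_p > best[i]:
                best[i] = best[j] + log_p
                back[i] = j
    # Recover the best segmentation by following the back-pointers.
    words, i = [], n
    while i > 0:
        words.append(sentence[back[i]:i])
        i = back[i]
    return list(reversed(words))


print(ml_segment("北京大学生"))  # under this toy lexicon: ['北京大学', '生']
```

The same scoring-and-decoding idea extends to trigram models by carrying word-history state through the dynamic program, which is why segmentation under the maximum likelihood principle stays consistent with trigram model training.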
Publisher
Association for Computing Machinery (ACM)
Cited by
31 articles.