A Chinese dictionary construction algorithm for information retrieval

Published:2002-12 Issue:4 Volume:1 Page:281-296
ISSN:1530-0226
Container-title:ACM Transactions on Asian Language Information Processing
language:en

Author:

Jin Honglan¹, Wong Kam-Fai¹

Affiliation:

1. The Chinese University of Hong Kong

Abstract

In this article we propose a method for constructing, from raw Chinese text, a statistics-based automatic dictionary. The method makes use of local statistical information (i.e., data within a document) to identify and discard repeated string patterns, which, at an earlier stage, were substrings of legitimate words. Global statistical information (which exists throughout the entire corpus) and contextual constraints are then used for further filtering. The method can be used to alleviate the out-of-vocabulary (OOV) problem, which is commonly found in dictionary-based natural language information-processing applications, e.g., word segmentation. It can handle text corpora dynamically and, further, it does not impose any strict requirements on the size and quality of the training corpora. Based on our method, we constructed Chinese dictionaries from different Chinese corpora. We then applied the words in the constructed dictionaries to indexing in information retrieval (IR). Retrieval performance using such indexes was compared with retrieval performance using indexes produced by static dictionaries. Three Chinese corpora using various character-encoding schemes and language styles were used in the experiments. The results show that retrieval using indexes based on the constructed dictionary is effective. This implies that fully automatic Chinese dictionary construction based on dynamic data sources, e.g., from the Internet, for the purposes of IR is feasible. Drawing on the experiments, we were able to make some interesting observations: (1) using only a portion of a dictionary is enough to produce good retrieval performance, e.g., a dictionary consisting of only the 500 highest-frequency strings extracted from the NTCIR 2 Chinese corpus produced as good a retrieval result as using a more complete dictionary with over 100K entries; and (2) complete word segmentation is not a strict requirement for achieving practical information retrieval.
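
The two-stage idea in the abstract (local statistics within a document, then global statistics across the corpus) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract does not give the actual statistics, thresholds, or contextual constraints, so the n-gram length range, the `min_count` and `min_docs` thresholds, the equal-frequency substring rule, and all function names below are assumptions chosen only for illustration.

```python
from collections import Counter


def repeated_ngrams(text, min_len=2, max_len=4, min_count=2):
    """Local step (assumed): count character n-grams that repeat in one document."""
    counts = Counter(
        text[i:i + n]
        for n in range(min_len, max_len + 1)
        for i in range(len(text) - n + 1)
    )
    return {s: c for s, c in counts.items() if c >= min_count}


def drop_absorbed_substrings(candidates):
    """Discard a string when a longer candidate contains it with the same count.

    Equal counts suggest the shorter string never occurs on its own in this
    document, so it is treated as a fragment of the longer one (an assumed rule).
    """
    kept = {}
    for s, c in candidates.items():
        absorbed = any(
            len(t) > len(s) and s in t and candidates[t] == c
            for t in candidates
        )
        if not absorbed:
            kept[s] = c
    return kept


def build_dictionary(docs, min_docs=2):
    """Global step (assumed): keep strings that survive in at least min_docs documents."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(drop_absorbed_substrings(repeated_ngrams(doc)).keys())
    return {s for s, df in doc_freq.items() if df >= min_docs}


if __name__ == "__main__":
    docs = [
        "信息检索很重要信息检索",
        "信息检索和文献检索都是检索",
        "信息检索系统与信息检索",
    ]
    print(build_dictionary(docs))
```

In the first document every shorter repeated string (信息, 检索, 信息检, and so on) occurs exactly as often as 信息检索, so only the longest string is kept. In the second document 检索 repeats on its own and survives the local step, which is the kind of distinction a frequency-based substring filter is meant to draw.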

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science

Link

https://dl.acm.org/doi/pdf/10.1145/795458.795460

References: 16 articles.

1. An unsupervised iterative method for Chinese new lexicon extraction;CHANG J.-S.;Comput. Linguist. Chinese Lang. Process.,1997

2. Unknown word detection for Chinese by a corpus-based learning method;CHEN K.-J.;Comput. Linguist. Chinese Lang. Process.,1998

3. PAT-tree-based adaptive key phrase extraction for intelligent Chinese information retrieval;CHIEN L.-F.;Inf. Process. Manage.,1999

4. Important issues on Chinese information retrieval;CHIEN L.-F.;Comput. Linguist. Chinese Lang. Process.,1996

Cited by 10 articles.

1. Language discrimination by texture analysis of the image corresponding to the text;Neural Computing and Applications;2016-08-19

2. Research and Implementation of Intelligent Terminals Content Management System Based on Android Platform;Applied Mechanics and Materials;2013-09

3. Real Estate Management Information System;Lecture Notes in Electrical Engineering;2013

4. Improved N-grams Approach for Web Page Language Identification;Transactions on Computational Collective Intelligence V;2011

5. Introduction to Chinese Natural Language Processing;Synthesis Lectures on Human Language Technologies;2009-01