A Chinese dictionary construction algorithm for information retrieval

Authors:

Jin Honglan (1), Wong Kam-Fai (1)

Affiliation:

1. The Chinese University of Hong Kong

Abstract

In this article we propose a statistics-based method for automatically constructing a dictionary from raw Chinese text. The method first uses local statistical information (i.e., statistics gathered within a single document) to identify repeated string patterns and to discard those that, at an earlier stage, were found to be substrings of legitimate words. It then applies global statistical information (gathered across the entire corpus) and contextual constraints for further filtering. The method can alleviate the out-of-vocabulary (OOV) problem, which is common in dictionary-based natural language processing applications such as word segmentation. It can handle text corpora dynamically, and it imposes no strict requirements on the size or quality of the training corpora. Using this method, we constructed Chinese dictionaries from several Chinese corpora and used the words in them as index terms for information retrieval (IR). We then compared retrieval performance with these indexes against retrieval performance with indexes produced from static dictionaries. The experiments used three Chinese corpora that differ in character-encoding scheme and language style. The results show that retrieval using indexes based on the constructed dictionaries is effective, which implies that fully automatic Chinese dictionary construction from dynamic data sources (e.g., the Internet) is feasible for IR purposes. The experiments also led to two observations: (1) a portion of a dictionary is enough to produce good retrieval performance, e.g., a dictionary consisting of only the 500 highest-frequency strings extracted from the NTCIR 2 Chinese corpus produced retrieval results as good as those of a more complete dictionary with over 100K entries; and (2) complete word segmentation is not a strict requirement for practical information retrieval.
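The two-stage filtering described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the abstract gives no thresholds or formulas, so the function names (`repeated_ngrams`, `prune_substrings`, `build_dictionary`), the parameters (`max_len`, `min_count`, `min_docs`), and the specific pruning rule are all assumptions made for illustration. The local stage keeps character strings that repeat within one document and drops any string that occurs only inside a longer repeated string; the global stage keeps strings that survive in several documents. The contextual constraints mentioned in the abstract are not modeled.

```python
from collections import Counter

def repeated_ngrams(text, max_len=4, min_count=2):
    """Local statistics: character n-grams (length 2..max_len) that
    occur at least min_count times within a single document."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {s: c for s, c in counts.items() if c >= min_count}

def prune_substrings(candidates):
    """Drop a candidate when a longer candidate contains it and has the
    same count, i.e. the shorter string never appears on its own.
    (Assumed pruning rule, chosen for simplicity.)"""
    kept = {}
    for s, c in candidates.items():
        subsumed = any(
            len(t) > len(s) and s in t and candidates[t] == c
            for t in candidates
        )
        if not subsumed:
            kept[s] = c
    return kept

def build_dictionary(documents, max_len=4, min_count=2, min_docs=2):
    """Global statistics: keep strings that survive local pruning
    in at least min_docs documents of the corpus."""
    doc_freq = Counter()
    for doc in documents:
        local = prune_substrings(repeated_ngrams(doc, max_len, min_count))
        doc_freq.update(local.keys())
    return {s for s, df in doc_freq.items() if df >= min_docs}

# Toy corpus (hypothetical data, not from the paper's experiments).
docs = [
    "信息检索系统使用信息检索技术",
    "信息检索很重要信息检索很有用",
]
print(build_dictionary(docs))
```

In the toy corpus, the second document also yields the spurious local candidate 息检索很, which the global stage removes because it survives in only one document. The quadratic substring check in `prune_substrings` is acceptable for a sketch; a real corpus would call for a suffix-based index.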

Publisher

Association for Computing Machinery (ACM)

Subject

General Computer Science


