Aggregation consistency and frequency of Chinese words and characters

Author:

Arsenault Clément

Abstract

Purpose

Aims to measure syllable aggregation consistency of Romanized Chinese data in the title fields of bibliographic records. Also aims to verify if the term frequency distributions satisfy conventional bibliometric laws.

Design/methodology/approach

Uses Cooper's interindexer formula to evaluate aggregation consistency within and between two sets of Chinese bibliographic data. Compares the term frequency distributions of polysyllabic words and monosyllabic characters (for vernacular and Romanized data) with the Lotka and the generalised Zipf theoretical distributions. The fits are tested with the Kolmogorov‐Smirnov test.

Findings

Finds high internal aggregation consistency within each data set but some aggregation discrepancy between sets. Shows that word (polysyllabic) distributions satisfy Lotka's law but that character (monosyllabic) distributions do not abide by the law.

Research limitations/implications

The findings are limited to only two sets of bibliographic data (for aggregation consistency analysis) and to one set of data for the frequency distribution analysis. Only two bibliometric distributions are tested. Internal consistency within each database remains fairly high. Therefore the main argument against syllable aggregation does not appear to hold true. The analysis revealed that Chinese words and characters behave differently in terms of frequency distribution but that there is no noticeable difference between vernacular and Romanized data. The distribution of Romanized characters exhibits the worst case in terms of fit to either Lotka's or Zipf's laws, which indicates that Romanized data in aggregated form appear to be a preferable option.

Originality/value

Provides empirical data on consistency and distribution of Romanized Chinese titles in bibliographic records.
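
The goodness-of-fit step named in the abstract (comparing an observed term frequency distribution with Lotka's law using a Kolmogorov‐Smirnov statistic) can be illustrated with a minimal sketch. This is not the author's procedure: the function name `lotka_ks`, the fixed exponent of 2, the truncation of the theoretical distribution at the largest observed frequency, and the toy pinyin counts are all assumptions made for illustration, and the 1.36/sqrt(N) threshold is the usual large-sample approximation for the 0.05 level.

```python
from collections import Counter
from math import sqrt


def lotka_ks(term_counts, exponent=2.0):
    """Return (D, critical) for a Lotka fit to a term-frequency mapping.

    term_counts maps each term to its number of occurrences.
    The exponent is fixed here (an assumption); a real analysis
    would normally estimate it from the data.
    """
    # How many distinct terms occur exactly x times.
    freq_of_freq = Counter(term_counts.values())
    n_terms = sum(freq_of_freq.values())
    x_max = max(freq_of_freq)

    # Lotka's law: f(x) = C / x**exponent, normalised over 1..x_max
    # (truncating at x_max is a simplifying assumption).
    weights = [x ** -exponent for x in range(1, x_max + 1)]
    total = sum(weights)

    d_max = 0.0
    cum_obs = 0.0
    cum_theo = 0.0
    for x, w in zip(range(1, x_max + 1), weights):
        cum_obs += freq_of_freq.get(x, 0) / n_terms
        cum_theo += w / total
        # KS statistic: largest gap between the cumulative distributions.
        d_max = max(d_max, abs(cum_obs - cum_theo))

    # Asymptotic critical value at the 0.05 level.
    critical = 1.36 / sqrt(n_terms)
    return d_max, critical


# Hypothetical counts of aggregated pinyin words in a set of titles.
toy_counts = {
    "zhongguo": 4, "wenxue": 2, "lishi": 1,
    "yanjiu": 1, "shehui": 1, "jingji": 1,
}
d, crit = lotka_ks(toy_counts)
print(f"D = {d:.4f}, critical value = {crit:.4f}")
```

Under this convention, the fit is not rejected when D stays below the critical value. A sample as small as the toy one makes the threshold very permissive, so the example only shows the mechanics of the comparison.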

Publisher

Emerald

Subject

Library and Information Sciences, Information Systems


Cited by 1 article.

1. Zipf’s Law and World Military Expenditures; Peace Economics, Peace Science and Public Policy; 2016-01-01
