Information Retrieval Based upon Latent Class Analysis

Author:

Baker, Frank B. (1)

Affiliation:

1. University of Wisconsin, Laboratory of Experimental Design, Madison, Wisconsin

Abstract

The application of digital computers to the tasks of document classification, storage and retrieval holds considerable promise for solving the so-called “library problem.” Owing to the high speed and data-handling characteristics of digital computers, a number of different approaches to the “library problem” have been placed in operation [4]. Although existing systems are rather rudimentary when compared with the ultimate goal of an automated library, progress towards that goal has been made in several areas: the organization of a mass of documents through automatic indexing schemes, and the retrieval from a mass of documents of only those documents related to an information request made by a user of the library. A high proportion of existing document retrieval systems is based upon the background and skill of the system's author rather than upon a mathematical model. Although it allows considerable success in the initial stages of development, the heuristic approach has limited potential unless an underlying mathematical rationale can be found. Therefore, the present paper proposes an information retrieval system based upon Lazarsfeld's latent class analysis [11], which has mathematical foundations. Although latent class analysis was developed by Lazarsfeld [11] to analyze questionnaires, the similarity between this task and document classification suggests that the mathematical rationale for the former could also provide a useful theoretical basis for the latter.

Because the number of words contained in even a moderately sized report can exceed the capacity of most computers, some form of data reduction is a necessity. The reduction usually results in one of three levels of abstraction: abstracts of documents; key or topical words which convey the meaning of the document or abstract; and indices or tags based upon key words, which are then assigned to the document.
In general, indexing systems either assign key words to the document or use several key words to assign tags or indices to the documents. The key words or tags then serve as basic information for a retrieval system. Until a radical change in the data-handling characteristics of computers is made, it would appear likely that key words or tags will continue to serve as the raw data for information retrieval systems. Although considerable uniformity exists in the basic data introduced into an automated library, many different approaches exist as to the subsequent processing of the data. Several papers are reviewed below which illustrate some of the considerations that enter into the development of an information retrieval system.

Maron and Kuhns [8] have developed the “probabilistic indexing” scheme, which reduces the number of documents searched yet increases the retrieval of appropriate documents. In this approach, a large mass of source documents was read by human reviewers and key words were selected. The key words were then pooled into a few well-defined categories, although any given key word could appear in more than one category. The resulting categories were then assigned meaningful labels or tags, which constituted an index term list. The source documents were then re-inspected and the appropriate tag or tags assigned to each document. Document retrieval using the probabilistic indexing scheme is accomplished by presenting the computer with a series of tags and a lower bound on the relevance number, below which documents are not of sufficient importance to be retrieved. The tags locate the document, and comparing the document's relevance number with this lower bound determines whether the document should be retrieved. The high degree of dependence of the probabilistic indexing scheme upon human reviewers greatly reduces the efficiency of the method.
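The retrieval step described above, in which tags locate candidate documents and a lower bound on the relevance number filters them, can be sketched roughly as follows. This is a minimal illustration and not the procedure of Maron and Kuhns [8]: the index contents, the name `retrieve`, and the choice to score a document by its highest relevance number across the requested tags are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical index: tag -> {document id: relevance number in [0, 1]}.
# The tags, document ids and numbers are invented for illustration only.
INDEX: Dict[str, Dict[str, float]] = {
    "computers": {"doc1": 0.9, "doc2": 0.4, "doc3": 0.7},
    "libraries": {"doc2": 0.8, "doc3": 0.2},
}


def retrieve(
    index: Dict[str, Dict[str, float]],
    request_tags: List[str],
    lower_bound: float,
) -> List[Tuple[str, float]]:
    """Return (document, relevance) pairs that meet the lower bound.

    A document is kept if its relevance number for at least one requested
    tag is >= lower_bound. Scoring by the maximum over tags is an
    assumption of this sketch, not something stated in the text.
    """
    best: Dict[str, float] = {}
    for tag in request_tags:
        # Tags locate the candidate documents; unknown tags locate none.
        for doc, relevance in index.get(tag, {}).items():
            if relevance >= lower_bound:
                best[doc] = max(relevance, best.get(doc, 0.0))
    # Most relevant first; ties are broken by document id for stability.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    for doc, relevance in retrieve(INDEX, ["computers", "libraries"], 0.5):
        print(doc, relevance)
```

Raising the lower bound shrinks the result list, which is the sense in which the relevance number limits the documents retrieved.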
If the number of documents, key words and tags were large, a human reviewer would not be able to maintain a consistent frame of reference when assigning tags and relevance numbers. The unique contribution of the probabilistic indexing scheme, however, is the use of relevance numbers in conjunction with the indices. The number provides a basis for determining the relevance of the stored documents to the index terms used by the requester of information.

Stiles [10] has also reported the use of an association factor to accompany the index terms assigned to a document. The factor expresses the discrepancy between the observed joint occurrence and the expected joint occurrence of a pair of index terms, assuming independence. The association factor employed was the χ² value obtained from a two-fold (2 × 2) contingency table involving the pair of index terms. A correlation coefficient, such as the tetrachoric r, which expresses the correlation within the two-fold table, would have been more appropriate in the present context than a chi-square value expressing lack of independence. Stiles [10], however, reports that the use of the association factor was found to improve document retrieval.

A more intensive study of the inter-relationships among words within a document was performed by Doyle [2]. The joint occurrences of word pairs in a body of 600 documents served as the basic data of the study. Two types of word correlations were found to exist within word pairs: adjacent correlations, resulting from words which appear in pairs due to the nature of our language; and proximal correlations, due to words which are logically related but appear at non-adjacent positions within a document. The statistical effects of these two correlations were denoted language redundancy and reality redundancy, respectively. In addition, a third type of redundancy, documentation redundancy, resulted when more than one document could be classified by a given set of key words.
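The chi-square association factor for a pair of index terms, mentioned earlier in this passage, can be illustrated with a small worked computation. The sketch below uses the standard Pearson χ² for a 2 × 2 table without a continuity correction; whether this matches the exact form used by Stiles [10] is not stated in the text, and the function names and counts are hypothetical.

```python
def expected_joint(both: int, only_a: int, only_b: int, neither: int) -> float:
    """Expected number of documents carrying both terms under independence."""
    n = both + only_a + only_b + neither
    return (both + only_a) * (both + only_b) / n


def chi_square_2x2(both: int, only_a: int, only_b: int, neither: int) -> float:
    """Pearson chi-square for a 2 x 2 table of index-term co-occurrence.

    both    : documents indexed with term A and term B
    only_a  : documents indexed with A but not B
    only_b  : documents indexed with B but not A
    neither : documents indexed with neither term
    No continuity correction is applied (an assumption of this sketch).
    """
    n = both + only_a + only_b + neither
    with_a = both + only_a
    without_a = only_b + neither
    with_b = both + only_b
    without_b = only_a + neither
    denominator = with_a * without_a * with_b * without_b
    if denominator == 0:
        raise ValueError("a marginal total is zero; chi-square is undefined")
    # Shortcut formula: N (ad - bc)^2 / product of the four marginal totals.
    return n * (both * neither - only_a * only_b) ** 2 / denominator


if __name__ == "__main__":
    # Hypothetical counts for one pair of index terms over 100 documents.
    print(expected_joint(30, 10, 10, 50))
    print(chi_square_2x2(30, 10, 10, 50))
```

A value near zero means the observed joint occurrence is close to what independence predicts. Note that χ² alone does not show whether the terms co-occur more or less often than expected, which is one reason a correlation coefficient may be preferred.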
The effect of language redundancy can be reduced by pooling adjacent key words and treating the pair as a single key word. Documentation redundancy can be reduced by pooling similar documents and assigning a single label to the batch, thus eliminating unnecessary duplication of effort. Reality redundancy, however, is the result of the author's cognitive processes, and the degree to which the literature researcher can duplicate this redundancy determines how successfully the original document can be retrieved. This study indicates that an important function of an information retrieval system is machinery for reducing the effects of language and documentation redundancy so that important relationships are not obscured.

The results of the three studies reviewed above indicate that document retrieval can be improved if the documents are surveyed for documentation redundancy and if the relationships among the key words are filtered to remove language redundancy. In addition, the use of a relevance number relating the document and key words appears to increase the efficiency of document retrieval.
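Pooling adjacent key words into a single key word, as described above for language redundancy, might look roughly like the following. This is only a sketch: the rule for deciding which pairs to pool (an adjacent pair seen at least `min_count` times across the collection), the underscore joiner, and the sample key-word lists are assumptions, not details taken from Doyle [2].

```python
from collections import Counter
from typing import List, Set, Tuple

Pair = Tuple[str, str]


def frequent_pairs(docs: List[List[str]], min_count: int) -> Set[Pair]:
    """Adjacent key-word pairs occurring at least min_count times overall."""
    counts: Counter = Counter()
    for words in docs:
        # zip(words, words[1:]) yields each pair of neighbouring key words.
        counts.update(zip(words, words[1:]))
    return {pair for pair, count in counts.items() if count >= min_count}


def pool_adjacent(words: List[str], pairs: Set[Pair]) -> List[str]:
    """Replace each listed adjacent pair with one combined key word."""
    pooled: List[str] = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in pairs:
            pooled.append(words[i] + "_" + words[i + 1])
            i += 2  # both halves of the pair are consumed
        else:
            pooled.append(words[i])
            i += 1
    return pooled


if __name__ == "__main__":
    # Hypothetical ordered key-word lists for three documents.
    docs = [
        ["information", "retrieval", "system"],
        ["digital", "computer", "information", "retrieval"],
        ["library", "information", "retrieval"],
    ]
    pairs = frequent_pairs(docs, min_count=2)
    for words in docs:
        print(pool_adjacent(words, pairs))
```

After pooling, the combined key word is counted once, so the adjacent pair no longer inflates the co-occurrence statistics of its two halves.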

Publisher

Association for Computing Machinery (ACM)

Subject

Artificial Intelligence, Hardware and Architecture, Information Systems, Control and Systems Engineering, Software

Cited by 26 articles.