The automatic identification of stop words

Published:1992-02 Issue:1 Volume:18 Page:45-55
ISSN:0165-5515
Container-title:Journal of Information Science
language:en
Short-container-title:Journal of Information Science

Author:

Wilbur W. John¹, Sirotkin Karl¹

Affiliation:

1. National Center for Biotechnology Information, Bethesda, MD, USA

Abstract

A stop word may be identified as a word that has the same likelihood of occurring in those documents not relevant to a query as in those documents relevant to the query. In this paper we show how the concept of relevance may be replaced by the condition of being highly rated by a similarity measure. Thus it becomes possible to identify the stop words in a collection by automated statistical testing. We describe the nature of the statistical test as it is realized with a vector retrieval methodology based on the cosine coefficient of document-document similarity. As an example, this technique is then applied to a large MEDLINE subset in the area of biotechnology. The initial processing of this database involves a 310 word stop list of common non-content terms. Our technique is then applied and 75% of the remaining terms are identified as stop words. We compare retrieval with and without the removal of these stop words and find that of the top twenty documents retrieved in response to a random query document, seventeen of these are the same on the average for the two methods. We also examine the differences and conclude that where the user prefers one method over the other, the new method with the reduced term set is favored about three times out of four.
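The two quantities the abstract relies on, the cosine coefficient between documents and the overlap between two top-ranked retrieval lists, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes raw term counts as vector weights and whitespace tokenization, neither of which the abstract specifies, and it does not reproduce the statistical test itself. The function names (`vectorize`, `cosine`, `top_k`, `overlap`) are hypothetical.

```python
import math
from collections import Counter


def vectorize(text, stop_words=frozenset()):
    """Bag-of-words term counts, skipping any term in stop_words.
    Assumption: lowercase whitespace tokens and raw counts as weights."""
    return Counter(w for w in text.lower().split() if w not in stop_words)


def cosine(a, b):
    """Cosine coefficient between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, docs, k, stop_words=frozenset()):
    """Indices of the k documents most similar to the query document.
    Ties are broken by document index so the ranking is deterministic."""
    q = vectorize(query, stop_words)
    scored = [(cosine(q, vectorize(d, stop_words)), i) for i, d in enumerate(docs)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:k]]


def overlap(ranking_a, ranking_b):
    """Number of documents that appear in both retrieved lists."""
    return len(set(ranking_a) & set(ranking_b))
```

With `k=20`, calling `top_k` once with an empty stop set and once with a candidate stop list, then passing both results to `overlap`, gives the kind of "how many of the top twenty are the same" figure the abstract reports.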

Publisher

SAGE Publications

Subject

Library and Information Sciences, Information Systems

Link

http://journals.sagepub.com/doi/pdf/10.1177/016555159201800106

Reference 14 articles.

1. A document retrieval system based on nearest neighbour searching

2. An algorithm for suffix stripping

Cited by 167 articles.

1. A novel redistribution-based feature selection for text classification;Expert Systems with Applications;2024-07

2. An Empirical Analysis of Rebalancing Methods for Security Issue Report Identification;2023 IEEE 28th Pacific Rim International Symposium on Dependable Computing (PRDC);2023-10-24

3. Statistical clustering of documents via stochastic blockmodels;Journal of Applied Statistics;2023-09-01

4. HBDFA: An intelligent nature-inspired computing with high-dimensional data analytics;Multimedia Tools and Applications;2023-06-29

5. Ontology based Feature Selection and Weighting for Text classification using Machine Learning;Journal of Information Technology and Computing;2023-06-27