Affiliation:
1. San Francisco State University, San Francisco, CA
2. Carnegie Mellon University, Pittsburgh, PA
Abstract
The traditional search solution for large collections divides the collection into subsets (shards), and processes the query against all shards in parallel (exhaustive search). The search cost and the computational requirements of this approach are often prohibitively high for organizations with few computational resources. This article investigates and extends an alternative: selective search, an approach that partitions the dataset based on document similarity to obtain topic-based shards, and searches only a few shards that are estimated to contain relevant documents for the query. We propose shard creation techniques that are scalable, efficient, self-reliant, and create topic-based shards with low variance in size, and high density of relevant documents.
The experimental results demonstrate that the effectiveness of selective search is on par with that of exhaustive search, and the corresponding search costs are substantially lower with the former. Also, the majority of the queries perform as well or better with selective search. An oracle experiment that uses optimal shard ranking for a query indicates that selective search can outperform the effectiveness of exhaustive search. Comparison with a query optimization technique shows higher improvements in efficiency with selective search. The overall best efficiency is achieved when the two techniques are combined in an optimized selective search approach.
Funder
National Science Foundation
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Science Applications; General Business, Management and Accounting; Information Systems
Cited by
35 articles.