Random sampling from a search engine's index

Published: 2008-10 Issue: 5 Volume: 55 Page: 1-74
ISSN: 0004-5411
Container-title: Journal of the ACM
Language: en
Short-container-title: J. ACM

Author:

Bar-Yossef Ziv¹, Gurevich Maxim²

Affiliation:

1. Technion and Google Haifa, Haifa, Israel

2. Technion, Haifa, Israel

Abstract

We revisit a problem introduced by Bharat and Broder almost a decade ago: How to sample random pages from the corpus of documents indexed by a search engine, using only the search engine's public interface? Such a primitive is particularly useful in creating objective benchmarks for search engines. The technique of Bharat and Broder suffers from a well-recorded bias: it favors long documents. In this article we introduce two novel sampling algorithms: a lexicon-based algorithm and a random walk algorithm. Our algorithms produce biased samples, but each sample is accompanied by a weight, which represents its bias. The samples, in conjunction with the weights, are then used to simulate near-uniform samples. To this end, we resort to four well-known Monte Carlo simulation methods: rejection sampling, importance sampling, the Metropolis-Hastings algorithm, and the Maximum Degree method. The limited access to search engines forces our algorithms to use bias weights that are only “approximate”. We characterize analytically the effect of approximate bias weights on Monte Carlo methods and conclude that our algorithms are guaranteed to produce near-uniform samples from the search engine's corpus. Our study of approximate Monte Carlo methods could be of independent interest. Experiments on a corpus of 2.4 million documents substantiate our analytical findings and show that our algorithms do not have significant bias towards long documents. We use our algorithms to collect comparative statistics about the corpora of the Google, MSN Search, and Yahoo! search engines.
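
The idea of pairing biased samples with weights can be illustrated with a minimal sketch. This is not the authors' lexicon-based or random walk algorithm: it assumes a hypothetical three-document corpus (`LENGTHS`), a sampler that favors long documents in proportion to their length, and bias weights that are known exactly (the article works with approximate weights). All function names are illustrative. The sketch shows generic rejection sampling and self-normalized importance sampling, two of the four Monte Carlo methods the abstract names.

```python
import random
from collections import Counter

# Hypothetical toy corpus: document id -> length. Not data from the article.
LENGTHS = {"a": 1, "b": 2, "c": 10}


def biased_sample(rng):
    """Draw one document with probability proportional to its length."""
    docs = list(LENGTHS)
    return rng.choices(docs, weights=[LENGTHS[d] for d in docs], k=1)[0]


def bias_weight(doc):
    """Bias weight of a document: here exactly its length (assumed known)."""
    return LENGTHS[doc]


def rejection_sample(n, rng):
    """Collect n uniform samples by accepting each biased draw with
    probability min_w / w, which cancels the length bias."""
    min_w = min(LENGTHS.values())
    accepted = []
    while len(accepted) < n:
        doc = biased_sample(rng)
        if rng.random() < min_w / bias_weight(doc):
            accepted.append(doc)
    return accepted


def importance_estimate(f, n, rng):
    """Self-normalized estimate of the uniform average of f, computed
    from n biased draws reweighted by 1 / w."""
    draws = [biased_sample(rng) for _ in range(n)]
    num = sum(f(d) / bias_weight(d) for d in draws)
    den = sum(1 / bias_weight(d) for d in draws)
    return num / den


rng = random.Random(0)
samples = rejection_sample(30000, rng)
counts = Counter(samples)
mean_length = importance_estimate(lambda d: LENGTHS[d], 20000, rng)
print(counts, mean_length)
```

Under these assumptions, the three documents should appear in roughly equal numbers among the accepted samples, and `mean_length` should be close to the true uniform average length of 13/3, even though the underlying sampler returns the longest document about ten times as often as the shortest.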

Publisher

Association for Computing Machinery (ACM)

Subject

Artificial Intelligence, Hardware and Architecture, Information Systems, Control and Systems Engineering, Software

Link

https://dl.acm.org/doi/pdf/10.1145/1411509.1411514

References: 41 articles.

1. On the Markov Chain Simulation Method for Uniform Combinatorial Distributions and Simulated Annealing

2. Sampling search-engine results

3. Efficient search engine measurements

4. Battelle J. 2005. John Battelle's searchblog. http://battellemedia.com/archives/001889.php.

Cited by 61 articles.

1. Weighted Jump in Random Walk graph sampling; Neurocomputing; 2024-06

2. Sampling Individually-Fair Rankings that are Always Group Fair; Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society; 2023-08-08

3. Empirical characterization of graph sampling algorithms; Social Network Analysis and Mining; 2023-04-08

4. Early warning indicators of epidemics on a coupled behaviour-disease model with vaccine hesitance and incomplete data; Journal of Dynamics and Games; 2023

5. CS- and GA-based hybrid evolutionary sampling algorithm for large-scale social networks; Social Network Analysis and Mining; 2021-11-09