Tractable near-optimal policies for crawling


Azar Yossi, Horvitz Eric, Lubetzky Eyal, Peres Yuval, Shahaf Dafna


The problem of maintaining a local cache of n constantly changing pages arises in multiple mechanisms such as web crawlers and proxy servers. In these, the resources for polling pages for possible updates are typically limited. The goal is to devise a polling and fetching policy that maximizes the utility of served pages that are up to date. Cho and Garcia-Molina [(2003) ACM Trans Database Syst 28:390–426] formulated this as an optimization problem, which can be solved numerically for small values of n, but appears intractable in general. Here, we show that the optimal randomized policy can be found exactly in O(n log n) operations. Moreover, using the optimal probabilities to define in linear time a deterministic schedule yields a tractable policy that in experiments attains 99% of the optimum.
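To make the step from polling probabilities to a deterministic schedule concrete, here is a minimal sketch. It assumes that exactly one page is polled per time step and that the polling probabilities are already known. The rounding rule (always poll the page whose next "due time" is earliest, then push its due time forward by 1/p) is a generic illustration chosen for brevity, not the construction used in the paper, and it runs in O(steps · log n) rather than linear time. The function name `deterministic_schedule` and its parameters are hypothetical.

```python
import heapq


def deterministic_schedule(probs, steps):
    """Turn polling probabilities into a fixed polling order.

    probs: hypothetical list of per-page polling probabilities (should sum to 1).
    steps: number of time steps to schedule, one poll per step.
    Returns a list of page indices, one per step.
    """
    # Page i should be polled roughly once every 1/p_i steps, so its
    # first "due time" is 1/p_i. Pages with probability 0 are never polled.
    heap = [(1.0 / p, i) for i, p in enumerate(probs) if p > 0]
    if not heap:
        raise ValueError("at least one page needs a positive probability")
    heapq.heapify(heap)

    schedule = []
    for _ in range(steps):
        # Poll the page that is due earliest (ties broken by page index).
        due, i = heapq.heappop(heap)
        schedule.append(i)
        # Its next poll is due one period (1/p_i) later.
        heapq.heappush(heap, (due + 1.0 / probs[i], i))
    return schedule


if __name__ == "__main__":
    # Page 0 is polled half the time, pages 1 and 2 a quarter each.
    print(deterministic_schedule([0.5, 0.25, 0.25], 8))
    # prints [0, 0, 1, 2, 0, 0, 1, 2]
```

Over a long horizon each page is polled with a frequency close to its probability, which is the property a deterministic schedule derived from a randomized policy needs to preserve. This sketch does not try to space a page's polls evenly, which a more careful construction would.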


Proceedings of the National Academy of Sciences



References (17 articles)

1. Cho J, Garcia-Molina H (2003) Effective page refresh policies for web crawlers. ACM Trans Database Syst 28:390–426.

2. Cho J, Garcia-Molina H (2000) Synchronizing a database to improve freshness. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, TX (Association for Computing Machinery, New York), 117–128.

3. Cho (2003) Estimating frequency of change. ACM Trans Internet Technol.

4. Coffman (1998) Optimal robot scheduling for web search engines. J Scheduling.

5. Castillo C (2004) Effective web crawling. PhD thesis (University of Chile, Santiago). Available at .

Cited by 15 articles.






