Do not crawl in the DUST

Published:2009-01 Issue:1 Volume:3 Page:1-31
ISSN:1559-1131
Container-title:ACM Transactions on the Web
language:en
Short-container-title:ACM Trans. Web

Author:

Bar-Yossef Ziv¹, Keidar Idit¹, Schonfeld Uri²

Affiliation:

1. Technion Israel Institute of Technology, Haifa, Israel

2. University of California Los Angeles, CA

Abstract

We consider the problem of DUST: Different URLs with Similar Text. Such duplicate URLs are prevalent in Web sites, as Web server software often uses aliases and redirections, and dynamically generates the same page from various different URL requests. We present a novel algorithm, DustBuster, for uncovering DUST; that is, for discovering rules that transform a given URL to others that are likely to have similar content. DustBuster mines DUST effectively from previous crawl logs or Web server logs, without examining page contents. Verifying these rules via sampling requires fetching few actual Web pages. Search engines can benefit from information about DUST to increase the effectiveness of crawling, reduce indexing overhead, and improve the quality of popularity statistics such as PageRank.

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications

Link

https://dl.acm.org/doi/pdf/10.1145/1462148.1462151

Reference: 29 articles.

1. Apache. 2008. Apache HTTP server version 2.2 configuration files. http://httpd.apache.org/docs/2.2/configuring.html.

2. Analog. 2008. Analog homepage. http://www.analog.cx/.

3. Berners-Lee, T., Fielding, R., and Masinter, L. Uniform resource identifiers (URI): Generic syntax. http://www.ietf.org/rfc/rfc2396.txt.

4. Mirror, mirror on the Web: a study of host pairs with replicated content

Cited by 23 articles.

1. Detecting News Influence in a Country: One Step Forward Towards Understanding Fake News;Studies in Computational Intelligence;2021-12-16

2. DSDD;Proceedings of the 30th ACM International Conference on Information & Knowledge Management;2021-10-26

3. A fast text similarity measure for large document collections using multireference cosine and genetic algorithm;Turkish Journal of Electrical Engineering & Computer Sciences;2020-03-28

4. NestMSA: a new multiple sequence alignment algorithm;The Journal of Supercomputing;2020-02-19

5. Search Engine Similarity Analysis: A Combined Content and Rankings Approach;Web Information Systems Engineering – WISE 2020;2020