Affiliation:
1. Department of Computer Science, Stanford, CA
Abstract
Many web documents (such as JAVA FAQs) are being replicated on the Internet. Often entire document collections (such as hyperlinked Linux manuals) are being replicated many times. In this paper, we make the case for identifying replicated documents and collections to improve web crawlers, archivers, and ranking functions used in search engines. The paper describes how to efficiently identify replicated documents and hyperlinked document collections. The challenge is to identify these replicas from an input data set of several tens of millions of web pages and several hundreds of gigabytes of textual data. We also present two real-life case studies where we used replication information to improve a crawler and a search engine. We report these results for a data set of 25 million web pages (about 150 gigabytes of HTML data) crawled from the web.
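The abstract describes the approach only at a high level. As a minimal illustration of the general idea of replica detection (not the paper's actual algorithm, which also handles near-replicas and hyperlinked collection-level replication), one can fingerprint each crawled page and group pages whose fingerprints collide; every name and URL below is hypothetical.

    import hashlib
    from collections import defaultdict

    def fingerprint(html: str) -> str:
        # Hash the normalized text of a page; exact replicas share a fingerprint.
        normalized = " ".join(html.split()).lower()
        return hashlib.md5(normalized.encode("utf-8")).hexdigest()

    def group_replicas(pages):
        # Group (url, html) pairs whose content hashes collide into replica sets.
        groups = defaultdict(list)
        for url, html in pages:
            groups[fingerprint(html)].append(url)
        # Keep only fingerprints shared by two or more URLs.
        return [urls for urls in groups.values() if len(urls) > 1]

    # Hypothetical usage:
    pages = [
        ("http://site-a.example/java-faq.html", "<html>Java FAQ ...</html>"),
        ("http://mirror-b.example/java-faq.html", "<html>Java FAQ ...</html>"),
    ]
    print(group_replicas(pages))

A crawler could consult such replica groups to skip already-seen mirrors, and a search engine could collapse them in result rankings, which is how the paper's two case studies use replication information.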
Publisher
Association for Computing Machinery (ACM)
Subject
Information Systems, Software
Cited by
30 articles.