IRLbot

Published:2009-06
Issue:3
Volume:3
Page:1-34
ISSN:1559-1131
Container-title:ACM Transactions on the Web
Language:en
Short-container-title:ACM Trans. Web

Author:

Lee Hsin-Tsang¹, Leonard Derek¹, Wang Xiaoming¹, Loguinov Dmitri¹

Affiliation:

1. Texas A&M University, College Station, TX

Abstract

This article shares our experience in designing a Web crawler that can download billions of pages using a single-server implementation and models its performance. We first show that current crawling algorithms cannot effectively cope with the sheer volume of URLs generated in large crawls, highly branching spam, legitimate multimillion-page blog sites, and infinite loops created by server-side scripts. We then offer a set of techniques for dealing with these issues and test their performance in an implementation we call IRLbot. In our recent experiment that lasted 41 days, IRLbot running on a single server successfully crawled 6.3 billion valid HTML pages (7.6 billion connection requests) and sustained an average download rate of 319 mb/s (1,789 pages/s). Unlike our prior experiments with algorithms proposed in related work, this version of IRLbot did not experience any bottlenecks and successfully handled content from over 117 million hosts, parsed out 394 billion links, and discovered a subset of the Web graph with 41 billion unique nodes.

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications

Link

https://dl.acm.org/doi/pdf/10.1145/1541822.1541823

References: 39 articles.

1. Adaptive on-line page importance computation

2. Searching the Web

3. UbiCrawler: a scalable fully distributed Web crawler

Cited by 202 articles.

1. Combining unsupervised constraints on weakly supervised semantic segmentation of skin cancer;Biomedical Physics & Engineering Express;2024-08-12

2. 3D object recognition using deep learning for automatically generating semantic BIM data;Automation in Construction;2024-06

3. DDoS attack detection in smart grid network using reconstructive machine learning models;PeerJ Computer Science;2024-01-09

4. AudioMNIST: Exploring Explainable Artificial Intelligence for audio analysis on a simple benchmark;Journal of the Franklin Institute;2024-01

5. Open Set Domain Adaptation for Classification of Dynamical States in Nonlinear Fluid Dynamical Systems;IEEE Access;2024