Google's Deep Web crawl

Published:2008-08 Issue:2 Volume:1 Page:1241-1252
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Madhavan Jayant¹,Ko David¹,Kot Łucja²,Ganapathy Vignesh¹,Rasmussen Alex³,Halevy Alon¹

Affiliation:

1. Google Inc.

2. Cornell University

3. University of California, San Diego

Abstract

The Deep Web, i.e., content hidden behind HTML forms, has long been acknowledged as a significant gap in search engine coverage. Since it represents a large portion of the structured data on the Web, accessing Deep-Web content has been a long-standing challenge for the database community. This paper describes a system for surfacing Deep-Web content, i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index. The results of our surfacing have been incorporated into the Google search engine and today drive more than a thousand queries per second to Deep-Web content. Surfacing the Deep Web poses several challenges. First, our goal is to index the content behind many millions of HTML forms that span many languages and hundreds of domains. This necessitates an approach that is completely automatic, highly scalable, and very efficient. Second, a large number of forms have text inputs and require valid input values to be submitted. We present an algorithm for selecting input values for text search inputs that accept keywords and an algorithm for identifying inputs which accept only values of a specific type. Third, HTML forms often have more than one input and hence a naive strategy of enumerating the entire Cartesian product of all possible inputs can result in a very large number of URLs being generated. We present an algorithm that efficiently navigates the search space of possible input combinations to identify only those that generate URLs suitable for inclusion into our web search index. We present an extensive experimental evaluation validating the effectiveness of our algorithms.
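
To make the third challenge concrete, the following minimal sketch shows the naive strategy the abstract warns against: enumerating one GET URL for every combination of input values. It is an illustration only, not the paper's algorithm. The form action URL, the input names, and the option values are all hypothetical.

```python
from itertools import product
from math import prod
from urllib.parse import urlencode

# Hypothetical form: the action URL, input names, and option values
# are invented for illustration and do not come from the paper.
FORM_ACTION = "https://example.com/search"
FORM_INPUTS = {
    "make": ["ford", "honda", "toyota"],
    "state": ["ca", "ny"],
    "condition": ["new", "used"],
}


def count_submissions(inputs):
    """Size of the Cartesian product: the product of per-input option counts."""
    return prod(len(values) for values in inputs.values())


def enumerate_urls(action, inputs):
    """Yield one GET URL per combination of input values (the naive strategy)."""
    names = list(inputs)
    for combo in product(*(inputs[name] for name in names)):
        yield action + "?" + urlencode(dict(zip(names, combo)))


urls = list(enumerate_urls(FORM_ACTION, FORM_INPUTS))
print(count_submissions(FORM_INPUTS))  # 3 * 2 * 2 = 12
print(urls[0])
```

Even this toy form yields 12 URLs. Because the count is the product of the option counts, a form with five inputs of ten options each would already produce 100,000 URLs, which is why the abstract calls for an algorithm that selects only the useful combinations.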

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/1454159.1454163

Cited by 116 articles.

1. Data distribution tailoring revisited: cost-efficient integration of representative data;The VLDB Journal;2024-04-12

2. FLASH;Robotic Process Automation;2023-08-25

3. Synthesis of multilevel knowledge graphs: Methods and technologies for dynamic networks;Engineering Applications of Artificial Intelligence;2023-08

4. Effective Entity Augmentation by Querying External Data Sources;Proceedings of the VLDB Endowment;2023-07

5. The Deep Web;Understanding Search Engines;2023