Optimal algorithms for crawling a hidden database in the web

Published:2012-07 Issue:11 Volume:5 Page:1112-1123
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
Language:en
Short-container-title:Proc. VLDB Endow.

Author:

Sheng Cheng¹,Zhang Nan²,Tao Yufei³,Jin Xin²

Affiliation:

1. Chinese University of Hong Kong

2. George Washington University

3. Chinese University of Hong Kong and Korea Advanced Institute of Science and Technology

Abstract

A hidden database refers to a dataset that an organization makes accessible on the web by allowing users to issue queries through a search interface. In other words, data cannot be acquired from such a source by following static hyperlinks. Instead, data are obtained by querying the interface and reading the dynamically generated result page. This, together with other facts such as that the interface may answer a query only partially, has prevented hidden databases from being crawled effectively by existing search engines. This paper remedies the problem by giving algorithms to extract all the tuples from a hidden database. Our algorithms are provably efficient: they accomplish the task by performing only a small number of queries, even in the worst case. We also establish theoretical results indicating that these algorithms are asymptotically optimal, i.e., it is impossible to improve their efficiency by more than a constant factor. The derivation of our upper and lower bound results reveals significant insight into the characteristics of the underlying problem. Extensive experiments confirm that the proposed techniques work very well on all the real datasets examined.
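The problem setting described in the abstract can be illustrated with a toy simulation. The sketch below is not the paper's algorithm; it is a naive range-splitting crawler under assumed conditions: a single integer attribute, an interface that returns at most `k` tuples per query, and an overflow flag that signals a partial answer. The names `HiddenDB`, `query`, and `crawl` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


class HiddenDB:
    """Toy search interface (an assumption, not a real API).

    A range query over one integer attribute returns at most k tuples,
    plus a flag saying whether the answer was truncated.
    """

    def __init__(self, values: List[int], k: int = 3) -> None:
        self._values = sorted(values)
        self._k = k
        self.queries = 0  # number of queries issued so far

    def query(self, lo: int, hi: int) -> Tuple[List[int], bool]:
        """Return (tuples, overflow) for all values v with lo <= v <= hi."""
        self.queries += 1
        matches = [v for v in self._values if lo <= v <= hi]
        return matches[: self._k], len(matches) > self._k


def crawl(db: HiddenDB, lo: int, hi: int) -> List[int]:
    """Extract every tuple in [lo, hi] by halving any range that overflows."""
    tuples, overflow = db.query(lo, hi)
    if not overflow:
        # The interface answered completely; nothing is hidden in this range.
        return tuples
    if lo == hi:
        # More than k tuples share one value, so this attribute alone
        # cannot separate them.
        raise ValueError(f"cannot resolve overflow at single value {lo}")
    mid = (lo + hi) // 2
    return crawl(db, lo, mid) + crawl(db, mid + 1, hi)


db = HiddenDB([1, 5, 7, 8, 12, 20, 21, 22, 30], k=3)
result = crawl(db, 0, 31)
print(result, "queries issued:", db.queries)
```

The query counter makes the cost measure from the abstract concrete: the crawler is judged by how many queries it issues before all tuples are retrieved. Naive halving of this kind carries no worst-case guarantee of the sort the paper claims for its algorithms.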

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/2350229.2350232

Cited by 16 articles.

1. Data distribution tailoring revisited: cost-efficient integration of representative data;The VLDB Journal;2024-04-12

2. Decision tree Thompson sampling for mining hidden populations through attributed search;Social Network Analysis and Mining;2021-11-15

3. A third-party replication service for dynamic hidden databases;Service Oriented Computing and Applications;2021-01-08

4. CRUX;Proceedings of the 28th ACM International Conference on Information and Knowledge Management;2019-11-03

5. Social Security and Privacy for Social IoT Polymorphic Value Set: A Solution to Inference Attacks on Social Networks;Security and Communication Networks;2019-08-28