QProber

Published: 2003-01 Issue: 1 Volume: 21 Page: 1-41
ISSN: 1046-8188
Container-title: ACM Transactions on Information Systems
Language: en
Short-container-title: ACM Trans. Inf. Syst.

Author:

Gravano Luis¹, Ipeirotis Panagiotis G.¹, Sahami Mehran²

Affiliation:

1. Columbia University, New York, NY

2. Stanford University, Stanford, CA

Abstract

The contents of many valuable Web-accessible databases are only available through search interfaces and are hence invisible to traditional Web "crawlers." Recently, commercial Web sites have started to manually organize Web-accessible databases into Yahoo!-like hierarchical classification schemes. Here we introduce QProber, a modular system that automates this classification process by using a small number of query probes, generated by document classifiers. QProber can use a variety of types of classifiers to generate the probes. To classify a database, QProber does not retrieve or inspect any documents or pages from the database, but rather just exploits the number of matches that each query probe generates at the database in question. We have conducted an extensive experimental evaluation of QProber over collections of real documents, experimenting with different types of document classifiers and retrieval models. We have also tested our system with over one hundred Web-accessible databases. Our experiments show that our system has low overhead and achieves high classification accuracy across a variety of databases.
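The probing idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function `classify_database`, the `count_matches` callback, the `min_share` threshold, and the sample probes and counts are all hypothetical, chosen only to show how a database might be classified from match counts alone, without retrieving any documents.

```python
from typing import Callable, Dict, List


def classify_database(
    probes: Dict[str, List[str]],
    count_matches: Callable[[str], int],
    min_share: float = 0.4,  # hypothetical threshold, not taken from the paper
) -> List[str]:
    """Return categories whose probes account for at least min_share of all matches."""
    # Sum the match counts reported for each category's probes.
    totals = {
        category: sum(count_matches(query) for query in queries)
        for category, queries in probes.items()
    }
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []  # no probe matched anything; leave the database unclassified
    return sorted(c for c, n in totals.items() if n / grand_total >= min_share)


# Toy stand-in for a search interface: it reports only the number of matches.
fake_counts = {"ibm AND computers": 20, "cancer AND lung": 950, "soccer AND goal": 30}
probes = {
    "Computers": ["ibm AND computers"],
    "Health": ["cancer AND lung"],
    "Sports": ["soccer AND goal"],
}
result = classify_database(probes, lambda q: fake_counts.get(q, 0))
print(result)  # ['Health']
```

In this toy run the "Health" probe accounts for 950 of 1000 reported matches, so only that category clears the assumed threshold. A real system would obtain the counts from the database's search interface and derive the probes from a trained document classifier, as the abstract describes.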

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications; General Business, Management and Accounting; Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/635484.635485

References: 53 articles.

1. Automated learning of decision rules for text categorization

Cited by 58 articles.

1. Design of a Parallel and Scalable Crawler for the Hidden Web;International Journal of Information Retrieval Research;2022-01

2. Searching Digital Libraries;Encyclopedia of Database Systems;2018

3. Optimal Query Generation for Hidden Web Extraction Through Response Analysis;The Dark Web;2018

4. A survey of Web crawlers for information retrieval;WIREs Data Mining and Knowledge Discovery;2017-08-07

5. Sampling strategies for information extraction over the deep web;Information Processing & Management;2017-03