Towards a query optimizer for text-centric tasks

Author:

Ipeirotis Panagiotis G.1,Agichtein Eugene2,Jain Pranay3,Gravano Luis3

Affiliation:

1. New York University, New York, NY

2. Emory University, Atlanta, GA

3. Columbia University, New York, NY

Abstract

Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the Web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or “crawl,” the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output “completeness” (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems

Reference51 articles.

1. Scaling information extraction to large document collections;Agichtein E.;IEEE Data Eng. Bull.,2005

2. Snowball

3. Eddies

Cited by 17 articles. 订阅此论文施引文献 订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献

1. Autonomously Computable Information Extraction;Proceedings of the VLDB Endowment;2023-06

2. Predicting Document Coverage for Relation Extraction;Transactions of the Association for Computational Linguistics;2022

3. Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases;Foundations and Trends® in Databases;2021

4. Keyword query based focused Web crawler;Procedia Computer Science;2018

5. A survey of Web crawlers for information retrieval;WIREs Data Mining and Knowledge Discovery;2017-08-07

同舟云学术

1.学者识别学者识别

2.学术分析学术分析

3.人才评估人才评估

"同舟云学术"是以全球学者为主线,采集、加工和组织学术论文而形成的新型学术文献查询和分析系统,可以对全球学者进行文献检索和人才价值评估。用户可以通过关注某些学科领域的顶尖人物而持续追踪该领域的学科进展和研究前沿。经过近期的数据扩容,当前同舟云学术共收录了国内外主流学术期刊6万余种,收集的期刊论文及会议论文总量共计约1.5亿篇,并以每天添加12000余篇中外论文的速度递增。我们也可以为用户提供个性化、定制化的学者数据。欢迎来电咨询!咨询电话:010-8811{复制后删除}0370

www.globalauthorid.com

TOP

Copyright © 2019-2024 北京同舟云网络信息技术有限公司
京公网安备11010802033243号  京ICP备18003416号-3