Document Spanners

Published:2015-05-06 Issue:2 Volume:62 Page:1-51
ISSN:0004-5411
Container-title:Journal of the ACM
language:en
Short-container-title:J. ACM

Author:

Fagin Ronald¹, Kimelfeld Benny², Reiss Frederick¹, Vansummeren Stijn³

Affiliation:

1. IBM Research -- Almaden, San Jose, CA

2. IBM Research -- Almaden

3. Université Libre de Bruxelles (ULB), Brussels, Belgium

Abstract

An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this article, we develop a foundational framework where the central construct is what we call a document spanner (or just spanner for short). A spanner maps an input string into a relation over the spans (intervals specified by bounding indices) of the string. The focus of this article is on the representation of spanners. Conceptually, there are two kinds of such representations. Spanners defined in a primitive representation extract relations directly from the input string; those defined in an algebra apply algebraic operations to the primitively represented spanners. This framework is driven by SystemT, an IBM commercial product for text analysis, where the primitive representation is that of regular expressions with capture variables. We define additional types of primitive spanner representations by means of two kinds of automata that assign spans to variables. We prove that the first kind has the same expressive power as regular expressions with capture variables; the second kind expresses precisely the algebra of the regular spanners—the closure of the first kind under standard relational operators. The core spanners extend the regular ones by string-equality selection (an extension used in SystemT). We give some fundamental results on the expressiveness of regular and core spanners. As an example, we prove that regular spanners are closed under difference (and complement), but core spanners are not. Finally, we establish connections with related notions in the literature.
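As a rough illustration of the abstract's central definition, the sketch below evaluates one very simple spanner by brute force: it returns every span whose substring fully matches a given regular expression, which corresponds to a regex formula with a single capture variable surrounded by arbitrary text. The function name `extract_spans`, the brute-force strategy, and the 1-based span convention (start index inclusive, end index exclusive) are assumptions made for this example; they are not taken from the abstract and this is not how SystemT evaluates spanners.

```python
import re


def extract_spans(pattern, s):
    """Return the unary span relation of a hypothetical spanner on s.

    The spanner is "any text, then variable x bound to a match of
    `pattern`, then any text". Each span is a pair (i, j) with
    1 <= i <= j <= len(s) + 1, denoting the substring s[i-1:j-1].
    This indexing convention is an assumption of the sketch.
    """
    compiled = re.compile(pattern)
    n = len(s)
    relation = []
    # Enumerate all O(n^2) candidate spans and keep the matching ones.
    for i in range(1, n + 2):
        for j in range(i, n + 2):
            if compiled.fullmatch(s[i - 1:j - 1]):
                relation.append((i, j))
    return relation


if __name__ == "__main__":
    # Every span of "ab aab" that consists only of the letter 'a'.
    print(extract_spans("a+", "ab aab"))
    # prints [(1, 2), (4, 5), (4, 6), (5, 6)]
```

Note that the result contains overlapping spans such as (4, 5), (4, 6), and (5, 6). A scan with Python's `re.finditer` would report only non-overlapping matches, so it would not compute the full relation that the definition of a spanner calls for.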

Publisher

Association for Computing Machinery (ACM)

Subject

Artificial Intelligence, Hardware and Architecture, Information Systems, Control and Systems Engineering, Software

Link

https://dl.acm.org/doi/pdf/10.1145/2699442

References: 47 articles.

1. Maintaining knowledge about temporal intervals

2. The common pattern specification language

3. Consistent query answers in inconsistent databases

4. Graph Logics with Rational Relations and the Generalized Intersection Problem

Cited by 60 articles.

1. Automatic Extraction and Cluster Analysis of Natural Disaster Metadata Based on the Unified Metadata Framework;ISPRS International Journal of Geo-Information;2024-06-14

2. Demonstrating REmatch: A Novel RegEx Engine for Finding all Matches;Companion of the 2024 International Conference on Management of Data;2024-06-09

3. Mitigating Data Sparsity in Integrated Data through Text Conceptualization;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13

4. Generalized Core Spanner Inexpressibility via Ehrenfeucht-Fraïssé Games for FC;Proceedings of the ACM on Management of Data;2024-05-10

5. The Information Extraction Framework of Document Spanners - A Very Informal Survey;Lecture Notes in Computer Science;2024