Efficient Enumeration Algorithms for Regular Document Spanners

Published:2020-03-03 Issue:1 Volume:45 Page:1-42
ISSN:0362-5915
Container-title:ACM Transactions on Database Systems
language:en
Short-container-title:ACM Trans. Database Syst.

Author:

Florenzano Fernando¹, Riveros Cristian¹, Ugarte Martín², Vansummeren Stijn³, Vrgoč Domagoj¹

Affiliation:

1. Pontificia Universidad Católica de Chile and IMFD Chile, Macul, Santiago, Chile

2. Université Libre de Bruxelles and IMFD Chile, Macul, Santiago, Chile

3. Université Libre de Bruxelles, Brussels, Belgium

Abstract

Regular expressions and automata models with capture variables are core tools in rule-based information extraction. These formalisms, also called regular document spanners, use regular languages to locate the data that a user wants to extract from a text document and then store this data into variables. Since document spanners can easily generate large outputs, it is important to have efficient evaluation algorithms that can generate the extracted data in quick succession, and with relatively little precomputation time. Toward this goal, we present a practical evaluation algorithm that allows output-linear delay enumeration of a spanner’s result after a precomputation phase that is linear in the document. Although the algorithm assumes that the spanner is specified in a syntactic variant of variable-set automata, we also study how it can be applied when the spanner is specified by general variable-set automata, regex formulas, or spanner algebras. Finally, we study the related problem of counting the number of outputs of a document spanner and provide a fine-grained analysis of the classes of document spanners that support efficient enumeration of their results.
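
To make the notion of a spanner's output concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm from the paper: it tests every candidate span with a quadratic loop and gives no delay guarantee. The function name `enumerate_spans`, the single implicit capture variable, and the 0-based half-open span convention are assumptions made for illustration only.

```python
import re
from typing import Iterator, Tuple

# Hypothetical convention: a span is a half-open interval [start, end)
# over 0-based document positions.
Span = Tuple[int, int]


def enumerate_spans(document: str, inner_pattern: str) -> Iterator[Span]:
    """Yield every span of `document` whose content fully matches `inner_pattern`.

    Naive illustration only: it checks all O(n^2) candidate spans one by one,
    so it has neither linear precomputation nor output-linear delay.
    """
    compiled = re.compile(inner_pattern)
    n = len(document)
    for start in range(n + 1):
        for end in range(start, n + 1):
            # fullmatch with pos/endpos restricts the match to document[start:end].
            if compiled.fullmatch(document, start, end):
                yield (start, end)


if __name__ == "__main__":
    # Extract every span consisting only of the letter "a".
    for span in enumerate_spans("aab", "a+"):
        print(span, repr("aab"[span[0]:span[1]]))
```

On a document of n repeated `a` characters this sketch yields n(n+1)/2 spans, which shows how the output can grow much larger than the input and why the abstract emphasizes producing results one after another with bounded delay.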

Funder

Fondo Nacional de Desarrollo Científico y Tecnológico

Innoviris

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3351451

Cited by 14 articles.

1. Demonstrating REmatch: A Novel RegEx Engine for Finding all Matches;Companion of the 2024 International Conference on Management of Data;2024-06-09

2. Generalized Core Spanner Inexpressibility via Ehrenfeucht-Fraïssé Games for FC;Proceedings of the ACM on Management of Data;2024-05-10

3. The Information Extraction Framework of Document Spanners - A Very Informal Survey;Lecture Notes in Computer Science;2024

4. Modeling Regex Operators for Solving Regex Crossword Puzzles;Dependable Software Engineering. Theories, Tools, and Applications;2023-12-15

5. REmatch: A Novel Regex Engine for Finding All Matches;Proceedings of the VLDB Endowment;2023-07