Constant-Delay Enumeration for Nondeterministic Document Spanners

Published: 2021-04, Issue: 1, Volume: 46, Pages: 1-30
ISSN: 0362-5915
Container-title: ACM Transactions on Database Systems
Language: en
Short-container-title: ACM Trans. Database Syst.

Author:

Amarilli Antoine¹, Bourhis Pierre², Mengel Stefan³, Niewerth Matthias⁴

Affiliation:

1. LTCI and Télécom Paris and Institut polytechnique de Paris

2. CRIStAL and CNRS UMR 9189 and Inria Lille

3. CRIL, CNRS & Univ Artois

4. University of Bayreuth

Abstract

We consider the information extraction framework known as document spanners and study the problem of efficiently computing the results of the extraction from an input document, where the extraction task is described as a sequential variable-set automaton (VA). We pose this problem in the setting of enumeration algorithms, where we can first run a preprocessing phase and must then produce the results with a small delay between any two consecutive results. Our goal is to have an algorithm that is tractable in combined complexity, i.e., in the sizes of the input document and the VA, while ensuring the best possible data complexity bounds in the input document size, i.e., constant delay in the document size. Several recent works at PODS’18 proposed such algorithms but with linear delay in the document size or with an exponential dependency on the size of the (generally nondeterministic) input VA. In particular, Florenzano et al. suggest that our desired runtime guarantees cannot be met for general sequential VAs. We refute this and show that, given a nondeterministic sequential VA and an input document, we can enumerate the mappings of the VA on the document with the following bounds: the preprocessing is linear in the document size and polynomial in the size of the VA, and the delay is independent of the document and polynomial in the size of the VA. The resulting algorithm thus achieves tractability in combined complexity and the best possible data complexity bounds. Moreover, it is rather easy to describe, particularly for the restricted case of so-called extended VAs. Finally, we evaluate our algorithm empirically using a prototype implementation.

Funder

ANR

DFG

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3436487

References: 32 articles.

1. Enumeration on Trees with Tractable Combined Complexity and Efficient Updates

Cited by 16 articles.

1. Enumerating m-Length Walks in Directed Graphs with Constant Delay;Lecture Notes in Computer Science;2024

2. The Information Extraction Framework of Document Spanners - A Very Informal Survey;Lecture Notes in Computer Science;2024

3. Enumerating grammar-based extractions;Discrete Applied Mathematics;2023-12

4. Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries;Logical Methods in Computer Science;2023-08-09

5. The Complexity of Aggregates over Extractions by Regular Expressions;Logical Methods in Computer Science;2023-08-09