Learning Deterministic Regular Expressions for the Inference of Schemas from XML Data

Published: 2010-09 Issue: 4 Volume: 4 Page: 1-32
ISSN: 1559-1131
Container-title: ACM Transactions on the Web
Language: en
Short-container-title: ACM Trans. Web

Author:

Bex Geert Jan¹, Gelade Wouter¹, Neven Frank¹, Vansummeren Stijn²

Affiliation:

1. Hasselt University and Transnational University of Limburg

2. Université Libre de Bruxelles

Abstract

Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real-world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
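
The k-ORE condition stated in the abstract (each alphabet symbol occurs at most k times) can be illustrated with a minimal Python sketch. This is not the paper's learning algorithm: it only counts symbol occurrences in an expression that is already given. The function names `occurrence_bound` and `is_k_ore` are hypothetical, and the DTD-style content-model syntax with XML-like element names is an assumption made for the example. The sketch does not check determinism.

```python
import re
from collections import Counter


def occurrence_bound(expression: str) -> int:
    """Return the smallest k such that no symbol occurs more than k times.

    Assumption: symbols are XML-style element names; operators such as
    ',', '|', '*', '+', '?' and parentheses are ignored.
    """
    symbols = re.findall(r"[A-Za-z_][A-Za-z0-9_.\-]*", expression)
    counts = Counter(symbols)
    return max(counts.values(), default=0)


def is_k_ore(expression: str, k: int) -> bool:
    """Check only the occurrence bound; determinism is not tested here."""
    return occurrence_bound(expression) <= k


if __name__ == "__main__":
    # Every element name occurs once, so the bound is 1.
    print(occurrence_bound("title, author+, (chapter | section)*"))  # 1
    # The symbol 'a' occurs twice, so the bound is 2.
    print(occurrence_bound("a, (b | c)*, a?"))  # 2
```

Under these assumptions, an expression with bound 1 uses every element name at most once, and larger bounds correspond to the increasing values of k that the abstract mentions.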

Funder

Seventh Framework Programme

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Networks and Communications

Link

https://dl.acm.org/doi/pdf/10.1145/1841909.1841911

References: 52 articles.

1. Inductive Inference: Theory and Methods

2. Studying the XML Web: Gathering Statistics from an XML Sample

3. XPath satisfiability in the presence of DTDs

Cited by 53 articles.

1. ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework;Proceedings of the VLDB Endowment;2024-07

2. InfeRE: Step-by-Step Regex Generation via Chain of Inference;2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE);2023-09-11

3. Schema inference for multi-model data;Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems;2022-10-23

4. Self-Adapting Design and Maintenance of Multi-Model Databases;International Database Engineered Applications Symposium;2022-08-22

5. A universal approach for multi-model schema inference;Journal of Big Data;2022-08-11