Affiliation:
1. Hasselt University and Transnational University of Limburg
2. Université Libre de Bruxelles
Abstract
Inferring an appropriate DTD or XML Schema Definition (XSD) for a given collection of XML documents essentially reduces to learning deterministic regular expressions from sets of positive example words. Unfortunately, there is no algorithm capable of learning the complete class of deterministic regular expressions from positive examples only, as we will show. The regular expressions occurring in practical DTDs and XSDs, however, are such that every alphabet symbol occurs only a small number of times. As such, in practice it suffices to learn the subclass of deterministic regular expressions in which each alphabet symbol occurs at most k times, for some small k. We refer to such expressions as k-occurrence regular expressions (k-OREs for short). Motivated by this observation, we provide a probabilistic algorithm that learns k-OREs for increasing values of k, and selects the deterministic one that best describes the sample based on a Minimum Description Length argument. The effectiveness of the method is empirically validated both on real world and synthetic data. Furthermore, the method is shown to be conservative over the simpler classes of expressions considered in previous work.
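To make the k-ORE definition concrete, the following is a minimal sketch, not the paper's learning algorithm. It only counts how often each alphabet symbol occurs in an expression and reports the smallest k for which the expression is a k-ORE. The function names `occurrence_bound` and `is_k_ore`, and the assumption that symbols are identifier-like element names written in DTD-style syntax, are illustrative choices and do not come from the paper. The sketch does not check determinism.

```python
import re
from collections import Counter


def occurrence_bound(expr: str) -> int:
    """Return the smallest k such that every symbol occurs at most k times.

    Assumption: alphabet symbols are identifier-like element names;
    operators such as , | * + ? ( ) are ignored by the tokenizer.
    """
    symbols = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", expr)
    counts = Counter(symbols)
    # An expression without symbols trivially has bound 0.
    return max(counts.values(), default=0)


def is_k_ore(expr: str, k: int) -> bool:
    """Check the occurrence condition only (determinism is not checked)."""
    return occurrence_bound(expr) <= k


if __name__ == "__main__":
    # Every symbol occurs once, so the occurrence bound is 1.
    print(occurrence_bound("title, author+, (chapter | section)*"))
    # The symbol 'a' occurs twice, so the occurrence bound is 2.
    print(occurrence_bound("a, (b | c)*, a?"))
```

Under these assumptions, an expression such as `a, (b | c)*, a?` satisfies the occurrence condition for k = 2 but not for k = 1, because `a` appears twice.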
Funder
Seventh Framework Programme
Publisher
Association for Computing Machinery (ACM)
Subject
Computer Networks and Communications
Cited by
53 articles.
1. ReCG: Bottom-up JSON Schema Discovery Using a Repetitive Cluster-and-Generalize Framework;Proceedings of the VLDB Endowment;2024-07
2. InfeRE: Step-by-Step Regex Generation via Chain of Inference;2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE);2023-09-11
3. Schema inference for multi-model data;Proceedings of the 25th International Conference on Model Driven Engineering Languages and Systems;2022-10-23
4. Self-Adapting Design and Maintenance of Multi-Model Databases;International Database Engineered Applications Symposium;2022-08-22
5. A universal approach for multi-model schema inference;Journal of Big Data;2022-08-11