Affiliation:
1. State Key Laboratory of Computer Science , Institute of Software, Chinese Academy of Sciences
Abstract
Abstract
The presence of a schema for eXtensible Markup Language (XML) documents has numerous advantages. Unfortunately, many XML documents in practice are not accompanied by a (valid) schema. Therefore, it is essential to devise algorithms to infer schemas from XML documents, where the fundamental task is learning regular expressions. In this paper, we focus on the learning of disjunctive multiplicity expressions (DMEs), a subclass of regular expressions that are particularly suitable to specify unordered models and have been used as the foundation of the schemas for unordered XML. Previous work for learning DME lacks inference algorithms that support positive and negative examples. Further, presently there has been no algorithm can learn DMEs extended with numeric occurrences. We address these challenges in the present paper and first propose a novel algorithm to learn DMEs from positive and negative examples by using genetic algorithms and parallel techniques. Then we extend DMEs to disjunctive generalized multiplicity expressions (DGMEs), which allow numeric occurrences and develop an algorithm to learn DGMEs from positive and negative examples. Finally, experimental results show that with only positive examples, our algorithm can generate a DME with an acceptable learning time, which can accept all positive examples, and when given both positive and negative examples, we can learn DMEs or DGMEs with high accuracy.
Funder
National Natural Science Foundation of China
Publisher
Oxford University Press (OUP)
Cited by
1 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献