Integrating XML data sources using approximate joins

Published:2006-03 Issue:1 Volume:31 Page:161-207
ISSN:0362-5915
Container-title:ACM Transactions on Database Systems
language:en
Short-container-title:ACM Trans. Database Syst.

Author:

Guha Sudipto¹, Jagadish H. V.², Koudas Nick³, Srivastava Divesh⁴, Yu Ting⁵

Affiliation:

1. University of Pennsylvania

2. University of Michigan

3. University of Toronto

4. AT&T Labs-Research

5. North Carolina State University

Abstract

XML is widely recognized as the data interchange standard of tomorrow because of its ability to represent data from a variety of sources. Hence, XML is likely to be the format through which data from multiple sources is integrated. In this article, we study the problem of integrating XML data sources through correlations realized as join operations. A challenging aspect of this operation is the XML document structure. Two documents might convey approximately or exactly the same information but may be quite different in structure. Consequently, an approximate match in structure, in addition to content, has to be folded into the join operation. We quantify an approximate match in structure and content for pairs of XML documents using well defined notions of distance. We show how notions of distance that have metric properties can be incorporated in a framework for joins between XML data sources and introduce the idea of reference sets to facilitate this operation. Intuitively, a reference set consists of data elements used to project the data space. We characterize what constitutes a good choice of a reference set, and we propose sampling-based algorithms to identify them. We then instantiate our join framework using the tree edit distance between a pair of trees. We next turn our attention to utilizing well known index structures to improve the performance of approximate XML join operations. We present a methodology enabling adaptation of index structures for this problem, and we instantiate it in terms of the R-tree. We demonstrate the practical utility of our solutions using large collections of real and synthetic XML data sets, varying parameters of interest, and highlighting the performance benefits of our approach.
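The abstract's idea of using a reference set with a metric distance can be illustrated with a minimal sketch. This is not the authors' algorithm: it only shows how the triangle inequality lets a join skip exact distance computations once every item has been projected onto its distances to a few reference elements. Several assumptions are made for illustration: plain strings stand in for XML trees, Levenshtein distance stands in for tree edit distance, the reference set is hand-picked rather than chosen by sampling, and the names `edit_distance` and `approximate_join` are hypothetical.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, a metric used here as a stand-in for tree edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete from a
                            curr[j - 1] + 1,            # insert into a
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def approximate_join(left, right, reference_set, threshold, dist=edit_distance):
    """Return (pairs with dist <= threshold, number of exact distance calls).

    Each item is projected onto its distances to the reference elements.
    For any reference r, |d(x, r) - d(y, r)| <= d(x, y) by the triangle
    inequality, so a pair whose projections differ by more than the
    threshold in any coordinate cannot match and is skipped.
    """
    left_proj = [[dist(x, r) for r in reference_set] for x in left]
    right_proj = [[dist(y, r) for r in reference_set] for y in right]

    matches, exact_calls = [], 0
    for (x, px), (y, py) in product(zip(left, left_proj), zip(right, right_proj)):
        lower_bound = max((abs(a - b) for a, b in zip(px, py)), default=0)
        if lower_bound > threshold:
            continue  # pruned without computing the exact distance
        exact_calls += 1
        if dist(x, y) <= threshold:
            matches.append((x, y))
    return matches, exact_calls
```

Because the filter relies only on the triangle inequality, it never discards a true match, so the result equals that of a brute-force join. How many pairs are pruned depends on the choice of reference elements, which is the question the abstract says the article addresses.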

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/1132863.1132868

References: 37 articles.

1. Apostolico A. and Galil Z. 1992. Pattern Matching Algorithms. Oxford University Press.

2. The R*-tree: an efficient and robust access method for points and rectangles

3. Efficient processing of spatial joins using R-trees

Cited by 17 articles.

1. Effective and efficient retrieval of structured entities;Proceedings of the VLDB Endowment;2020-02

2. An Efficient Classification of Fuzzy XML Documents Based on Kernel ELM;Information Systems Frontiers;2019-12-05

3. Fast Similarity Search for Graphs by Edit Distance;Cybernetics and Systems Analysis;2019-11

5. A methodology for measuring structure similarity of fuzzy XML documents;Computing;2017-04-10