Table union search on open data

Published:2018-03 Issue:7 Volume:11 Page:813-825
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Nargesian Fatemeh¹, Zhu Erkang¹, Pu Ken Q.², Miller Renée J.¹

Affiliation:

1. University of Toronto

2. University of Ontario Institute of Technology (UOIT)

Abstract

We define the table union search problem and present a probabilistic solution for finding tables that are unionable with a query table within massive repositories. Two tables are unionable if they share attributes from the same domain. Our solution formalizes three statistical models that describe how unionable attributes are generated from set domains, semantic domains with values from an ontology, and natural language domains. We propose a data-driven approach that automatically determines the best model to use for each pair of attributes. Through a distribution-aware algorithm, we are able to find the optimal number of attributes in two tables that can be unioned. To evaluate accuracy, we created and open-sourced a benchmark of Open Data tables. We show that our table union search outperforms in speed and accuracy existing algorithms for finding related tables and scales to provide efficient search over Open Data repositories containing more than one million attributes.
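To make the set-domain case concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not spell out the statistical model, so the sketch assumes one plausible reading: two attributes are treated as samples from a shared domain, and the overlap between them is scored with a hypergeometric test. The function name `set_unionability_sketch` and the `domain_size` parameter are hypothetical and introduced only for illustration.

```python
from math import comb


def set_unionability_sketch(a_values, b_values, domain_size):
    """Score how surprising the overlap of two attributes is.

    Assumption (not stated in the abstract): the values of B are modeled
    as a random draw, without replacement, from a domain of `domain_size`
    distinct values that contains all values of A. The score is the
    hypergeometric CDF evaluated at the observed overlap, so a larger
    overlap yields a score closer to 1.
    """
    a, b = set(a_values), set(b_values)
    if domain_size < len(a | b):
        raise ValueError("domain_size must cover every distinct value in A and B")

    overlap = len(a & b)
    n_a, n_b = len(a), len(b)

    # Number of ways to draw |B| values such that at most `overlap`
    # of them fall inside A, divided by all ways to draw |B| values.
    favourable = sum(
        comb(n_a, s) * comb(domain_size - n_a, n_b - s)
        for s in range(overlap + 1)
    )
    return favourable / comb(domain_size, n_b)


if __name__ == "__main__":
    cities_a = ["Toronto", "Ottawa", "Montreal"]
    cities_b = ["Toronto", "Ottawa", "Calgary"]
    print(set_unionability_sketch(cities_a, cities_b, domain_size=50))
```

Under this assumption, attribute pairs with scores near 1 overlap far more than chance would predict and are therefore candidates for unioning. The semantic and natural language domains mentioned in the abstract would need different measures that this sketch does not cover.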

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3192965.3192973

Cited by 123 articles.

1. Pb-Hash: Partitioned b-bit Hashing;Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval;2024-08-02

2. A Large Scale Test Corpus for Semantic Table Search;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10

3. A multi-start simulated annealing strategy for Data Lake Organization Problem;Applied Soft Computing;2024-07

4. Enriching Relations with Additional Attributes for ER;Proceedings of the VLDB Endowment;2024-07

5. Causal Dataset Discovery with Large Language Models;Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics;2024-06-14