Abstract
We define the table union search problem and present a probabilistic solution for finding tables that are unionable with a query table within massive repositories. Two tables are unionable if they share attributes from the same domain. Our solution formalizes three statistical models that describe how unionable attributes are generated from set domains, semantic domains with values from an ontology, and natural language domains. We propose a data-driven approach that automatically determines the best model to use for each pair of attributes. Through a distribution-aware algorithm, we find the optimal number of attributes in two tables that can be unioned. To evaluate accuracy, we created and open-sourced a benchmark of Open Data tables. We show that our table union search outperforms existing algorithms for finding related tables in both speed and accuracy, and that it scales to provide efficient search over Open Data repositories containing more than one million attributes.
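To make the set-domain notion of unionability concrete, the following is a minimal sketch and not the paper's implementation. It assumes, for illustration only, that the values of one attribute can be treated as a random sample drawn without replacement from a shared domain of known size, and it asks how likely an overlap at least as large as the observed one would be by chance. The function name `overlap_tail_probability` and the `domain_size` parameter are hypothetical; the abstract does not specify the form of the statistical models.

```python
from math import comb


def overlap_tail_probability(a: set, b: set, domain_size: int) -> float:
    """Chance of seeing at least the observed overlap between a and b.

    Assumption (illustrative only): b is a uniformly random subset of
    size len(b) drawn from a domain of `domain_size` distinct values
    that also contains every value of a. Under that assumption the
    overlap size follows a hypergeometric distribution.
    """
    na, nb = len(a), len(b)
    if max(na, nb) > domain_size:
        raise ValueError("domain_size must be at least as large as each attribute")

    observed = len(a & b)       # number of values the two attributes share
    total = comb(domain_size, nb)  # number of possible subsets of size nb

    # Sum the hypergeometric probabilities for overlaps >= observed.
    tail = sum(
        comb(na, k) * comb(domain_size - na, nb - k)
        for k in range(observed, min(na, nb) + 1)
    )
    return tail / total


if __name__ == "__main__":
    cities_a = {"Toronto", "Ottawa", "Montreal"}
    cities_b = {"Toronto", "Ottawa", "Montreal"}
    print(overlap_tail_probability(cities_a, cities_b, domain_size=10))
```

A small tail probability means the overlap is unlikely to be accidental, which is weak evidence that the two attributes come from the same domain. Two disjoint attributes give a probability of 1, since every possible overlap is at least zero.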
Subject
General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development
Cited by
123 articles.
1. Pb-Hash: Partitioned b-bit Hashing;Proceedings of the 2024 ACM SIGIR International Conference on Theory of Information Retrieval;2024-08-02
2. A Large Scale Test Corpus for Semantic Table Search;Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval;2024-07-10
3. A multi-start simulated annealing strategy for Data Lake Organization Problem;Applied Soft Computing;2024-07
4. Enriching Relations with Additional Attributes for ER;Proceedings of the VLDB Endowment;2024-07
5. Causal Dataset Discovery with Large Language Models;Proceedings of the 2024 Workshop on Human-In-the-Loop Data Analytics;2024-06-14