Divide & conquer-based inclusion dependency discovery

Published:2015-02 Issue:7 Volume:8 Page:774-785
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Papenbrock Thorsten¹,Kruse Sebastian¹,Quiané-Ruiz Jorge-Arnulfo²,Naumann Felix¹

Affiliation:

1. Hasso Plattner Institute (HPI), Potsdam, Germany

2. Qatar Computing Research Institute (QCRI), Doha, Qatar

Abstract

The discovery of all inclusion dependencies (INDs) in a dataset is an important part of any data profiling effort. Apart from detecting foreign key relationships, INDs can support data integration, query optimization, integrity checking, and schema (re-)design. However, detecting INDs becomes harder as datasets grow in both the number of tuples and the number of attributes. To address this, we propose Binder, an IND detection system that is capable of detecting both unary and n-ary INDs. It is based on a divide & conquer approach, which allows it to handle very large datasets, an important property in the face of the ever-increasing size of today's data. In contrast to most related work, we neither rely on existing database functionality nor assume that inspected datasets fit into main memory. This renders Binder an efficient and scalable competitor. Our exhaustive experimental evaluation shows the superiority of Binder over the state of the art in both unary (Spider) and n-ary (Mind) IND discovery. Binder is up to 26x faster than Spider and more than 2500x faster than Mind.
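
To illustrate the idea of a unary IND and of a divide & conquer check, here is a minimal Python sketch. It is not Binder's actual algorithm, which the abstract does not spell out; the function name `discover_unary_inds`, the `num_buckets` parameter, and the in-memory hash bucketing are assumptions made purely for illustration. The sketch splits the values of all attributes into hash buckets (divide) and then validates candidate INDs one bucket at a time (conquer), so only one bucket would need to be resident at once in a disk-backed variant.

```python
from collections import defaultdict
from itertools import permutations


def discover_unary_inds(columns, num_buckets=4):
    """Return all unary INDs (dependent, referenced) among the given columns.

    `columns` maps an attribute name to an iterable of its values.
    Hypothetical helper for illustration; not the Binder implementation.
    """
    # Divide: route every value to a bucket by its hash, so that equal
    # values from different attributes always land in the same bucket.
    buckets = [defaultdict(set) for _ in range(num_buckets)]
    for attr, values in columns.items():
        for value in values:
            if value is None:  # NULLs are ignored in this sketch
                continue
            buckets[hash(value) % num_buckets][attr].add(value)

    # Start from every ordered pair of distinct attributes as a candidate.
    candidates = set(permutations(columns, 2))

    # Conquer: validate the candidates against one bucket at a time and
    # drop a candidate as soon as one bucket violates the inclusion.
    for bucket in buckets:
        for dep, ref in list(candidates):
            if not bucket.get(dep, set()) <= bucket.get(ref, set()):
                candidates.discard((dep, ref))

    return sorted(candidates)


# Small made-up example: order rows reference customer ids.
example = {
    "orders.customer_id": [1, 2, 2, 3],
    "customers.id": [1, 2, 3, 4],
    "customers.name": ["a", "b", "c", "d"],
}
inds = discover_unary_inds(example)
```

In this example `inds` contains the single pair `("orders.customer_id", "customers.id")`, the kind of inclusion that typically hints at a foreign key. The result does not depend on `num_buckets`; the bucket count only controls how finely the work is split.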

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/2752939.2752946

Cited by 42 articles.

1. An efficient approach for discovering Graph Entity Dependencies (GEDs);Information Systems;2024-11

2. Determining the Largest Overlap between Tables;Proceedings of the ACM on Management of Data;2024-03-12

3. PLM data transformation: A mesoscopic scale perspective and an industrial case study;Computers in Industry;2024-02

4. Enhancing AI System Privacy: An Automatic Tool for Achieving GDPR Compliance in NoSQL Databases;Computers, Materials & Continua;2024

5. Can Large Language Models Predict Data Correlations from Column Names?;Proceedings of the VLDB Endowment;2023-09