Searching Data Lakes for Nested and Joined Data

Published: 2024-07 Issue: 11 Volume: 17 Page: 3346-3359
ISSN: 2150-8097
Container-title: Proceedings of the VLDB Endowment
Language: en
Short-container-title: Proc. VLDB Endow.

Author:

Zhang Yi¹, Chen Peter Baile², Ives Zachary G.³

Affiliation:

1. AWS AI Labs

2. MIT

3. University of Pennsylvania

Abstract

Exploratory data science is driving new platforms that assist data scientists with everyday tasks, such as integration and wrangling, to assemble training datasets. Such tools take scientists' work-in-progress data as a search object (table or JSON) and find relevant supplementary data from an organizational data lake, which can be unioned or joined with the current data. Existing data lake search tools find single, relational tables to match or join with a search object. Yet many data science applications revolve around hierarchical data, which can only be matched by creating views that simultaneously join and transform several tables in the data lake. In this paper, we extend the Juneau data lake search system [46] for this broader class of matches at scale. Our contribution is a general framework for efficiently merging ranked results to match hierarchical data, leveraging novel techniques for indexing and sketching, and incorporating existing single-table search techniques and ranking functions. We experimentally validate our methods' benefits and broad applicability using real data from data science computational notebooks. Our results indicate that, with different ranking functions, our approach can return the optimal set of views up to 4.8x faster and 43% more related compared to heuristics, and increase the data domain coverage by up to 28%. In a case study to show the utility of our results to data science downstream tasks, we reduce regression error by up to 6.6%, and improve classification accuracy by up to 19.5%.

Publisher

Association for Computing Machinery (ACM)

Link

https://dl.acm.org/doi/pdf/10.14778/3681954.3682005

References (48 articles)

1. Rakesh Agrawal, Ramakrishnan Srikant, et al. 1994. Fast algorithms for mining association rules. In Proceedings of 20th VLDB Conference, Vol. 1215. Citeseer, 487-499.

2. Dataset Discovery in Data Lakes

3. Google Dataset Search: Building a search engine for datasets in an open Web ecosystem

4. Ten years of webtables

5. WebTables