Affiliation:
1. AWS AI Labs
2. MIT
3. University of Pennsylvania
Abstract
Exploratory data science is driving new platforms that assist data scientists with everyday tasks, such as integration and wrangling, to assemble training datasets. Such tools take scientists' work-in-progress data as a search object (table or JSON) and find relevant supplementary data from an organizational data lake, which can be unioned or joined with the current data. Existing data lake search tools find single relational tables to match or join with a search object. Yet many data science applications revolve around hierarchical data, which can only be matched by creating views that simultaneously join and transform several tables in the data lake. In this paper, we extend the Juneau data lake search system [46] to support this broader class of matches at scale. Our contribution is a general framework for efficiently merging ranked results to match hierarchical data, leveraging novel techniques for indexing and sketching, and incorporating existing single-table search techniques and ranking functions. We experimentally validate our methods' benefits and broad applicability using real data from data science computational notebooks. Our results indicate that, with different ranking functions, our approach can return the optimal set of views up to 4.8x faster than heuristics, with views that are up to 43% more related, and can increase data domain coverage by up to 28%. In a case study showing the utility of our results for downstream data science tasks, we reduce regression error by up to 6.6% and improve classification accuracy by up to 19.5%.
Publisher
Association for Computing Machinery (ACM)
References (48 articles)
1. Rakesh Agrawal, Ramakrishnan Srikant, et al. 1994. Fast algorithms for mining association rules. In Proceedings of the 20th VLDB Conference, Vol. 1215. Citeseer, 487–499.
2. Dataset Discovery in Data Lakes
3. Google Dataset Search: Building a search engine for datasets in an open Web ecosystem
4. Ten years of webtables
5. WebTables