Discovering related data at scale

Published:2021-04 Issue:8 Volume:14 Page:1392-1400
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Bharadwaj Sagar¹, Gupta Praveen¹, Bhagwan Ranjita¹, Guha Saikat¹

Affiliation:

1. Microsoft Research

Abstract

Analysts frequently require data from multiple sources for their tasks, but finding these sources is challenging in exabyte-scale data lakes. In this paper, we address this problem for our enterprise's data lake by using machine learning to identify related data sources. Leveraging queries made to the data lake over a month, we build a relevance model that determines whether two columns across two data streams are related. We then use the model to find relations at scale across tens of millions of column pairs and construct a data relationship graph in a scalable fashion, processing a data lake that holds 4.5 petabytes of data in approximately 80 minutes. Using manually labeled datasets as ground truth, we show that our techniques improve on state-of-the-art methods by at least 23%.

Publisher

VLDB Endowment

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3457390.3457403

Cited by 8 articles.

1. A multi-start simulated annealing strategy for Data Lake Organization Problem;Applied Soft Computing;2024-07

2. Fainder: A Fast and Accurate Index for Distribution-Aware Dataset Search;Proceedings of the VLDB Endowment;2024-07

3. AutoFeat: Transitive Feature Discovery over Join Paths;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13

4. R2D2: Reducing Redundancy and Duplication in Data Lakes;Proceedings of the ACM on Management of Data;2023-12-08

5. Data Lakes: A Survey of Functions and Systems;IEEE Transactions on Knowledge and Data Engineering;2023-12-01