Keeping the Data Lake in Form

Published: 2020-06-26 Issue: 3 Volume: 38 Page: 1-30
ISSN: 1046-8188
Container-title: ACM Transactions on Information Systems
Language: en
Short-container-title: ACM Trans. Inf. Syst.

Author:

Alserafi Ayman¹, Abelló Alberto², Romero Oscar², Calders Toon³

Affiliation:

1. Universitat Politècnica de Catalunya and Université Libre de Bruxelles, Bruxelles, Belgium

2. Universitat Politècnica de Catalunya, Barcelona, Catalunya, Spain

3. Université Libre de Bruxelles and Universiteit Antwerpen, Antwerpen, Belgium

Abstract

Data lakes (DLs) are large repositories of raw datasets from disparate sources. As more datasets are ingested into a DL, there is an increasing need for efficient techniques to profile them and to detect the relationships among their schemata, commonly known as holistic schema matching. Schema matching detects similarity between the information stored in the datasets to support information discovery and retrieval. At the volume of state-of-the-art DLs, this is currently computationally expensive. To handle this challenge, we propose a novel early-pruning approach to improve efficiency: we collect different types of content metadata and schema metadata about the datasets, and then use this metadata in early-pruning steps to pre-filter the schema matching comparisons. This involves computing proximities between datasets based on their metadata, discovering their relationships based on overall proximities, and proposing similar dataset pairs for schema matching. We improve the effectiveness of this task by introducing a supervised mining approach for detecting the similar datasets that are proposed for further schema matching. We conduct extensive experiments on a real-world DL that demonstrate the success of our approach in detecting similar datasets for schema matching, with recall rates of more than 85% and efficiency improvements above 70%. We empirically show the computational cost savings in space and time obtained by applying our approach in comparison to instance-based schema matching techniques.
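
The early-pruning idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The dataset names, the three metadata features, the proximity function (one minus the mean absolute difference of features scaled to [0, 1]) and the fixed threshold are all assumptions chosen for illustration. In particular, the abstract describes a supervised mining approach for deciding which pairs are similar, and a fixed threshold stands in for that learned model here.

```python
from itertools import combinations

# Hypothetical dataset-level metadata profiles (assumed features, scaled to [0, 1]).
profiles = {
    "sales_2019": {"frac_numeric": 0.80, "frac_nominal": 0.20, "avg_distinct_ratio": 0.35},
    "sales_2020": {"frac_numeric": 0.75, "frac_nominal": 0.25, "avg_distinct_ratio": 0.40},
    "tweets":     {"frac_numeric": 0.10, "frac_nominal": 0.90, "avg_distinct_ratio": 0.95},
}


def proximity(a, b):
    """Return 1 minus the mean absolute difference over the shared features."""
    keys = sorted(a.keys() & b.keys())
    if not keys:
        return 0.0
    return 1.0 - sum(abs(a[k] - b[k]) for k in keys) / len(keys)


def candidate_pairs(profiles, threshold=0.9):
    """Keep only the dataset pairs whose metadata proximity reaches the threshold.

    The threshold is an assumed stand-in for a trained classifier.
    """
    kept = []
    for left, right in combinations(sorted(profiles), 2):
        score = proximity(profiles[left], profiles[right])
        if score >= threshold:
            kept.append((left, right, round(score, 3)))
    return kept


pairs = candidate_pairs(profiles)
total = len(list(combinations(profiles, 2)))
print(f"kept {len(pairs)} of {total} pairs for schema matching: {pairs}")
```

Only the pairs that survive this cheap metadata comparison would be passed on to the more expensive schema matching step, which is where the efficiency gain would come from.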

Funder

Erasmus+

Publisher

Association for Computing Machinery (ACM)

Subject

Computer Science Applications; General Business, Management and Accounting; Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3388870
