Affiliation:
1. University of Modena and Reggio Emilia, Italy
2. University of Potsdam, Germany
Abstract
Entity Resolution (ER) aims to identify and merge records that refer to the same real-world entity. ER is typically employed as an expensive cleaning step on the entire data before consuming it. Yet, determining which entities are useful once cleaned depends solely on the user's application, which may need only a fraction of them. For instance, when dealing with Web data, we would like to be able to filter the entities of interest gathered from multiple sources without cleaning the entire, continuously-growing data. Similarly, when querying data lakes, we want to transform data on-demand and return the results in a timely manner---a fundamental requirement of ELT (
Extract-Load-Transform
) pipelines.
We propose
BrewER
, a framework to evaluate SQL SP queries on dirty data while progressively returning results as if they were issued on cleaned data.
BrewER
tries to focus the cleaning effort on one entity at a time, following an ORDER BY predicate. Thus, it inherently supports
top-k
and stop-and-resume execution. For a wide range of applications, a significant amount of resources can be saved. We exhaustively evaluate and show the efficacy of
BrewER
on four real-world datasets.
Publisher
Association for Computing Machinery (ACM)
Subject
General Earth and Planetary Sciences,Water Science and Technology,Geography, Planning and Development
Reference46 articles.
1. Altosight. Accessed on 2022-03-11. Altosight Official Website. https://altosight.com Altosight. Accessed on 2022-03-11. Altosight Official Website. https://altosight.com
2. QDA: A Query-Driven Approach to Entity Resolution
3. QuERy
4. Neural Networks for Entity Matching: A Survey
5. Swoosh: a generic approach to entity resolution
Cited by
10 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献