Sparkly: A Simple yet Surprisingly Strong TF/IDF Blocker for Entity Matching

Published:2023-02 Issue:6 Volume:16 Page:1507-1519
ISSN:2150-8097
Container-title:Proceedings of the VLDB Endowment
language:en
Short-container-title:Proc. VLDB Endow.

Author:

Derek Paulsen¹, Yash Govind², AnHai Doan¹

Affiliation:

1. University of Wisconsin-Madison and Informatica Inc.

2. Apple Inc.

Abstract

Blocking is a major task in entity matching. Numerous blocking solutions have been developed, but as far as we can tell, blocking using the well-known tf/idf measure has received virtually no attention. Yet, when we experimented with tf/idf blocking using Lucene, we found it did quite well. So in this paper we examine tf/idf blocking in depth. We develop Sparkly, which uses Lucene to perform top-k tf/idf blocking in a distributed share-nothing fashion on a Spark cluster. We develop techniques to identify good attributes and tokenizers that can be used to block on, making Sparkly completely automatic. We perform extensive experiments showing that Sparkly outperforms 8 state-of-the-art blockers. Finally, we provide an in-depth analysis of Sparkly's performance, regarding both recall/output size and runtime. Our findings suggest that (a) tf/idf blocking needs more attention, (b) Sparkly forms a strong baseline that future blocking work should compare against, and (c) future blocking work should seriously consider top-k blocking, which helps improve recall, and a distributed share-nothing architecture, which helps improve scalability, predictability, and extensibility.
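As a rough illustration of the top-k tf/idf blocking idea described in the abstract, here is a minimal single-machine sketch in plain Python. It is not Sparkly's implementation: the whitespace tokenizer, the idf formula, the dot-product scoring, and the names `tokenize`, `build_index`, and `top_k_block` are all assumptions made for this example, and it uses neither Lucene nor Spark. For each query record it keeps only the k highest-scoring indexed records as candidate matches.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an illustrative choice, not Sparkly's)."""
    return text.lower().split()


def build_index(records):
    """Compute per-record token counts and idf weights for the indexed table."""
    docs = {rid: Counter(tokenize(text)) for rid, text in records.items()}
    n = len(docs)
    # Document frequency: number of records that contain each token.
    df = Counter(tok for counts in docs.values() for tok in counts)
    # Assumed idf variant; real search engines use their own formulas.
    idf = {tok: math.log(1 + n / d) for tok, d in df.items()}
    return docs, idf


def top_k_block(query_text, docs, idf, k):
    """Return ids of the k indexed records with the highest tf/idf score."""
    q = Counter(tokenize(query_text))
    scores = {}
    for rid, counts in docs.items():
        # Dot product of the two tf/idf vectors over shared tokens.
        s = sum(q[t] * counts[t] * idf[t] ** 2 for t in q if t in counts)
        if s > 0:
            scores[rid] = s
    # Sort by descending score, breaking ties by record id.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [rid for rid, _ in ranked[:k]]


# Hypothetical toy table to block against.
table_b = {
    "b1": "apple iphone 13 pro 128gb",
    "b2": "samsung galaxy s21 ultra",
    "b3": "apple iphone 13 case",
    "b4": "google pixel 6",
}
docs, idf = build_index(table_b)
candidates = top_k_block("iPhone 13 Pro Apple", docs, idf, k=2)
```

Because the output is capped at k candidates per query record, the total output size is bounded by k times the number of query records, whatever the score distribution looks like.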

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

Link

https://dl.acm.org/doi/pdf/10.14778/3583140.3583163

References: 40 articles.

1. Nils Barlaug and Jon Atle Gulla. 2020. Neural networks for entity matching. arXiv preprint arXiv:2010.11075 (2020).

2. Scalable Blocking for Very Large Databases

3. Andrei Z. Broder, Michael Herscovici, and Jason Zien. 2003. Efficient query evaluation using a two-level retrieval process. In Proc. of the 12th ACM Conf. on Information and Knowledge Management.

4. Robust and efficient fuzzy match for online data cleaning

5. Peter Christen. 2011. A survey of indexing techniques for scalable record linkage and deduplication. IEEE Transactions on Knowledge and Data Engineering 24, 9 (2011), 1537-1555.

Cited by 8 articles.

2. Personalized new media marketing recommendation system based on TF-IDF algorithm optimizing LSTM-TC model;Service Oriented Computing and Applications;2024-08-06

3. Open benchmark for filtering techniques in entity resolution;The VLDB Journal;2024-07-09

4. Fairness-Aware Data Preparation for Entity Matching;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13

5. MultiEM: Efficient and Effective Unsupervised Multi-Table Entity Matching;2024 IEEE 40th International Conference on Data Engineering (ICDE);2024-05-13