The merge/purge problem for large databases

Published:1995-05-22 Issue:2 Volume:24 Page:127-138
ISSN:0163-5808
Container-title:ACM SIGMOD Record
language:en
Short-container-title:SIGMOD Rec.

Author:

Hernández Mauricio A.¹, Stolfo Salvatore J.¹

Affiliation:

1. Department of Computer Science, Columbia University, New York, NY

Abstract

Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge and present experimental results that demonstrate this approach may work well in practice but at great expense. An alternative method based upon clustering is also presented with a comparative evaluation to the sorted neighborhood method. We show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the transitive closure over the results of independent runs, considering alternative primary key attributes in each pass.
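
The abstract names two ideas that can be illustrated briefly: a sorted neighborhood pass (sort the records on a key, then compare only records that fall within a small sliding window) and a multi-pass variant that merges the matches of several passes by transitive closure. The Python sketch below is an illustration under stated assumptions and is not the authors' implementation. The sample records, the key functions `by_name` and `by_zip`, the window size, and the "two of three fields agree" rule in `is_match` are all hypothetical stand-ins; the paper's own matching rules are not reproduced in this record.

```python
def sorted_neighborhood(records, key, window, is_match):
    """One pass: sort by `key`, compare each record with the next window-1 records."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n, pairs):
    """Group record indices connected by any chain of matched pairs (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def multi_pass(records, keys, window, is_match):
    """Run one independent pass per key, then close the union of all matches."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood(records, key, window, is_match)
    return transitive_closure(len(records), pairs)


# Hypothetical sample data: records 0, 1 and 2 describe the same person.
records = [
    {"name": "Maria Lopez", "zip": "10027", "phone": "555-0101"},
    {"name": "Maria Lopez", "zip": "10072", "phone": "555-0101"},   # zip digits swapped
    {"name": "Lopez, Marla", "zip": "10072", "phone": "555-0101"},  # name garbled
    {"name": "John Smith", "zip": "02115", "phone": "555-0404"},
    {"name": "Mary Jones", "zip": "10050", "phone": "555-0300"},
]


def is_match(a, b):
    # Assumed toy rule: two of the three fields must agree exactly.
    return sum(a[f] == b[f] for f in ("name", "zip", "phone")) >= 2


def by_name(r):
    return r["name"].lower()


def by_zip(r):
    return r["zip"]


clusters = multi_pass(records, [by_name, by_zip], window=2, is_match=is_match)
print(clusters)  # [[0, 1, 2], [3], [4]]
```

With a window of 2, the name pass finds only the pair (0, 1) and the zip pass finds only (1, 2); records 0 and 2 are never matched directly, yet the closure places all three in one cluster. This is the effect the abstract attributes to running independent passes on alternative keys.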

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems, Software

Link

https://dl.acm.org/doi/pdf/10.1145/568271.223807

References: 11 articles.

1. Automatic correction to misspelled names

2. Duplicate record elimination in large data files

Cited by 239 articles.

1. SETEM: Self-ensemble training with Pre-trained Language Models for Entity Matching;Knowledge-Based Systems;2024-06

2. On tuning parameters guiding similarity computations in a data deduplication pipeline for customers records;Information Systems;2024-03

3. SC-Block: Supervised Contrastive Blocking Within Entity Resolution Pipelines;Lecture Notes in Computer Science;2024

4. Train Once, Match Everywhere: Harnessing Generative Language Models for Entity Matching;2023 International Conference on Computational Science and Computational Intelligence (CSCI);2023-12-13

5. SNIP: An adaptation of sorted neighborhood methods for deduplicating pedigree data;The Annals of Applied Statistics;2023-09-01