Affiliation:
1. Department of Computer Science, Columbia University, New York, NY
Abstract
Many commercial organizations routinely gather large numbers of databases for various marketing and business analysis functions. The task is to correlate information from different databases by identifying distinct individuals that appear in a number of different databases, typically in an inconsistent and often incorrect fashion. The problem we study here is the task of merging data from multiple sources in as efficient a manner as possible, while maximizing the accuracy of the result. We call this the merge/purge problem. In this paper we detail the sorted neighborhood method that is used by some to solve merge/purge and present experimental results that demonstrate this approach may work well in practice but at great expense. An alternative method based upon clustering is also presented with a comparative evaluation to the sorted neighborhood method. We show a means of improving the accuracy of the results based upon a multi-pass approach that succeeds by computing the Transitive Closure over the results of independent runs considering alternative primary key attributes in each pass.
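The multi-pass idea can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the window size, the toy records, and the matching rule `is_match` are all assumptions made for illustration. Each pass sorts the records on one key and compares only records that fall within a small sliding window of each other. The matched pairs from all passes are then merged by a transitive closure, computed here with a simple union-find.

```python
def sorted_neighborhood(records, key, window, is_match):
    """One pass: sort on `key`, compare each record with the window-1 before it."""
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if is_match(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return pairs


def transitive_closure(n, pairs):
    """Group record indices 0..n-1 into equivalence classes (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def multi_pass(records, keys, window, is_match):
    """Run one independent pass per key, then close over the union of matches."""
    pairs = set()
    for key in keys:
        pairs |= sorted_neighborhood(records, key, window, is_match)
    return transitive_closure(len(records), pairs)


def is_match(a, b):
    """Toy rule (assumption): same SSN, or same last name and SSN off by <= 1 digit."""
    if a["ssn"] == b["ssn"]:
        return True
    diff = sum(x != y for x, y in zip(a["ssn"], b["ssn"]))
    return a["last"] == b["last"] and diff <= 1


# Hypothetical records: 0, 1 and 2 are meant to be the same person.
records = [
    {"last": "Smith", "ssn": "123456789"},
    {"last": "Smyth", "ssn": "123456789"},  # misspelled name
    {"last": "Smith", "ssn": "723456789"},  # one wrong SSN digit
    {"last": "Jones", "ssn": "555000111"},
]
by_last = lambda r: r["last"]
by_ssn = lambda r: r["ssn"]

single = multi_pass(records, [by_last], window=2, is_match=is_match)
multi = multi_pass(records, [by_last, by_ssn], window=2, is_match=is_match)
```

In this toy data, the single pass sorted on last name groups only records 0 and 2, because the misspelled "Smyth" is sorted next to a record it does not match. Adding a second pass sorted on SSN links records 0 and 1, and the closure then places 0, 1 and 2 in one group.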
Publisher
Association for Computing Machinery (ACM)
Subject
Information Systems, Software
Cited by
239 articles.