Transforming Pairwise Duplicates to Entity Clusters for High-quality Duplicate Detection

Published:2020-01-23 Issue:1 Volume:12 Page:1-30
ISSN:1936-1955
Container-title:Journal of Data and Information Quality
language:en
Short-container-title:J. Data and Information Quality

Author:

Draisbach Uwe¹, Christen Peter², Naumann Felix¹

Affiliation:

1. Hasso-Plattner-Institute, University of Potsdam, Potsdam, Germany

2. Australian National University, Canberra, Australia

Abstract

Duplicate detection algorithms produce clusters of database records, each cluster representing a single real-world entity. As most of these algorithms use pairwise comparisons, the resulting (transitive) clusters can be inconsistent: Not all records within a cluster are sufficiently similar to be classified as duplicate. Thus, one of many subsequent clustering algorithms can further improve the result. We explain in detail, compare, and evaluate many of these algorithms and introduce three new clustering algorithms in the specific context of duplicate detection. Two of our three new algorithms use the structure of the input graph to create consistent clusters. Our third algorithm, and many other clustering algorithms, focus on the edge weights, instead. For evaluation, in contrast to related work, we experiment on true real-world datasets, and in addition examine in great detail various pair-selection strategies used in practice. While no overall winner emerges, we are able to identify best approaches for different situations. In scenarios with larger clusters, our proposed algorithm, Extended Maximum Clique Clustering (EMCC), and Markov Clustering show the best results. EMCC especially outperforms Markov Clustering regarding the precision of the results and additionally has the advantage that it can also be used in scenarios where edge weights are not available.
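The inconsistency described in the abstract can be illustrated with a minimal sketch. It is not the paper's EMCC algorithm or any of the evaluated methods. It only assumes that the input is a list of record pairs classified as duplicates, builds the transitive clusters with a union-find structure, and then treats a cluster as consistent when every pair of its records was itself classified as a duplicate (that is, when the cluster is a clique). The record identifiers and the function names `transitive_clusters` and `is_consistent` are hypothetical.

```python
from itertools import combinations


def transitive_clusters(pairs):
    """Group records into connected components of the duplicate-pair graph."""
    parent = {}

    def find(x):
        # Return the representative of x's cluster, registering x if new.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for record in list(parent):
        clusters.setdefault(find(record), set()).add(record)
    return sorted(sorted(members) for members in clusters.values())


def is_consistent(cluster, pairs):
    """Assumed consistency check: every pair in the cluster is a known duplicate."""
    edges = {frozenset(pair) for pair in pairs}
    return all(frozenset((a, b)) in edges for a, b in combinations(cluster, 2))


# Hypothetical input: r1~r2 and r2~r3 were matched, but r1~r3 was not.
pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
clusters = transitive_clusters(pairs)
consistency = [is_consistent(cluster, pairs) for cluster in clusters]
```

Here the transitive closure places r1, r2, and r3 in one cluster even though r1 and r3 were never classified as duplicates, so that cluster fails the check, while the cluster of r4 and r5 passes. A subsequent clustering step of the kind the abstract discusses would decide how to split or repair such a cluster.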

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems and Management, Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/3352591

Reference 41 articles.

1. The Star Clustering Algorithm for Static and Dynamic Information Organization

2. Correlation Clustering

3. Swoosh: a generic approach to entity resolution

4. Algorithm 457: finding all cliques of an undirected graph

Cited by 12 articles.

1. Research on Hybrid Data Clustering Algorithm for Wireless Communication Intelligent Bracelets;Mobile Networks and Applications;2023-09-19

2. Clustering Heterogeneous Data Values for Data Quality Analysis;Journal of Data and Information Quality;2023-08-22

3. Context Extraction in Unsupervised Entity Resolution;2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE);2023-07-24

4. An analysis of one-to-one matching algorithms for entity resolution;The VLDB Journal;2023-04-18

5. More extreme duplication in FDA Adverse Event Reporting System detected by literature reference normalization and fuzzy string matching;Pharmacoepidemiology and Drug Safety;2022-12-09