Reach for gold

Published:2014-09-04 Issue:1-2 Volume:5 Page:1-25
ISSN:1936-1955
Container-title:Journal of Data and Information Quality
language:en
Short-container-title:J. Data and Information Quality

Author:

Vogel Tobias¹, Heise Arvid¹, Draisbach Uwe¹, Lange Dustin¹, Naumann Felix¹

Affiliation:

1. Hasso Plattner Institute

Abstract

Duplicates in a database are one of the prime causes of poor data quality and are at the same time among the most difficult data quality problems to alleviate. To detect and remove such duplicates, many commercial and academic products and methods have been developed. The evaluation of such systems is usually in need of pre-classified results. Such gold standards are often expensive to come by (much manual classification is necessary), not representative (too small or too synthetic), and proprietary and thus preclude repetition (company-internal data). This lament has been uttered in many papers and even more paper reviews. The proposed annealing standard is a structured set of duplicate detection results, some of which are manually verified and some of which are merely validated by many classifiers. As more and more classifiers are evaluated against the annealing standard, more and more results are verified and validation becomes more and more confident. We formally define gold, silver, and the annealing standard and their maintenance. Experiments show how quickly an annealing standard converges to a gold standard. Finally, we provide an annealing standard for 750,000 CDs to the duplicate detection community.

Publisher

Association for Computing Machinery (ACM)

Subject

Information Systems and Management,Information Systems

Link

https://dl.acm.org/doi/pdf/10.1145/2629687

Reference: 41 articles.

1. Adaptive duplicate detection using learnable string similarity measures

Cited by 6 articles.

1. Frost;Proceedings of the VLDB Endowment;2022-08

2. All-Three: Near-optimal and domain-independent algorithms for near-duplicate detection;Array;2021-09

3. Leveraging active learning to reduce human effort in the generation of ground‐truth for entity resolution;Computational Intelligence;2020-05

4. Generating automatically labeled data for author name disambiguation: an iterative clustering method;Scientometrics;2018-11-29

5. Dynamical order construction in data fusion;Information Fusion;2016-01