Convergence Diagnostics for Entity Resolution

Authors:

Serge Aleshin-Guendel (1), Rebecca C. Steorts (1, 2, 3)

Affiliations:

1. Department of Statistical Science, Duke University, Durham, North Carolina, USA; email: serge.aleshin-guendel@duke.edu

2. Department of Computer Science, Department of Biostatistics and Bioinformatics, the Rhodes Information Initiative at Duke (iiD), and the Social Science Research Institute (SSRI), Duke University, Durham, North Carolina, USA

3. Center for Statistical Research and Methodology, United States Census Bureau, Suitland, Maryland, USA

Abstract

Entity resolution is the process of merging and removing duplicate records from multiple data sources, often in the absence of unique identifiers. Bayesian models for entity resolution allow one to include a priori information, quantify uncertainty in important applications, and directly estimate a partition of the records. Markov chain Monte Carlo (MCMC) sampling is the primary computational method for approximate posterior inference in this setting, but due to the high dimensionality of the space of partitions, there are no agreed upon standards for diagnosing nonconvergence of MCMC sampling. In this article, we review Bayesian entity resolution, with a focus on the specific challenges that it poses for the convergence of a Markov chain. We review prior methods for convergence diagnostics, discussing their weaknesses. We provide recommendations for using MCMC sampling for Bayesian entity resolution, focusing on the use of modern diagnostics that are commonplace in applied Bayesian statistics. Using simulated data, we find that a commonly used Gibbs sampler performs poorly compared with two alternatives.
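The abstract does not name the specific diagnostics it recommends. As one illustration of a diagnostic that is common in applied Bayesian work, the sketch below computes split R-hat on a scalar summary of each sampled partition. The choice of summary (the number of clusters) and the function names `num_clusters` and `split_rhat` are assumptions made for this example, not the authors' method.

```python
import numpy as np


def num_clusters(labels):
    """Scalar summary of one sampled partition: the number of distinct clusters.

    `labels` is a hypothetical encoding in which labels[i] is the cluster
    (entity) identifier assigned to record i.
    """
    return len(set(labels))


def split_rhat(draws):
    """Split R-hat for a scalar summary.

    `draws` has shape (n_chains, n_iterations), one row per MCMC chain.
    Values near 1 are consistent with the chains agreeing; values well
    above 1 indicate nonconvergence. The within-chain variance must be
    nonzero for the ratio to be defined.
    """
    draws = np.asarray(draws, dtype=float)
    n_chains, n_iter = draws.shape
    half = n_iter // 2

    # Split every chain into a first and second half (dropping the middle
    # draw when n_iter is odd) so that drift inside a single chain shows
    # up as between-chain variation.
    halves = np.concatenate([draws[:, :half], draws[:, n_iter - half:]], axis=0)

    chain_means = halves.mean(axis=1)
    within = halves.var(axis=1, ddof=1).mean()   # W: mean within-chain variance
    between = half * chain_means.var(ddof=1)     # B: between-chain variance

    # Pooled estimate of the posterior variance of the summary.
    var_plus = (half - 1) / half * within + between / half
    return float(np.sqrt(var_plus / within))
```

In use, one would run several chains, map each sampled partition to `num_clusters(labels)`, stack the results into a chains-by-iterations array, and pass that array to `split_rhat`. A single scalar summary cannot certify convergence over the full space of partitions, so a value near 1 here is best read as the absence of one particular warning sign.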

Publisher

Annual Reviews

