Improved Assessment of the Accuracy of Record Linkage via an Extended MaCSim Approach

Author:

Haque Shovanur¹, Mengersen Kerrie¹

Affiliation:

1. School of Mathematical Sciences, Queensland University of Technology, 2 George Street, Brisbane, QLD 4000, Australia.

Abstract

Record linkage is the process of bringing together records for the same entity from overlapping data sources while removing duplicates. Public and private organizations, as well as researchers and individuals, now collect huge amounts of data. Linking and analysing relevant information from this massive data reservoir can provide new insights into society, so effective and efficient methods for linking data from different sources have become increasingly important. It is therefore necessary to assess how accurately a linking method performs, and to compare methods with respect to accuracy. In this article, we improve on a Markov Chain based Monte Carlo simulation approach (MaCSim) for assessing a linking method. The proposed improvement calculates a similarity weight for every linking variable value in each record pair, which allows partial agreement between linking variable values. To assess the accuracy of the linking method, the proportion of correct links is investigated for each record. The extended MaCSim approach is illustrated using a synthetic data set provided by the Australian Bureau of Statistics and based on realistic data settings. Test results show that the assessment of the linkages is highly accurate.
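To make the idea of partial agreement concrete, the following is a minimal sketch of a per-variable similarity weight for one record pair. The abstract does not state which similarity function the extended MaCSim approach uses, so the choice of Python's `difflib.SequenceMatcher` ratio is an assumption made purely for illustration, as are the function names `similarity_weight` and `pair_weights` and the example field names and values.

```python
from difflib import SequenceMatcher


def similarity_weight(a, b):
    """Return a similarity in [0, 1] for two linking variable values.

    1.0 means exact agreement; values between 0 and 1 express partial
    agreement. SequenceMatcher is an illustrative stand-in, not the
    similarity measure used in the paper.
    """
    a = str(a).strip().lower()
    b = str(b).strip().lower()
    return SequenceMatcher(None, a, b).ratio()


def pair_weights(rec_a, rec_b, linking_vars):
    """Compute one similarity weight per linking variable for a record pair."""
    return {v: similarity_weight(rec_a[v], rec_b[v]) for v in linking_vars}


# Hypothetical records from two files to be linked.
rec_a = {"first_name": "Jonathan", "surname": "Smith", "postcode": "4000"}
rec_b = {"first_name": "Jonathon", "surname": "Smith", "postcode": "4001"}

weights = pair_weights(rec_a, rec_b, ["first_name", "surname", "postcode"])
for var, w in weights.items():
    print(var, round(w, 3))
```

Under a strict agree/disagree comparison, the first names and postcodes in this pair would both count as full disagreements; a similarity weight instead records that they nearly match, which is the kind of partial agreement the abstract describes.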

Publisher

Walter de Gruyter GmbH

