No ground truth? No problem: Improving administrative data linking using active learning and a little bit of guile

Published:2023-04-04 Issue:4 Volume:18 Page:e0283811
ISSN:1932-6203
Container-title:PLOS ONE
language:en
Short-container-title:PLoS ONE

Author:

Tahamont Sarah, Jelveh Zubin, McNeill Melissa, Yan Shi, Chalfin Aaron, Hansen Benjamin

Abstract

While linking records across large administrative datasets (“big data”) has the potential to revolutionize empirical social science research, many administrative data files do not have common identifiers and are thus not designed to be linked to others. To address this problem, researchers have developed probabilistic record linkage algorithms that use statistical patterns in identifying characteristics to perform linking tasks. Naturally, the accuracy of a candidate linking algorithm can be substantially improved when the algorithm has access to “ground-truth” examples: matches which can be validated using institutional knowledge or auxiliary data. Unfortunately, the cost of obtaining these examples is typically high, often requiring a researcher to manually review pairs of records in order to make an informed judgement about whether they are a match. When a pool of ground-truth information is unavailable, researchers can use “active learning” algorithms for linking, which ask the user to provide ground-truth information for select candidate pairs. In this paper, we investigate the value of providing ground-truth examples via active learning for linking performance. We confirm popular intuition that data linking can be dramatically improved with the availability of ground-truth examples. But critically, in many real-world applications, only a relatively small number of tactically selected ground-truth examples are needed to obtain most of the achievable gains. With a modest investment in ground truth, researchers can approximate the performance of a supervised learning algorithm that has access to a large database of ground-truth examples using a readily available off-the-shelf tool.

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

Cited by 1 article.

1. The problem with criminal records: Discrepancies between state reports and private‐sector background checks;Criminology;2024-02