Simulated data for census-scale entity resolution research without privacy restrictions: a large-scale dataset generated by individual-based modeling

Authors:

Haddock Beatrix, Pletcher Alix, Blair-Stahn Nathaniel, Keyes Os, Kappel Matt, Bachmeier Steve, Lutze Syl, Albright James, Bowman Alison, Kinuthia Caroline, Burke-Conte Zeb, Mudambi Rajan, Flaxman Abraham

Abstract

Background: Entity resolution (ER) is the process of identifying and linking records that refer to the same real-world entity. ER is a fundamental challenge in data science, and a common barrier to ER research and development is that the data fields used for this fuzzy matching are personally identifiable information, such as name, address, and date of birth. The necessary restrictions on accessing and sharing these authentic data have slowed work on developing, testing, and adopting new methods and software for ER. We recently released pseudopeople, a Python package that allows users to generate simulated datasets approaching the scale and complexity of the data on which large organizations and federal agencies, such as the US Census Bureau, regularly perform ER. With pseudopeople, researchers can develop new algorithms and software for ER of US population data without needing access to personal and confidential information.

Methods: We created the simulated population data available through pseudopeople using our Vivarium simulation platform. Our model simulates individuals and their families, households, and employment dynamics over time, which we observe through simulated censuses, surveys, and administrative data collection systems.

Results: Our simulation process produced over 900 gigabytes of simulated censuses, surveys, and administrative data for pseudopeople, representing hundreds of millions of simulants. A sample simulated population of thousands of simulants is now openly available to all users of the pseudopeople package, and large-scale simulated populations of millions and hundreds of millions of simulants are also available by online request through GitHub. These simulated population data are structured for use by the pseudopeople package, which includes additional affordances for adding various kinds of noise to the data, providing realistic, shareable challenges for ER researchers.
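To make the idea of noised records as an ER challenge concrete, the following is a minimal, self-contained sketch using only the Python standard library. It does not use pseudopeople or reproduce its API or noise model; the toy records, the `truth_id` field, and the functions `add_typo`, `make_noisy`, `similarity`, and `link` are all hypothetical names introduced here for illustration. It corrupts name fields with single-character typos and then links each noisy record back to its most similar clean record.

```python
import random
from difflib import SequenceMatcher

# Hypothetical toy records (assumption: not drawn from pseudopeople's data).
CLEAN = [
    {"truth_id": 1, "first_name": "Maria", "last_name": "Gonzalez", "dob": "1984-03-12"},
    {"truth_id": 2, "first_name": "James", "last_name": "Oconnor", "dob": "1990-11-02"},
    {"truth_id": 3, "first_name": "Aisha", "last_name": "Rahman", "dob": "1975-07-30"},
]

MATCH_FIELDS = ("first_name", "last_name", "dob")


def add_typo(value, rng):
    """Replace one randomly chosen character with a random lowercase letter."""
    if not value:
        return value
    i = rng.randrange(len(value))
    return value[:i] + rng.choice("abcdefghijklmnopqrstuvwxyz") + value[i + 1:]


def make_noisy(records, rng, typo_prob=0.5):
    """Return copies of the records with typos in some name fields."""
    noisy = []
    for rec in records:
        out = dict(rec)  # copy so the clean records are left untouched
        for field in ("first_name", "last_name"):
            if rng.random() < typo_prob:
                out[field] = add_typo(out[field], rng)
        noisy.append(out)
    return noisy


def similarity(a, b):
    """Mean case-insensitive string similarity over the matching fields."""
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in MATCH_FIELDS
    ]
    return sum(scores) / len(scores)


def link(noisy, clean, threshold=0.7):
    """Map each noisy record's truth_id to the truth_id of its best match."""
    links = {}
    for n in noisy:
        best = max(clean, key=lambda c: similarity(n, c))
        if similarity(n, best) >= threshold:
            links[n["truth_id"]] = best["truth_id"]
    return links


if __name__ == "__main__":
    noisy_records = make_noisy(CLEAN, random.Random(0))
    for rec in noisy_records:
        print(rec)
    print(link(noisy_records, CLEAN))
```

Because the simulated records carry a ground-truth identifier, the links produced by any ER method can be scored directly against it, which is what makes simulated data of this kind useful for benchmarking. Real ER at census scale would additionally need blocking or indexing to avoid comparing every pair of records.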

Funder

Bill and Melinda Gates Foundation

U.S. Census Bureau

Publisher

F1000 Research Ltd
