Saibot: A Differentially Private Data Search Platform

Authors:

Huang Zezhou¹, Liu Jiaxiang¹, Alabi Daniel Gbenga¹, Fernandez Raul Castro², Wu Eugene³

Affiliation:

1. Columbia University

2. University of Chicago

3. DSI, Columbia University

Abstract

Recent data search platforms use ML task-based utility measures, rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset, and these platforms search for augmentations (join- or union-compatible datasets) that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although this approach is effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating models on datasets hundreds or thousands of times, quickly depleting privacy budgets. We present Saibot, a differentially private data search platform that employs the Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once and can then be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests while minimizing the effect of DP noise on search results. We optimize the sensitivity of FPM for common augmentation operations and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization that redistributes DP noise to minimize its impact on the model. Our evaluation on a real-world corpus of 329 datasets demonstrates that Saibot can return augmentations that achieve model accuracy within 50-90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse.
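The "privatize once, reuse freely" idea can be illustrated with a minimal sketch for linear regression. This is not the FPM mechanism from the paper: the function names, the Gaussian noise, and the `noise_scale` value are assumptions made for illustration, and the noise is not calibrated to any (ε, δ) guarantee. The sketch only shows that a Gram matrix is sufficient to fit the model, that statistics of unioned datasets add, and that one noisy copy can serve several fits.

```python
import numpy as np

def sufficient_stats(X, y):
    """Gram matrix of [X | y]; it contains X^T X, X^T y, and y^T y."""
    Z = np.column_stack([X, y])
    return Z.T @ Z

def privatize_once(stats, noise_scale, rng):
    """Add symmetric Gaussian noise a single time.
    noise_scale is a placeholder, NOT calibrated to a DP guarantee."""
    noise = rng.normal(0.0, noise_scale, size=stats.shape)
    noise = np.triu(noise) + np.triu(noise, 1).T  # mirror the upper triangle
    return stats + noise

def fit_from_stats(stats, ridge=1e-6):
    """Solve the normal equations from the statistics alone (no raw rows)."""
    d = stats.shape[0] - 1  # the last row/column belongs to the target y
    xtx, xty = stats[:d, :d], stats[:d, d]
    return np.linalg.solve(xtx + ridge * np.eye(d), xty)

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + rng.normal(scale=0.1, size=5000)

# Union of two row partitions: the statistics simply add.
union_stats = sufficient_stats(X[:2500], y[:2500]) + sufficient_stats(X[2500:], y[2500:])

# Privatize once, then reuse the same noisy statistics for several fits.
noisy = privatize_once(union_stats, noise_scale=1.0, rng=rng)
w_all = fit_from_stats(noisy)                 # all three features
keep = [0, 1, 3]                              # features 0 and 1, plus the target
w_subset = fit_from_stats(noisy[np.ix_(keep, keep)])
```

Because both fits read the same noisy matrix, evaluating additional feature subsets in this sketch requires no further access to the raw rows.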

Publisher

Association for Computing Machinery (ACM)

Subject

General Earth and Planetary Sciences; Water Science and Technology; Geography, Planning and Development

