skip to main content
research-article
Open access

PrivBayes: Private Data Release via Bayesian Networks

Published: 27 October 2017 Publication History

Abstract

Privacy-preserving data publishing is an important problem that has been the focus of extensive study. The state-of-the-art solution for this problem is differential privacy, which offers a strong degree of privacy protection without making restrictive assumptions about the adversary. Existing techniques using differential privacy, however, cannot effectively handle the publication of high-dimensional data. In particular, when the input dataset contains a large number of attributes, existing methods require injecting a prohibitive amount of noise compared to the signal in the data, which renders the published data next to useless.
To address the deficiency of the existing methods, this paper presents PrivBayes, a differentially private method for releasing high-dimensional data. Given a dataset D, PrivBayes first constructs a Bayesian network N, which (i) provides a succinct model of the correlations among the attributes in D and (ii) allows us to approximate the distribution of data in D using a set P of low-dimensional marginals of D. After that, PrivBayes injects noise into each marginal in P to ensure differential privacy and then uses the noisy marginals and the Bayesian network to construct an approximation of the data distribution in D. Finally, PrivBayes samples tuples from the approximate distribution to construct a synthetic dataset, and then releases the synthetic data. Intuitively, PrivBayes circumvents the curse of dimensionality, as it injects noise into the low-dimensional marginals in P instead of the high-dimensional dataset D. Private construction of Bayesian networks turns out to be significantly challenging, and we introduce a novel approach that uses a surrogate function for mutual information to build the model more accurately. We experimentally evaluate PrivBayes on real data and demonstrate that it significantly outperforms existing solutions in terms of accuracy.

Supplementary Material

a25-zhang-apndx.pdf (zhang.zip)
Supplemental movie, appendix, image and software files for, PrivBayes: Private Data Release via Bayesian Networks

References

[1]
Kevin Bache and Moshe Lichman. 2013. UCI Machine Learning Repository (2013). Retrieved from http://archive.ics.uci.edu/ml.
[2]
Boaz Barak, Kamalika Chaudhuri, Cynthia Dwork, Satyen Kale, Frank McSherry, and Kunal Talwar. 2007. Privacy, accuracy, and consistency too: A holistic solution to contingency table release. In Proceedings of PODS. 273--282.
[3]
Roberto J. Bayardo and Rakesh Agrawal. 2005. Data privacy through optimal k-anonymization. In Proceedings of ICDE. 217--228.
[4]
Raghav Bhaskar, Srivatsan Laxman, Adam Smith, and Abhradeep Thakurta. 2010. Discovering frequent patterns in sensitive data. In Proceedings of KDD. 503--512.
[5]
Hayes Brian. 2002. Computing science: The easiest hard problem. American Scientist 90, 2 (2002), 113--117.
[6]
Chih-Chung Chang and Chih-Jen Lin. 2011. LIBSVM: A library for support vector machines. ACM TIST (2011), 27.
[7]
Kamalika Chaudhuri and Claire Monteleoni. 2008. Privacy-preserving logistic regression. In Proceedings of NIPS. 289--296.
[8]
Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. 2011. Differentially private empirical risk minimization. J. Mach. Learn. Res. 12 (2011), 1069--1109.
[9]
Rui Chen, Qian Xiao, Yu Zhang, and Jianliang Xu. 2015. Differentially private high-dimensional data publication via sampling-based inference. In Proceedings of SIGKDD. 129--138.
[10]
Yan Chen and Ashwin Machanavajjhala. 2015. On the privacy properties of variants on the sparse vector technique. CoRR abs/1508.07306 (2015).
[11]
David Maxwell Chickering, David Heckerman, and Christopher Meek. 2004. Large-sample learning of bayesian networks is NP-hard. J. Mach. Learn. Res. 5 (2004), 1287--1330.
[12]
C. K. Chow and C. N. Liu. 1968. Approximating discrete probability distributions with dependence trees. IEEE Trans. Info. Theory 14 (1968), 462--467.
[13]
CIA. 2015. The World Factbook 2014--15. Government Printing Office.
[14]
Graham Cormode, Cecilia Magdalena Procopiuc, Entong Shen, Divesh Srivastava, and Ting Yu. 2012. Differentially private spatial decompositions. In Proceedings of ICDE.
[15]
Graham Cormode, Cecilia Magdalena Procopiuc, Divesh Srivastava, and Thanh T. L. Tran. 2012. Differentially private publication of sparse data. In Proceedings of ICDT.
[16]
Aleksandr B. Cybakov. 2009. Introduction to Nonparametric Estimation. Springer.
[17]
Bolin Ding, Marianne Winslett, Jiawei Han, and Zhenhui Li. 2011. Differentially private data cubes: Optimizing noise sources and consistency. In Proceedings of SIGMOD. 217--228.
[18]
Cynthia Dwork. 2006. Differential privacy. In Proceedings of ICALP. 1--12.
[19]
Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private data analysis. In Proceedings of TCC. 265--284.
[20]
Cynthia Dwork and Aaron Roth. 2013. The algorithmic foundations of differential privacy. Theor. Comput. Sci. 9, 3--4 (2013), 211--407.
[21]
Dan Feldman, Amos Fiat, Haim Kaplan, and Kobbi Nissim. 2009. Private coresets. In Proceedings of STOC. 361--370.
[22]
Arik Friedman and Assaf Schuster. 2010. Data mining with differential privacy. In Proceedings of KDD. 493--502.
[23]
Marco Gaboardi, Emilio Jesús Gallego Arias, Justin Hsu, Aaron Roth, and Zhiwei Steven Wu. 2014. Dual query: Practical private query release for high dimensional data. In Proceedings of ICML. 1170--1178.
[24]
F. Gray. 1953. Pulse code communication (March 17 1953). Retrieved from https://www.google.com/patents/US2632058. U.S. Patent 2,632,058.
[25]
Moritz Hardt. 2011. A Study of Privacy and Fairness in Sensitive Data Analysis. Ph.D. Dissertation. Princeton University.
[26]
Moritz Hardt, Katrina Ligett, and Frank McSherry. 2012. A simple and practical algorithm for differentially private data release. In Proceedings of NIPS. 2348--2356.
[27]
Michael Hay, Vibhor Rastogi, Gerome Miklau, and Dan Suciu. 2010. Boosting the accuracy of differentially private histograms through consistency. PVLDB 3, 1 (2010), 1021--1032.
[28]
Vijay S. Iyengar. 2002. Transforming data to satisfy privacy constraints. In Proceedings of IGKDD. 279--288.
[29]
Daniel Kifer, Adam D. Smith, and Abhradeep Thakurta. 2012. Private convex optimization for empirical risk minimization with applications to high-dimensional regression. J. Mach. Learn. Res. Proc. Track 23 (2012), 25.1--25.40.
[30]
Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
[31]
Chao Li, Michael Hay, Vibhor Rastogi, Gerome Miklau, and Andrew McGregor. 2010. Optimizing linear counting queries under differential privacy. In Proceedings of PODS. 123--134.
[32]
Chao Li and Gerome Miklau. 2012. An adaptive mechanism for accurate query answering under differential privacy. PVLDB 5, 6 (2012), 514--525.
[33]
Chao Li and Gerome Miklau. 2013. Optimal error of query sets under the differentially-private matrix mechanism. In Proceedings of ICDT. 272--283.
[34]
Ninghui Li, Wahbeh Qardaji, Dong Su, and Jianneng Cao. 2012. PrivBasis: Frequent itemset mining with differential privacy. PVLDB 5, 11 (2012), 1340--1351.
[35]
Kenneth G. Manton. 2010. National long-term care survey: 1982, 1984, 1989, 1994, 1999, and 2004. (2010).
[36]
Dimitris Margaritis. 2003. Learning Bayesian Network Model Structure from Data. Ph.D. Dissertation. School of Computer Science, Carnegie-Mellon University, Pittsburgh, PA.
[37]
Frank McSherry and Ratul Mahajan. 2010. Differentially-private network trace analysis. In Proceedings of SIGCOMM. 123--134.
[38]
Frank McSherry and Ilya Mironov. 2009. Differentially private recommender systems: Building privacy into the netflix prize contenders. In Proceedings of KDD. 627--636.
[39]
Frank McSherry and Kunal Talwar. 2007. Mechanism design via differential privacy. In Proceedings of FOCS. 94--103.
[40]
Prashanth Mohan, Abhradeep Thakurta, Elaine Shi, Dawn Song, and David Culler. 2012. GUPT: Privacy preserving data analysis made easy. In Proceedings of SIGMOD. 349--360.
[41]
Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of STOC. 75--84.
[42]
Vibhor Rastogi and Suman Nath. 2010. Differentially private aggregation of distributed time-series with transformation and encryption. In Proceedings of SIGMOD. 735--746.
[43]
Benjamin I. P. Rubinstein, Peter L. Bartlett, Ling Huang, and Nina Taft. 2012. Learning in a large function space: Privacy-preserving mechanisms for SVM learning. J. Priv. Confident. 4, 1 (2012), 65--100.
[44]
Steven Ruggles, Katie Genadek, Ronald Goeken, Josiah Grover, and Matthew Sobek. 2015. Integrated Public Use Microdata Series: Version 6.0. (2015). Retrieved from https://international.ipums.org.
[45]
Adam Smith. 2011. Privacy-preserving statistical estimation with optimal convergence rate. In Proceedings of STOC.
[46]
Xiaokui Xiao, Guozhang Wang, and Johannes Gehrke. 2010. Differential privacy via wavelet transforms. In Proceedings of ICDE. 225--236.
[47]
Grigory Yaroslavtsev, Graham Cormode, Cecilia M. Procopiuc, and Divesh Srivastava. 2013. Accurate and efficient private release of datacubes and contingency tables. In Proceedings of ICDE. 745--756.
[48]
Ganzhao Yuan, Zhenjie Zhang, Marianne Winslett, Xiaokui Xiao, Yin Yang, and Zhifeng Hao. 2012. Low-rank mechanism: Optimizing batch queries under differential privacy. PVLDB 5, 11 (2012), 1352--1363.
[49]
Jun Zhang, Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava, and Xiaokui Xiao. 2014. PrivBayes: Private data release via bayesian networks. In Proceedings of SIGMOD. 1423--1434.
[50]
Jun Zhang, Xiaokui Xiao, Yin Yang, Zhenjie Zhang, and Marianne Winslett. 2013. PrivGene: Differentially private model fitting using genetic algorithms. In Proceedings of SIGMOD. 665--676.

Cited By

View all
  • (2025)Synthetic data as external control arms in scarce single-arm clinical trialsPLOS Digital Health10.1371/journal.pdig.00005814:1(e0000581)Online publication date: 23-Jan-2025
  • (2025)Preserving information while respecting privacy through an information theoretic framework for synthetic health data generationnpj Digital Medicine10.1038/s41746-025-01431-68:1Online publication date: 23-Jan-2025
  • (2025)Alternating minimization differential privacy protection algorithm for the novel dual-mode learning tasks modelExpert Systems with Applications10.1016/j.eswa.2024.125279259(125279)Online publication date: Jan-2025
  • Show More Cited By

Index Terms

  1. PrivBayes: Private Data Release via Bayesian Networks

    Recommendations

    Comments

    Information & Contributors

    Information

    Published In

    cover image ACM Transactions on Database Systems
    ACM Transactions on Database Systems  Volume 42, Issue 4
    Invited Paper from SIGMOD 2016, Invited Paper from PODS 2016, Invited Paper from ICDT 2016 and Regular Papers
    December 2017
    241 pages
    ISSN:0362-5915
    EISSN:1557-4644
    DOI:10.1145/3155316
    Issue’s Table of Contents
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 27 October 2017
    Accepted: 01 August 2017
    Revised: 01 March 2017
    Received: 01 July 2016
    Published in TODS Volume 42, Issue 4

    Permissions

    Request permissions for this article.

    Check for updates

    Author Tags

    1. Differential privacy
    2. bayesian network
    3. synthetic data generation

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • DSAIR center at Nanyang Technological University
    • AT8T
    • European Commission Marie Curie Integration
    • Ministry of Education (Singapore)
    • Microsoft Research Asia

    Contributors

    Other Metrics

    Bibliometrics & Citations

    Bibliometrics

    Article Metrics

    • Downloads (Last 12 months)1,867
    • Downloads (Last 6 weeks)199
    Reflects downloads up to 20 Jan 2025

    Other Metrics

    Citations

    Cited By

    View all
    • (2025)Synthetic data as external control arms in scarce single-arm clinical trialsPLOS Digital Health10.1371/journal.pdig.00005814:1(e0000581)Online publication date: 23-Jan-2025
    • (2025)Preserving information while respecting privacy through an information theoretic framework for synthetic health data generationnpj Digital Medicine10.1038/s41746-025-01431-68:1Online publication date: 23-Jan-2025
    • (2025)Alternating minimization differential privacy protection algorithm for the novel dual-mode learning tasks modelExpert Systems with Applications10.1016/j.eswa.2024.125279259(125279)Online publication date: Jan-2025
    • (2025)Privacy Re‐Identification Attacks on Tabular GANsSecurity and Privacy10.1002/spy2.4698:1Online publication date: 12-Jan-2025
    • (2024)The real deal behind the artificial appealProceedings of the Fortieth Conference on Uncertainty in Artificial Intelligence10.5555/3702676.3702722(966-996)Online publication date: 15-Jul-2024
    • (2024)Bounding the excess risk for linear models trained on marginal-preserving, differentially-private, synthetic dataProceedings of the 41st International Conference on Machine Learning10.5555/3692070.3694635(61979-62001)Online publication date: 21-Jul-2024
    • (2024)Finnish perspective on using synthetic health data to protect privacy: the PRIVASA projectApplied Computing and Intelligence10.3934/aci.20240094:2(138-163)Online publication date: 2024
    • (2024)Human-Unrecognizable Differential Private Noised Image Generation MethodSensors10.3390/s2410316624:10(3166)Online publication date: 16-May-2024
    • (2024)A Range Query Scheme for Spatial Data with Shuffled Differential PrivacyMathematics10.3390/math1213193412:13(1934)Online publication date: 21-Jun-2024
    • (2024)SETSUBUN: Revisiting Membership Inference Game for Evaluating Synthetic Data GenerationJournal of Information Processing10.2197/ipsjjip.32.75732(757-766)Online publication date: 2024
    • Show More Cited By

    View Options

    View options

    PDF

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader

    Login options

    Full Access

    Media

    Figures

    Other

    Tables

    Share

    Share

    Share this Publication link

    Share on social media