Author:
Prediger Lukas, Jälkö Joonas, Honkela Antti, Kaski Samuel
Abstract
Background
Consider a setting where multiple parties holding sensitive data aim to collaboratively learn population level statistics, but pooling the sensitive data sets is not possible due to privacy concerns and parties are unable to engage in centrally coordinated joint computation. We study the feasibility of combining privacy preserving synthetic data sets in place of the original data for collaborative learning on real-world health data from the UK Biobank.
Methods
We perform an empirical evaluation based on an existing prospective cohort study from the literature. Multiple parties were simulated by splitting the UK Biobank cohort along assessment centres, and for each simulated party we generate synthetic data using differentially private generative modelling techniques. We then apply the original study's Poisson regression analysis to the combined synthetic data sets and evaluate the effects of (1) the size of the local data sets, (2) the number of participating parties, and (3) local shifts in distributions on the obtained likelihood scores.
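The pipeline described above (each party fits a differentially private generative model locally, shares only synthetic samples, and a standard Poisson regression is run on the pooled synthetic data) can be illustrated with a minimal sketch. This is not the study's actual method: here the DP generative model is simplified to a Laplace-perturbed joint histogram over a single binary covariate and a clipped count outcome, and the regression is fit with a plain Newton/IRLS loop; all names, the number of parties, and the privacy budget `eps` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_synthetic(x, y, eps, y_max=10):
    """Toy DP generative model: noisy joint histogram over (x, y).

    Adding/removing one record changes exactly one cell by 1, so the
    L1 sensitivity is 1 and Laplace noise of scale 1/eps gives eps-DP.
    """
    y = np.clip(y, 0, y_max)               # bound the outcome's support
    hist = np.zeros((2, y_max + 1))
    np.add.at(hist, (x, y), 1)             # joint counts of (x, y)
    noisy = hist + rng.laplace(scale=1.0 / eps, size=hist.shape)
    noisy = np.clip(noisy, 0, None)        # post-process: no negative mass
    probs = noisy.ravel() / noisy.sum()
    idx = rng.choice(probs.size, size=len(x), p=probs)
    xs, ys = np.unravel_index(idx, hist.shape)
    return xs, ys

def poisson_irls(X, y, iters=25):
    """Fit log E[y] = X @ beta by Newton's method on the Poisson likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        mu = np.exp(X @ beta)
        grad = X.T @ (y - mu)
        hess = (X * mu[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

# Simulate four parties, each releasing only synthetic data.
true_beta = np.array([0.2, 0.5])           # ground-truth regression parameters
synthetic_parts = []
for _ in range(4):
    x = rng.integers(0, 2, 500)
    y = rng.poisson(np.exp(true_beta[0] + true_beta[1] * x))
    synthetic_parts.append(dp_synthetic(x, y, eps=1.0))

# Pool the synthetic releases and run the downstream analysis.
xs = np.concatenate([p[0] for p in synthetic_parts])
ys = np.concatenate([p[1] for p in synthetic_parts]).astype(float)
X = np.column_stack([np.ones(xs.size), xs.astype(float)])
beta_hat = poisson_irls(X, ys)
print(beta_hat)  # should land near true_beta, up to DP and sampling noise
```

The design point this sketch makes concrete is that privacy protection happens entirely inside `dp_synthetic`: the downstream Poisson regression is completely standard and never sees the sensitive records, which is what lets uncoordinated parties reuse an existing analysis on the pooled release.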
Results
We discover that parties engaging in the collaborative learning via shared synthetic data obtain more accurate estimates of the regression parameters compared to using only their local data. This finding extends to the difficult case of small heterogeneous data sets. Furthermore, the more parties participate, the larger and more consistent the improvements become up to a certain limit. Finally, we find that data sharing can especially help parties whose data contain underrepresented groups to perform better-adjusted analysis for said groups.
Conclusions
Based on our results, we conclude that sharing synthetic data is a viable method for enabling learning from sensitive data without violating privacy constraints, even if individual data sets are small or do not represent the overall population well. Lack of access to distributed sensitive data is often a bottleneck in biomedical research, and our study shows that it can be alleviated with privacy-preserving collaborative learning methods.
Funder
Research Council of Finland
European Union
Strategic Research Council (SRC) established within the Research Council of Finland
UK Research and Innovation
Publisher
Springer Science and Business Media LLC