HT-Fed-GAN: Federated Generative Model for Decentralized Tabular Data Synthesis
Author:
Duan Shaoming, Liu Chuanyi, Han Peiyi, Jin Xiaopeng, Zhang Xinyi, He Tianyu, Pan Hezhong, Xiang Xiayu
Abstract
In this paper, we study the problem of privacy-preserving data synthesis (PPDS) for tabular data in a distributed multi-party environment. In decentralized settings, existing PPDS methods rely on federated generative models with differential privacy. Unfortunately, these models apply only to image or text data, not to tabular data. Unlike images, tabular data usually consist of mixed data types (discrete and continuous attributes), and real-world datasets often have highly imbalanced data distributions. Existing methods can hardly model such scenarios because of the multimodal distributions in the decentralized continuous columns and the highly imbalanced categorical attributes of the clients. To solve these problems, we propose HT-Fed-GAN, a federated generative model for decentralized tabular data synthesis. HT-Fed-GAN has three key components: a federated variational Bayesian Gaussian mixture model (Fed-VB-GMM), designed to handle multimodal distributions; federated conditional one-hot encoding with conditional sampling, for global categorical attribute representation and rebalancing; and a privacy consumption-based federated conditional GAN, for privacy-preserving decentralized data modeling. Experimental results on five real-world datasets show that HT-Fed-GAN obtains the best trade-off between data utility and privacy level. In terms of data utility, the tables generated by HT-Fed-GAN are the most statistically similar to the original tables, and the evaluation scores show that HT-Fed-GAN outperforms the state-of-the-art model on machine learning tasks.
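The idea of a global categorical representation with rebalanced conditional sampling can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not specify the encoding or sampling rule, so the sketch assumes that clients share per-category counts with a server and that conditions are drawn with log-frequency weights (a rule borrowed from centralized conditional tabular GANs). All function names are hypothetical, and sharing raw counts as done here would not by itself satisfy differential privacy.

```python
import math
import random
from collections import Counter


def aggregate_counts(client_columns):
    """Sum per-client category counts into one global count table (assumed protocol)."""
    total = Counter()
    for column in client_columns:
        total.update(column)
    return total


def one_hot(value, categories):
    """Encode a value against the global category list shared by all clients."""
    return [1 if value == c else 0 for c in categories]


def log_frequency_probs(counts, categories):
    """Flatten an imbalanced distribution by weighting each category by log(count + 1)."""
    weights = [math.log(counts[c] + 1) for c in categories]
    norm = sum(weights)
    return [w / norm for w in weights]


def sample_condition(counts, categories, rng):
    """Draw one category with rebalanced probabilities and return its one-hot vector."""
    probs = log_frequency_probs(counts, categories)
    value = rng.choices(categories, weights=probs, k=1)[0]
    return value, one_hot(value, categories)


# Two hypothetical clients holding the same, highly imbalanced categorical column.
clients = [
    ["yes"] * 95 + ["no"] * 5,
    ["yes"] * 90 + ["no"] * 8 + ["unknown"] * 2,
]
counts = aggregate_counts(clients)
categories = sorted(counts)  # global category order: no, unknown, yes

total = sum(counts.values())
raw = [counts[c] / total for c in categories]
rebalanced = log_frequency_probs(counts, categories)

value, condition_vector = sample_condition(counts, categories, random.Random(0))
```

Under these assumptions, the rare category `unknown` appears in only one client's data yet still gets a slot in the global one-hot encoding, and its sampling probability is raised well above its raw frequency, so a conditional generator would see it as a condition more often during training.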
Funder
National Natural Science Foundation of China
Subject
General Physics and Astronomy
Cited by
5 articles.
1. SCGAN: Semi-Centralized Generative Adversarial Network for image generation in distributed scenes; Information Fusion; 2024-12
2. FLIGAN; Proceedings of the 7th International Workshop on Edge Systems, Analytics and Networking; 2024-04-22
3. Federated learning for generating synthetic data: a scoping review; International Journal of Population Data Science; 2023-10-31
4. Attribute-Centric and Synthetic Data Based Privacy Preserving Methods: A Systematic Review; Journal of Cybersecurity and Privacy; 2023-09-11
5. Securing Federated GANs: Enabling Synthetic Data Generation for Health Registry Consortiums; Proceedings of the 18th International Conference on Availability, Reliability and Security; 2023-08-29