Authors
Wang Yuyang, Meng Xianjia, Liu Ximeng
Abstract
Deep learning techniques have been widely used in natural language processing (NLP) tasks and have made remarkable progress. However, training a deep learning model relies on a large amount of data, which may include sensitive information such as electronic medical records. An attacker can infer sensitive information from the trained model, which leads to privacy leakage. To solve this problem, we propose a Differentially Private Recurrent Variational AutoEncoder (DP-RVAE) that generates synthetic data to use in place of the sensitive dataset and so preserve privacy. To generate high-utility synthetic text, the model takes part of the sensitive text data as its conditional input and uses a dropout and noise-perturbation mechanism to preserve differential privacy. In addition, we extend DP-RVAE to a federated learning setting and design a novel training paradigm for NLP tasks. Specifically, DP-RVAE is deployed on the client side to train and generate personalized text. These DP-RVAE models are aggregated and updated through the Federated Optimisation (FedOPT) algorithm so that personal information is well preserved. We evaluate DP-RVAE on a text classification task using the Tweets depression sentiment and IMDB reviews datasets. DP-RVAE achieves an average test accuracy that is 5.90% and 3.94% higher than the typical centralized training and federated learning approaches, respectively. We also perform a keyword inference attack experiment on a medical description dataset collected from the real world. Compared to the typical differentially private approach, DP-RVAE reduces the average attack accuracy by 15.2%. The experimental results demonstrate that DP-RVAE can be applied to NLP models to achieve high accuracy while preserving privacy.
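The abstract names FedOPT as the aggregation step but gives no details. The following is a minimal sketch of one FedOpt-style server round, assuming flat weight vectors and plain SGD as the server optimizer; the function name `fedopt_server_step` and the `server_lr` parameter are hypothetical and are not taken from the paper, whose actual server optimizer and model layout may differ.

```python
from typing import List


def fedopt_server_step(global_weights: List[float],
                       client_weights: List[List[float]],
                       server_lr: float = 1.0) -> List[float]:
    """One FedOpt-style round with plain SGD on the server (illustrative only)."""
    if not client_weights:
        raise ValueError("need at least one client model")
    n = len(client_weights)
    # Average, per coordinate, how far each client moved from the global model.
    avg_delta = [
        sum(cw[i] - g for cw in client_weights) / n
        for i, g in enumerate(global_weights)
    ]
    # The server treats -avg_delta as a pseudo-gradient, so an SGD step is
    # global - lr * (-avg_delta) = global + lr * avg_delta.
    return [g + server_lr * d for g, d in zip(global_weights, avg_delta)]


if __name__ == "__main__":
    # Two hypothetical clients that each trained locally from the same global model.
    print(fedopt_server_step([0.0, 1.0], [[1.0, 1.0], [3.0, 3.0]], server_lr=0.5))
```

With `server_lr = 1.0` this reduces to plain federated averaging; swapping the last line for a momentum or Adam update on the same pseudo-gradient would give other members of the FedOpt family.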
Publisher
Springer Science and Business Media LLC
Subject
Computer Networks and Communications, Hardware and Architecture, Information Systems, Software
Cited by
2 articles.