Abstract
Most privacy-preserving machine learning methods are designed around continuous or numeric data, but categorical attributes are common in many application scenarios, including clinical and health records, census data, and survey data. Distance-based methods, in particular, have limited applicability to categorical data, since they do not capture the complexity of the relationships among different values of a categorical attribute. Although distance learning algorithms exist for categorical data, they may disclose private information about individual records if applied to a secret dataset. To address this problem, we introduce a differentially private family of algorithms for learning distances between any pair of values of a categorical attribute according to the way they are co-distributed with the values of other categorical attributes forming the so-called context. We define different variants of our algorithm and we show empirically that our approach consumes little privacy budget while providing accurate distances, making it suitable for distance-based applications, such as clustering and classification.
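The general idea of a context-based distance under differential privacy can be illustrated with a minimal sketch. It is not the algorithm of the paper: it assumes a single context attribute, publicly known value domains, add/remove-one-record neighbouring datasets (so each contingency count has sensitivity 1), the Laplace mechanism, and a Euclidean distance between conditional distributions. The function names, the attribute names `diagnosis` and `ward`, and the toy records are all hypothetical.

```python
import math
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace(0, scale) distribution.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_contingency(records, target, context, x_domain, y_domain, epsilon, rng):
    """Laplace-perturbed co-occurrence counts of (target value, context value).

    Assumption: adding or removing one record changes exactly one cell by 1,
    so noise of scale 1/epsilon on every cell of the public domain grid
    gives epsilon-differential privacy for the whole table.
    """
    counts = Counter((r[target], r[context]) for r in records)
    return {
        (x, y): counts[(x, y)] + laplace_noise(1.0 / epsilon, rng)
        for x in x_domain
        for y in y_domain
    }


def context_distance(noisy, y_domain, xi, xj):
    """Euclidean distance between the noisy estimates of P(Y | X=xi) and P(Y | X=xj).

    This is post-processing of the noisy table, so it uses no extra budget.
    """
    def conditional(x):
        row = [max(noisy[(x, y)], 0.0) for y in y_domain]  # clamp negative counts
        total = sum(row)
        if total == 0:
            return [1.0 / len(y_domain)] * len(y_domain)  # fall back to uniform
        return [c / total for c in row]

    pi, pj = conditional(xi), conditional(xj)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pi, pj)))


# Hypothetical toy data: values A and B share the same context profile, C differs.
X_DOMAIN = ["A", "B", "C"]
Y_DOMAIN = ["w1", "w2"]
records = (
    [{"diagnosis": "A", "ward": "w1"}] * 30 + [{"diagnosis": "A", "ward": "w2"}] * 10
    + [{"diagnosis": "B", "ward": "w1"}] * 30 + [{"diagnosis": "B", "ward": "w2"}] * 10
    + [{"diagnosis": "C", "ward": "w1"}] * 5 + [{"diagnosis": "C", "ward": "w2"}] * 35
)

rng = random.Random(0)
table = noisy_contingency(records, "diagnosis", "ward", X_DOMAIN, Y_DOMAIN, 1.0, rng)
for a, b in [("A", "B"), ("A", "C"), ("B", "C")]:
    print(a, b, round(context_distance(table, Y_DOMAIN, a, b), 3))
```

In this sketch the privacy budget is spent once, on the noisy table; every pairwise distance is then derived from it at no further cost. With several context attributes, one such table per attribute would be needed and the budgets would add up under sequential composition.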
Publisher
Springer Science and Business Media LLC
Subject
Computer Networks and Communications,Computer Science Applications,Information Systems