Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack

Published: 2024-09-03
ISSN:1384-5810
Container-title:Data Mining and Knowledge Discovery
language:en
Short-container-title:Data Min Knowl Disc

Author:

Manzanares-Salor Benet, Sánchez David, Lison Pierre

Abstract

The availability of textual data depicting human-centered features and behaviors is crucial for many data mining and machine learning tasks. However, data containing personal information should be anonymized prior to making them available for secondary use. A variety of text anonymization methods have been proposed in recent years, which are typically evaluated by comparing their outputs with human-based anonymizations. The residual disclosure risk is estimated with the recall metric, which quantifies the proportion of manually annotated re-identifying terms successfully detected by the anonymization algorithm. Nevertheless, recall is not a risk metric, which leads to several drawbacks. First, it requires a unique ground truth, and this does not hold for text anonymization, where several masking choices could be equally valid to prevent re-identification. Second, it relies on human judgements, which are inherently subjective and prone to errors. Finally, the recall metric weights terms uniformly, thereby ignoring the fact that some missed terms may influence the disclosure risk much more than others. To overcome these drawbacks, in this paper we propose a novel method to evaluate the disclosure risk of anonymized texts by means of an automated re-identification attack. We formalize the attack as a multi-class classification task and leverage state-of-the-art neural language models to aggregate the data sources that attackers may use to build the classifier. We illustrate the effectiveness of our method by assessing the disclosure risk of several methods for text anonymization under different attack configurations. Empirical results show substantial privacy risks for most existing anonymization methods.
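
The uniform-weighting drawback of recall described in the abstract can be illustrated with a minimal sketch. The function name `recall`, the example terms, and the two hypothetical anonymizers below are assumptions made for illustration only; they are not taken from the paper. Recall is computed here as the share of manually annotated re-identifying terms that the anonymizer masked.

```python
def recall(annotated_terms, masked_terms):
    """Fraction of manually annotated re-identifying terms that were masked."""
    annotated = set(annotated_terms)
    if not annotated:
        # Nothing to detect: treat as perfect recall (an assumption of this sketch).
        return 1.0
    return len(annotated & set(masked_terms)) / len(annotated)


# Hypothetical ground-truth annotation for one document.
annotated = {"Jane Doe", "Oslo", "1984", "cardiologist"}

# Two hypothetical anonymizers, each missing exactly one annotated term.
system_a = {"Oslo", "1984", "cardiologist"}  # leaves the person's name unmasked
system_b = {"Jane Doe", "Oslo", "1984"}      # leaves the occupation unmasked

recall_a = recall(annotated, system_a)
recall_b = recall(annotated, system_b)
print(recall_a, recall_b)
```

Both systems obtain the same recall even though leaving a full name unmasked would plausibly make re-identification far easier than leaving an occupation unmasked, which is the kind of difference a uniformly weighted metric cannot express.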

Funder

European Commission

Norges Forskningsråd

Ministerio de Ciencia, Innovación y Universidades

Departament d'Innovació, Universitats i Empresa, Generalitat de Catalunya

Instituto Nacional de Ciberseguridad

Universitat Rovira i Virgili

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1007/s10618-024-01066-3.pdf

References (78 articles)

1. Aberdeen J, Bayer S, Yeniterzi R, Wellner B, Clark C, Hanauer D, Malin B, Hirschman L (2010) The MITRE identification scrubber toolkit: design, training, and assessment. Int J Med Informatics 79:849–859. https://doi.org/10.1016/j.ijmedinf.2010.09.007

2. Abril D, Navarro-Arribas G, Torra V (2012) Improving record linkage with supervised learning for disclosure risk assessment. Info Fus 13:274–284

3. Abril D, Torra V, Navarro-Arribas G (2015) Supervised learning using a symmetric bilinear form for record linkage. Info Fus 26:144–153. https://doi.org/10.1016/j.inffus.2014.11.004

4. Agrawal S, Haritsa JR, Prakash BA (2009) FRAPP: a framework for high-accuracy privacy-preserving mining. Data Min Knowl Disc 18:101–139. https://doi.org/10.1007/s10618-008-0119-9

5. Anandan B, Clifton C, Jiang W, Murugesan M, Pastrana-Camacho P, Si L (2012) t-Plausibility: generalizing words to desensitize text. Trans Data Priv 5:505–534