The Text Anonymization Benchmark (TAB): A Dedicated Corpus and Evaluation Framework for Text Anonymization

Published:2022 Issue:4 Volume:48 Page:1053-1101
ISSN:0891-2017
Container-title:Computational Linguistics
language:en

Author:

Pilán Ildikó¹, Lison Pierre², Øvrelid Lilja³, Papadopoulou Anthi⁴, Sánchez David⁵, Batet Montserrat⁶

Affiliation:

1. Norwegian Computing Center, Oslo, Norway. pilan@nr.no

2. Norwegian Computing Center, Oslo, Norway. plison@nr.no

3. Language Technology Group, University of Oslo, Norway. liljao@ifi.uio.no

4. Language Technology Group, University of Oslo, Norway. anthip@ifi.uio.no

5. Universitat Rovira i Virgili, CYBERCAT UNESCO Chair in Data Privacy, Spain. david.sanchez@urv.cat

6. Universitat Rovira i Virgili, CYBERCAT UNESCO Chair in Data Privacy, Spain. montserrat.batet@urv.cat

Abstract

We present a novel benchmark and associated evaluation metrics for assessing the performance of text anonymization methods. Text anonymization, defined as the task of editing a text document to prevent the disclosure of personal information, currently suffers from a shortage of privacy-oriented annotated text resources, making it difficult to properly evaluate the level of privacy protection offered by various anonymization methods. This paper presents TAB (Text Anonymization Benchmark), a new, open-source annotated corpus developed to address this shortage. The corpus comprises 1,268 English-language court cases from the European Court of Human Rights (ECHR) enriched with comprehensive annotations about the personal information appearing in each document, including their semantic category, identifier type, confidential attributes, and co-reference relations. Compared with previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected. Along with presenting the corpus and its annotation layers, we also propose a set of evaluation metrics that are specifically tailored toward measuring the performance of text anonymization, both in terms of privacy protection and utility preservation. We illustrate the use of the benchmark and the proposed metrics by assessing the empirical performance of several baseline text anonymization models. The full corpus along with its privacy-oriented annotation guidelines, evaluation scripts, and baseline models are available on: https://github.com/NorskRegnesentral/text-anonymization-benchmark.

Publisher

MIT Press

Subject

Artificial Intelligence, Computer Science Applications, Linguistics and Language, Language and Linguistics

Link

https://direct.mit.edu/coli/article-pdf/48/4/1053/2062009/coli_a_00458.pdf

Cited by 19 articles.

1. Exploring How UK Public Authorities Use Redaction to Protect Personal Information;ACM Transactions on Management Information Systems;2024-09-11

2. Evaluating the disclosure risk of anonymized documents via a machine learning-based re-identification attack;Data Mining and Knowledge Discovery;2024-09-03

3. Comparing Feature-based and Context-aware Approaches to PII Generalization Level Prediction;2024 International Conference on Asian Language Processing (IALP);2024-08-04

4. Privacy Preservation of Large Language Models in the Metaverse Era: Research Frontiers, Categorical Comparisons, and Future Directions;International Journal of Network Management;2024-07-29

5. LLM-PBE: Assessing Data Privacy in Large Language Models;Proceedings of the VLDB Endowment;2024-07