Automatic Curation of Court Documents: Anonymizing Personal Data

Published: 2022-01-10 Issue: 1 Volume: 13 Page: 27
ISSN: 2078-2489
Container-title: Information
Language: en
Short-container-title: Information

Author:

Garat Diego, Wonsever Dina

Abstract

In order to provide open access to data of public interest, it is often necessary to perform several data curation processes. In some cases, such as biological databases, curation involves quality control to ensure reliable experimental support for biological sequence data. In others, such as medical records or judicial files, publication must not interfere with the right to privacy of the persons involved. Published data may also be processed to generate metadata that enable a better querying and navigation experience. In all cases, the curation process constitutes a bottleneck that slows down general access to the data, so automatic or semi-automatic curation processes are of great interest. In this paper, we present a solution aimed at the automatic curation of our National Jurisprudence Database, with special focus on the anonymization of personal information. The anonymization process aims to hide the names of the participants involved in a lawsuit without losing the meaning of the narrative of facts. To achieve this goal, we need not only to recognize person names but also to resolve co-references, so that all mentions of the same person receive the same label. Our corpus has significant differences in the spelling of person names, so it was clear from the beginning that pre-existing tools would not perform well. The challenge was to find a good way of injecting specialized knowledge about the syntax of person names while taking advantage of the existing capabilities of pre-trained tools. We fine-tuned an NER analyzer and built a clustering algorithm to resolve co-references between named entities. We present our first results, which are promising for both tasks: we obtained a micro-averaged F1 of 90.21% in the NER task (up from 39.99% before retraining the same analyzer on our corpus) and an ARI score of 95.95% in clustering for co-reference resolution.
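
The idea of assigning the same label to all mentions of the same person can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes the person mentions have already been extracted by an NER step, and the similarity rule (accent-insensitive token matching with a hypothetical threshold of 0.85), the function names, and the `PERSON_n` label format are all assumptions made for illustration.

```python
import re
import unicodedata
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, strip accents, and split a mention into tokens."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.split()


def same_person(a, b, threshold=0.85):
    """Hypothetical rule: every token of the shorter mention must closely
    match some token of the longer mention."""
    ta, tb = normalize(a), normalize(b)
    short, long_ = (ta, tb) if len(ta) <= len(tb) else (tb, ta)
    return all(
        any(SequenceMatcher(None, s, l).ratio() >= threshold for l in long_)
        for s in short
    )


def cluster_mentions(mentions):
    """Greedy clustering: a mention joins the first cluster that already
    holds a matching mention, otherwise it starts a new cluster."""
    clusters = []
    for mention in mentions:
        for cluster in clusters:
            if any(same_person(mention, other) for other in cluster):
                cluster.append(mention)
                break
        else:
            clusters.append([mention])
    return clusters


def anonymize(text, mentions):
    """Replace each mention with the label of its cluster."""
    clusters = cluster_mentions(mentions)
    label_of = {
        mention: f"PERSON_{i}"
        for i, cluster in enumerate(clusters, start=1)
        for mention in cluster
    }
    # Replace longer mentions first so a full name is not split by a surname.
    for mention in sorted(label_of, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(mention) + r"\b", label_of[mention], text)
    return text


text = "Juan Pérez sued María Gómez. Pérez claimed that Gomez owed him money."
mentions = ["Juan Pérez", "María Gómez", "Pérez", "Gomez"]
print(anonymize(text, mentions))
```

Because both mentions of each party share a label, the narrative stays readable after the names are hidden. A greedy pass like this is sensitive to mention order and to ambiguous surnames, so it is only a starting point for experimentation.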

Funder

Agencia Nacional de Investigación e Innovación

Publisher

MDPI AG

Subject

Information Systems

Link

https://www.mdpi.com/2078-2489/13/1/27/pdf

Cited by 5 articles.

1. Text mining and machine learning for crime classification: using unstructured narrative court documents in police academic;Cogent Engineering;2024-06-03

2. Procedure informatiche di tutela della trasparenza e riservatezza dei dati;Studi e saggi;2024

3. An offline English optical character recognition and NER using LSTM and adaptive neuro-fuzzy inference system;Journal of Intelligent & Fuzzy Systems;2023-03-09

4. Towards a human-in-the-loop curation: A qualitative perspective;2022 IEEE/ACS 19th International Conference on Computer Systems and Applications (AICCSA);2022-12

5. Toward Privacy Preservation Using Clustering Based Anonymization: Recent Advances and Future Research Outlook;IEEE Access;2022