Abstract
In order to provide open access to data of public interest, it is often necessary to perform several data curation processes. In some cases, such as biological databases, curation involves quality control to ensure reliable experimental support for biological sequence data. In others, such as medical records or judicial files, publication must not interfere with the right to privacy of the persons involved. There are also interventions in the published data with the aim of generating metadata that enable a better experience of querying and navigation. In all cases, the curation process constitutes a bottleneck that slows down general access to the data, so it is of great interest to have automatic or semi-automatic curation processes. In this paper, we present a solution aimed at the automatic curation of our National Jurisprudence Database, with special focus on the process of the anonymization of personal information. The anonymization process aims to hide the names of the participants involved in a lawsuit without losing the meaning of the narrative of facts. In order to achieve this goal, we need, not only to recognize person names but also resolve co-references in order to assign the same label to all mentions of the same person. Our corpus has significant differences in the spelling of person names, so it was clear from the beginning that pre-existing tools would not be able to reach a good performance. The challenge was to find a good way of injecting specialized knowledge about person names syntax while taking profit of previous capabilities of pre-trained tools. We fine-tuned an NER analyzer and we built a clusterization algorithm to solve co-references between named entities. We present our first results, which, for both tasks, are promising: We obtained a 90.21% of F1-micro in the NER task—from a 39.99% score before retraining the same analyzer in our corpus—and a 95.95% ARI score in clustering for co-reference resolution.
Funder
Agencia Nacional de Investigación e Innovación
Reference53 articles.
1. Protección de Datos Personales y Acción de “Habeas Data”,2008
2. Replacing personally-identifying information in medical records, the Scrub system;Sweeney;AMIA Annu. Symp. Proc.,1996
3. A Joint Model for Entity Analysis: Coreference, Typing, and Linking
4. Entity Linking
5. Natural Language Processing with Python;Bird,2009
Cited by
5 articles.
订阅此论文施引文献
订阅此论文施引文献,注册后可以免费订阅5篇论文的施引文献,订阅后可以查看论文全部施引文献