Named Entity Recognition for Sensitive Data Discovery in Portuguese

Published: 2020-03-27 Issue: 7 Volume: 10 Page: 2303
ISSN: 2076-3417
Container-title: Applied Sciences
Language: en
Short-container-title: Applied Sciences

Author:

Dias Mariana, Boné João, Ferreira João C., Ribeiro Ricardo, Maia Rui

Abstract

Protecting sensitive data is becoming increasingly important, particularly as a result of the directives and laws imposed by the European Union. Efforts to build automatic systems are ongoing, but in most cases the underlying processes are still manual or semi-automatic. In this work, we developed a component that extracts and classifies sensitive data from unstructured text in European Portuguese. The objective was to create a system that allows organizations to understand their data and meet legal and security requirements. We studied a hybrid approach to the problem of Named Entity Recognition for the Portuguese language. This approach combines several techniques, such as rule-based and lexical-based models, machine learning algorithms, and neural networks. The rule-based and lexical-based approaches were used only for a specific set of classes. For the remaining entity classes, two statistical models were tested, Conditional Random Fields and Random Forest, followed by a Bidirectional LSTM (Bi-LSTM) approach. Of the statistical models, Conditional Random Fields obtained the best results, with an F1-score of 65.50%. With the Bi-LSTM approach, we achieved a result of 83.01%. The corpora used for training and testing were the HAREM Golden Collection, the SIGARRA News Corpus, and the DataSense NER Corpus.

Publisher

MDPI AG

Subject

Fluid Flow and Transfer Processes, Computer Science Applications, Process Chemistry and Technology, General Engineering, Instrumentation, General Materials Science

Link

https://www.mdpi.com/2076-3417/10/7/2303/pdf

Reference: 41 articles.

1. Handbook of Natural Language Processing, 2000

2. A survey of named entity recognition and classification;Nadeau;Lingvist. Investig., 2007

3. Private data discovery for privacy compliance in collaborative environments;Korba, 2008

Cited by 24 articles.

1. The Development of a Named Entity Recognizer for Detecting Personal Information Using a Korean Pretrained Language Model;Applied Sciences;2024-06-28

2. Sensitive data identification for multi‐category and multi‐scenario data;Transactions on Emerging Telecommunications Technologies;2024-04-25

3. IoT-AID: An Automated Decision Support Framework for IoT;SN Computer Science;2024-04-13

4. UP-SDCG: A Method of Sensitive Data Classification for Collaborative Edge Computing in Financial Cloud Environment;Future Internet;2024-03-18

5. Uma revisão para o Reconhecimento de Entidades Nomeadas aplicado à língua portuguesa;Linguamática;2023-12-30