Improving dictionary-based named entity recognition with deep learning

Published: 2024-09-01
Issue: Supplement_2
Volume: 40
Page: ii45-ii52
ISSN: 1367-4803
Container-title: Bioinformatics
Language: en

Author:

Nastou Katerina¹, Koutrouli Mikaela¹, Pyysalo Sampo², Jensen Lars Juhl¹

Affiliation:

1. Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Blegdamsvej 3, Copenhagen, 2200, Denmark

2. TurkuNLP Group, Department of Computing, University of Turku, Turku, 20014, Finland

Abstract

Motivation

Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter are so far created by identifying cases that cause many false positives through manual inspection of individual names, a process that scales poorly.

Results

In this work, we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of entity tagged. These were used to generate positive and negative examples of contexts for four entity types (genes, diseases, species, and chemicals), which were used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. In addition, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and over 8.5% for chemical and 7.5% for gene names, positively affecting several biological databases utilizing this NER system, like the STRING database, with only a minor drop in recall (0.6%).

Availability and implementation

All resources are available through Zenodo (https://doi.org/10.5281/zenodo.11243139) and GitHub (https://doi.org/10.5281/zenodo.10289360).
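The two data-handling ideas in the abstract, keeping only spans on which several taggers agree and flagging names whose predicted type usually contradicts the dictionary, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Span` record, the function names, and the `min_mentions` and `max_support` thresholds are all hypothetical, and the classifier predictions are assumed to be supplied as plain tuples.

```python
from collections import Counter
from typing import NamedTuple


class Span(NamedTuple):
    """Hypothetical record for one tagged mention."""
    doc_id: str
    start: int
    end: int
    entity_type: str


def consensus_spans(*tagger_outputs):
    """Return spans that every tagger reports with the same boundaries and type."""
    sets = [set(spans) for spans in tagger_outputs]
    return set.intersection(*sets) if sets else set()


def names_to_block(predictions, min_mentions=3, max_support=0.5):
    """Flag names whose dictionary type is rarely confirmed by the classifier.

    predictions: iterable of (name, dictionary_type, predicted_type) tuples.
    min_mentions and max_support are illustrative thresholds, not values
    taken from the paper.
    """
    total = Counter()
    agree = Counter()
    for name, dict_type, pred_type in predictions:
        total[name] += 1
        if dict_type == pred_type:
            agree[name] += 1
    return {
        name
        for name, n in total.items()
        if n >= min_mentions and agree[name] / n < max_support
    }
```

Applying `names_to_block` to predictions from a single document rather than the whole corpus would give a document-specific variant of the same idea, again only as an assumption about how such a list could be derived.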

Funder

Novo Nordisk Foundation

Academy of Finland

European Union’s Horizon 2020

Marie Sklodowska-Curie

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/bioinformatics/article-pdf/40/Supplement_2/ii45/59017053/btae402.pdf
