
Improving dictionary-based named entity recognition with deep learning

Published: 2023-12-11

Author:

Nastou Katerina, Koutrouli Mikaela, Pyysalo Sampo, Jensen Lars Juhl

Abstract

Dictionary-based named entity recognition (NER) allows terms to be detected in a corpus and normalized to biomedical databases and ontologies. However, adaptation to different entity types requires new high-quality dictionaries and associated lists of blocked names for each type. The latter have so far been created by manually inspecting individual names to identify those that cause many false positives, a process that scales poorly.

In this work we aim to improve block lists by automatically identifying names to block, based on the context in which they appear. By comparing the results of three well-established biomedical NER methods, we generated a dataset of over 12.5 million text spans where the methods agree on the boundaries and type of the entity tagged. These were used to generate positive and negative examples of contexts for four entity types, namely genes, diseases, species and chemicals, which were then used to train a Transformer-based model (BioBERT) to perform entity type classification. Application of the best model (F1-score = 96.7%) allowed us to generate a list of problematic names that should be blocked. Introducing this into our system doubled the size of the previous list of corpus-wide blocked names. Additionally, we generated a document-specific list that allows ambiguous names to be blocked in specific documents. These changes boosted text mining precision by ∼5.5% on average, and by over 8% for chemical and gene names, with only a minor drop in recall (0.6%). This positively affects several biological databases, such as the STRING database, that use this NER system to extract associations between biomedical entities.
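The two list-based steps mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that each NER method's output is available as a collection of spans, and that block lists are matched by exact name. The names `Span`, `Match`, `consensus_spans` and `apply_block_lists` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A tagged text span (hypothetical structure, for illustration only)."""
    doc_id: str
    start: int        # character offset where the span begins
    end: int          # character offset where the span ends
    entity_type: str  # e.g. "gene", "disease", "species", "chemical"


class Match(NamedTuple):
    """A dictionary match of a name in a document (hypothetical structure)."""
    doc_id: str
    name: str
    entity_type: str


def consensus_spans(*method_outputs):
    """Keep only spans on which every method agrees in boundaries and type."""
    sets = [set(spans) for spans in method_outputs]
    if not sets:
        return set()
    return set.intersection(*sets)


def apply_block_lists(matches, corpus_blocked, document_blocked):
    """Drop matches whose name is blocked corpus-wide or in that document.

    Assumption: corpus_blocked is a set of names, and document_blocked is a
    set of (doc_id, name) pairs; both are matched exactly.
    """
    kept = []
    for match in matches:
        if match.name in corpus_blocked:
            continue
        if (match.doc_id, match.name) in document_blocked:
            continue
        kept.append(match)
    return kept
```

In this sketch, a name on the corpus-wide list is suppressed everywhere, while a name on the document-specific list is suppressed only in the listed documents, so an ambiguous name can still be tagged elsewhere.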

Publisher

Cold Spring Harbor Laboratory

References (37 articles)

1. Gene Ontology: tool for the unification of biology.

2. J. X. Binder, S. Pletscher-Frankild, K. Tsafou, C. Stolte, S. I. O’Donoghue, R. Schneider, and L. J. Jensen. Compartments: unification and visualization of protein subcellular localization evidence. Database, 2014, 2014.

3. Complex event extraction at PubMed scale.

4. J. S. Bridle. Probabilistic interpretation of feedforward classification network outputs, with relationships to statistical pattern recognition. In Neurocomputing, pages 227–236. Springer, 1990.

5. PMC text mining subset in BioC: about three million full-text articles and growing. Bioinformatics, 2019.

Cited by 1 article.

1. Lifestyle factors in the biomedical literature: comprehensive resources for named entity recognition. 2024-06-16.