CoNECo: A Corpus for Named Entity recognition and normalization of protein Complexes

Published: 2024-05-21

Authors:

Nastou Katerina, Koutrouli Mikaela, Pyysalo Sampo, Jensen Lars Juhl

Abstract

Motivation: Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus.

Results: We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1,621 documents with 2,052 entities, 1,976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F1-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature.

Availability: All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo (https://zenodo.org/records/11263147) and GitHub (https://zenodo.org/records/10693653).

Publisher

Cold Spring Harbor Laboratory
