The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop

Published:2024 Volume:2024
ISSN:1758-0463
Container-title:Database
language:en

Author:

Islamaj Rezarta¹, Wei Chih-Hsuan¹, Lai Po-Ting¹, Luo Ling², Coss Cathleen¹, Gokal Kochar Preeti¹, Miliaras Nicholas¹, Rodionov Oleg¹, Sekiya Keiko¹, Trinh Dorothy¹, Whitman Deborah¹, Lu Zhiyong¹

Affiliation:

1. National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States

2. School of Computer Science and Technology, Dalian University of Technology, No.2 Linggong Road, Ganjingzi District, Dalian, Liaoning 116024, China

Abstract

The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing participants with the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, genes/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them, which are disease–gene, chemical–gene, disease–variant, gene–gene, chemical–disease, chemical–chemical, chemical–variant, and variant–variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies: diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to the Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as ‘novel’ depending on whether it is a novel finding or experimental verification in the publication in which it is expressed. This distinction helps differentiate novel findings from other relationships in the same text that provide known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of 400 newly published articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development.

Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381
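
The annotation scheme described in the abstract can be pictured as a small data model. The following is a minimal sketch in Python, assuming hypothetical class and field names (`Entity`, `Relation`, `is_valid`) and placeholder identifiers; it is not the corpus's actual file format or the organizers' code. It only encodes what the abstract states: six entity types with their vocabularies, eight permitted entity-type pairs, eight relation categories, and a novelty flag attached to each document-level relation.

```python
from dataclasses import dataclass

# Entity types and the vocabulary each is normalized to, as listed in the abstract.
VOCABULARY = {
    "disease": "MeSH",
    "chemical": "MeSH",
    "gene": "NCBI Gene",
    "species": "NCBI Taxonomy",
    "cell line": "Cellosaurus",
    "variant": "dbSNP",  # Single Nucleotide Polymorphism Database
}

# The eight entity-type pairs named in the abstract. Treating the pairs as
# unordered (frozenset) is an assumption of this sketch.
ALLOWED_PAIRS = {
    frozenset(pair)
    for pair in [
        ("disease", "gene"), ("chemical", "gene"), ("disease", "variant"),
        ("gene", "gene"), ("chemical", "disease"), ("chemical", "chemical"),
        ("chemical", "variant"), ("variant", "variant"),
    ]
}

# The eight semantic categories named in the abstract.
RELATION_TYPES = {
    "positive correlation", "negative correlation", "binding", "conversion",
    "drug interaction", "comparison", "cotreatment", "association",
}


@dataclass(frozen=True)
class Entity:
    entity_type: str  # one of the keys of VOCABULARY
    concept_id: str   # identifier in the matching vocabulary (placeholder here)


@dataclass(frozen=True)
class Relation:
    document_id: str  # relations are document-level, not sentence-level
    first: Entity
    second: Entity
    relation_type: str
    novel: bool       # True for a novel finding, False for background knowledge


def is_valid(rel: Relation) -> bool:
    """Check a relation against the pair and category lists above."""
    pair = frozenset((rel.first.entity_type, rel.second.entity_type))
    return pair in ALLOWED_PAIRS and rel.relation_type in RELATION_TYPES
```

Under these assumptions, a disease–gene relation with category "association" passes `is_valid`, while a species–gene relation does not, since species are annotated as entities but do not appear in any of the listed relation pairs.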

Funder

National Natural Science Foundation of China

the NIH Intramural Research Program, National Library of Medicine

Publisher

Oxford University Press (OUP)

Link

https://academic.oup.com/database/article-pdf/doi/10.1093/database/baae071/58790757/baae071.pdf
