Plant Science Knowledge Graph Corpus: a gold standard entity and relation corpus for the molecular plant sciences

Author:

Lotreck Serena (1, 2), Segura Abá Kenia (3, 4), Lehti-Shiu Melissa D (1), Seeger Abigail (1, 5), Brown Brianna N I (1), Ranaweera Thilanka (1, 4), Schumacher Ally (1), Ghassemi Mohammad (6), Shiu Shin-Han (1, 2, 3, 4)

Affiliation:

1. Department of Plant Biology, Michigan State University, East Lansing, MI 48824, USA

2. Department of Computational Mathematics, Science & Engineering, Michigan State University, East Lansing, MI 48824, USA

3. Program in Genetics and Genome Sciences, Michigan State University, East Lansing, MI 48824, USA

4. DOE Great Lakes Bioenergy Research Center, Michigan State University, East Lansing, MI 48824, USA

5. Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA

6. Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA

Abstract

Natural language processing (NLP) techniques can enhance our ability to interpret plant science literature. Many state-of-the-art algorithms for NLP tasks require high-quality labelled data in the target domain, in which entities like genes and proteins, as well as the relationships between entities, are labelled according to a set of annotation guidelines. Such datasets exist for other domains, but comparable resources still need to be developed for the plant sciences. Here, we present the Plant ScIenCe KnowLedgE Graph (PICKLE) corpus, a collection of 250 plant science abstracts annotated with entities and relations, along with its annotation guidelines. The annotation guidelines were refined through iterative rounds of overlapping annotations, in which inter-annotator agreement was used to improve the guidelines. To demonstrate PICKLE’s utility, we evaluated the performance of pretrained models from other domains and trained a new, PICKLE-based model for entity and relation extraction (RE). The PICKLE-trained models exhibit the second-highest in-domain entity performance of all models evaluated, as well as RE performance that is on par with other models. Additionally, we found that computer science-domain models outperformed models trained on a biomedical corpus (GENIA) in entity extraction, which was unexpected given the intuition that biomedical literature is more similar to PICKLE than computer science literature is. Upon further exploration, we established that the inclusion of new entity types on which the models were not trained substantially impacts performance. The PICKLE corpus is, therefore, an important contribution to training resources for entity and relation extraction in the plant sciences.
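As an illustration of how agreement on overlapping annotations might be quantified, the sketch below computes an exact-match F1 score between two annotators' entity sets. This is only an assumption for illustration: the abstract does not state which agreement measure was used, and the `Entity` tuple layout, the `pairwise_f1` helper and the type labels are all hypothetical.

```python
from typing import Set, Tuple

# Hypothetical entity representation: (start offset, end offset, type label).
Entity = Tuple[int, int, str]


def pairwise_f1(a: Set[Entity], b: Set[Entity]) -> float:
    """Exact-match F1 between two annotators, treating `a` as the reference."""
    if not a and not b:
        return 1.0  # both annotators marked nothing: perfect agreement
    matched = len(a & b)  # same span and same type
    if matched == 0:
        return 0.0
    precision = matched / len(b)
    recall = matched / len(a)
    return 2 * precision * recall / (precision + recall)


# Illustrative annotations of one abstract (made-up offsets and labels).
annotator_1 = {(0, 4, "Gene"), (10, 18, "Protein"), (25, 33, "Organism")}
annotator_2 = {(0, 4, "Gene"), (10, 18, "Gene"), (25, 33, "Organism")}

score = pairwise_f1(annotator_1, annotator_2)
print(f"agreement F1: {score:.3f}")
```

In this toy case the two annotators agree on two of three entities and disagree on the type of the third, which is the kind of disagreement that could prompt a guideline revision.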

Funder

National Science Foundation

the U.S. Department of Energy Great Lakes Bioenergy Research Center

Publisher

Oxford University Press (OUP)

Subject

Plant Science; Agronomy and Crop Science; Biochemistry, Genetics and Molecular Biology (miscellaneous); Modeling and Simulation

