COPIOUS: A gold standard corpus of named entities towards extracting species occurrence from biodiversity literature

Published:2019-01-22 Volume:7
ISSN:1314-2828
Container-title:Biodiversity Data Journal
Short-container-title:BDJ

Author:

Nguyen Nhung, Gabud Roselyn, Ananiadou Sophia

Abstract

Background: Species occurrence records are very important in the biodiversity domain. While several available corpora contain only annotations of species names or habitats and geographical locations, there is no consolidated corpus that covers all types of entities necessary for extracting species occurrence from biodiversity literature. In order to alleviate this issue, we have constructed the COPIOUS corpus, a gold standard corpus that covers a wide range of biodiversity entities.

Results: Two annotators manually annotated the corpus with five categories of entities, i.e. taxon names, geographical locations, habitats, temporal expressions and person names. The overall inter-annotator agreement on 200 doubly-annotated documents is approximately 81.86% F-score. Amongst the five categories, the agreement on habitat entities was the lowest, indicating that this type of entity is complex. The COPIOUS corpus consists of 668 documents downloaded from the Biodiversity Heritage Library with over 26K sentences and more than 28K entities. Named entity recognisers trained on the corpus could achieve an F-score of 74.58%. Moreover, in recognising taxon names, our model performed better than two available tools in the biodiversity domain, namely the SPECIES tagger and the Global Name Recognition and Discovery. More than 1,600 binary relations of Taxon-Habitat, Taxon-Person, Taxon-Geographical locations and Taxon-Temporal expressions were identified by applying a pattern-based relation extraction system to the gold standard. Based on the extracted relations, we can produce a knowledge repository of species occurrences.

Conclusion: The paper describes in detail the construction of a gold standard named entity corpus for the biodiversity domain. An investigation of the performance of named entity recognition (NER) tools trained on the gold standard revealed that the corpus is sufficiently reliable and sizeable for both training and evaluation purposes. The corpus can be further used for relation extraction to locate species occurrences in literature, a useful task for monitoring species distribution and preserving biodiversity.

Publisher

Pensoft Publishers

Subject

Ecology; Ecology, Evolution, Behavior and Systematics

Link

https://bdj.pensoft.net/article/29626/download/pdf/

Reference: 49 articles.

1. NetiNeti: discovery of scientific names from text using machine learning methods

2. Supporting Biological Pathway Curation Through Text Mining

3. An overview of the BioCreative 2012 Workshop Track III: interactive text mining task

4. Interoperability of corpus processing workflow engines: the case of AlvisNLP/ML in OpenMinTeD;Ba;Proceedings of the Workshops on Cross-Platform Text Mining and Natural Language Processing Interoperability (INTEROP 2016);2016

Cited by 18 articles.

1. The changing landscape of text mining: a review of approaches for ecology and evolution;Proceedings of the Royal Society B: Biological Sciences;2024-07-31

2. TaeC: A manually annotated text dataset for trait and phenotype extraction and entity linking in wheat breeding literature;PLOS ONE;2024-06-13

3. Unsupervised literature mining approaches for extracting relationships pertaining to habitats and reproductive conditions of plant species;Frontiers in Artificial Intelligence;2024-05-23

4. Large language models help facilitate the automated synthesis of information on potential pest controllers;Methods in Ecology and Evolution;2024-05-20

5. CoastTerm: A Corpus for Multidisciplinary Term Extraction in Coastal Scientific Literature;Lecture Notes in Computer Science;2024