Abstract
Given the biodiversity crisis, we need, more than ever, to access
information on multiple taxa (e.g. distribution, traits, diet) in the
scientific literature in order to understand, map and predict
all-inclusive biodiversity. Tools are needed to automatically extract useful
information from the ever-growing corpus of ecological texts and feed
this information to open data repositories. A prerequisite is the
ability to recognise mentions of taxa in text, a special case of named
entity recognition (NER). In recent years, deep learning-based NER
systems have become ubiquitous, yielding state-of-the-art results in the
general and biomedical domains. However, no such tool is available to
ecologists wishing to extract information from the biodiversity
literature.
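To make the task concrete, the simplest conceivable taxon recogniser is a regular expression for Latin binomials ("Genus species"). The toy sketch below (illustrative only, not part of TaxoNERD; the text and function name are invented for this example) shows both what such a pattern catches and why it is insufficient: capitalised sentence openers like "The diet" match the same surface pattern, and vernacular names, abbreviations ("L. lutra") and misspellings are missed entirely.

```python
import re

# Naive pattern for Latin binomials: a capitalised genus followed by a
# lowercase epithet. Illustrative baseline only, not TaxoNERD's method.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+ [a-z]{3,})\b")

def naive_taxon_ner(text):
    """Return (start, end, mention) spans matched by the binomial pattern."""
    return [(m.start(), m.end(), m.group(1)) for m in BINOMIAL.finditer(text)]

text = "The diet of Lutra lutra includes Salmo trutta and other fish."
print(naive_taxon_ner(text))
# -> [(0, 8, 'The diet'), (12, 23, 'Lutra lutra'), (33, 45, 'Salmo trutta')]
```

The false positive "The diet" and the missed vernacular mention "fish" illustrate the variability of taxon naming that motivates learned, context-aware models rather than surface patterns.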
We propose a new tool called TaxoNERD that provides two deep neural
network (DNN) models to recognise taxon mentions in ecological
documents. To achieve high performance, DNN-based NER models usually
need to be trained on a large corpus of manually annotated text.
Creating such a gold standard corpus (GSC) is a laborious and costly
process, with the result that GSCs in the ecological domain tend to be
too small to learn an accurate DNN model from scratch. To address this
issue, we leverage existing DNN models pretrained on large biomedical
corpora using transfer learning. The performance of our models is
evaluated on four GSCs and compared to the most popular taxonomic NER
tools.
Our experiments suggest that existing taxonomic NER tools are not
suited to the extraction of ecological information from text: they
performed poorly on ecologically-oriented corpora, either because they
do not account for the variability of taxon naming practices, or
because they do not generalise well to the ecological domain.
Conversely, a domain-specific DNN-based tool like TaxoNERD outperformed
the other approaches on an ecological information extraction
task.
Efforts are needed to raise ecological information
extraction to the same level of performance as its biomedical
counterpart. One promising direction is to leverage the huge corpus of
unlabelled ecological texts to learn a language representation model
that could benefit downstream tasks. These efforts could be highly
beneficial to ecologists in the long term.
Publisher: Cold Spring Harbor Laboratory