Using natural language processing to extract plant functional traits from unstructured text

Author:

Domazetoski Viktor, Kreft Holger, Bestova Helena, Wieder Philipp, Koynov Radoslav, Zarei Alireza, Weigelt Patrick

Abstract

Functional plant ecology aims to understand how functional traits govern the distribution of species along environmental gradients, the assembly of communities, and ecosystem functions and services. The rapid rise of functional plant ecology has been fostered by the mobilization and integration of global trait datasets, but significant knowledge gaps remain about the functional traits of the ∼380,000 vascular plant species worldwide. The acquisition of urgently needed information through field campaigns remains challenging, time-consuming and costly. An alternative and so far largely untapped resource for trait information is text in books, research articles and on the internet, which can be mobilized by modern machine learning techniques.

Here, we propose a natural language processing (NLP) pipeline that automatically extracts trait information from an unstructured textual description of a species and provides a confidence score. To achieve this, we employ text classification models for categorical traits and question answering models for numerical traits. We demonstrate the proposed pipeline on five categorical traits (growth form, life cycle, epiphytism, climbing habit and life form) and three numerical traits (plant height, leaf length and leaf width). We evaluate the performance of our new NLP pipeline by comparing results obtained with alternative modeling approaches, ranging from a simple keyword search to large language models, on two extensive databases, each containing more than 50,000 species descriptions.

The final optimized pipeline used a transformer architecture to obtain a mean precision of 90.8% (range 81.6-97%) and a mean recall of 88.6% (range 77.4-97%) on the categorical traits, an average increase of 21.4% in precision and 57.4% in recall compared to a standard approach using regular expressions. The question answering model for numerical traits obtained a normalized mean absolute error of 10.3% averaged across all traits.

The NLP pipeline we propose has the potential to facilitate the digitalization and extraction of large amounts of plant functional trait information residing in scattered textual descriptions. Additionally, our study adds to an emerging body of NLP applications in an ecological context, opening up new opportunities for further research at the intersection of these fields.
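To make the regular-expression baseline and the precision and recall figures more concrete, the following is a minimal Python sketch, not the authors' implementation. The keyword patterns, the function names `predict_growth_forms` and `precision_recall`, and the four example descriptions are all hypothetical and chosen only for illustration. It shows the two typical failure modes of keyword matching on one categorical trait (growth form): a description that never names its category, and a keyword that appears in an unrelated context.

```python
import re

# Hypothetical keyword patterns for one categorical trait (growth form).
# These keywords are illustrative assumptions, not the rules used in the study.
GROWTH_FORM_PATTERNS = {
    "tree": re.compile(r"\btrees?\b", re.IGNORECASE),
    "shrub": re.compile(r"\b(?:shrubs?|subshrubs?)\b", re.IGNORECASE),
    "herb": re.compile(r"\b(?:herbs?|herbaceous)\b", re.IGNORECASE),
}


def predict_growth_forms(description):
    """Return the set of growth forms whose keyword occurs in the text."""
    return {
        label
        for label, pattern in GROWTH_FORM_PATTERNS.items()
        if pattern.search(description)
    }


def precision_recall(predicted, actual, label):
    """Precision and recall for one label over paired prediction/truth sets."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if label in p and label in a)
    fp = sum(1 for p, a in pairs if label in p and label not in a)
    fn = sum(1 for p, a in pairs if label not in p and label in a)
    # By convention here, an undefined ratio (no predictions or no positives) is 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up descriptions with made-up reference labels.
descriptions = [
    "A deciduous tree up to 20 m tall.",
    "Perennial herb with erect stems.",
    "Woody plant to 3 m, much branched from the base.",   # a shrub, but no keyword
    "Annual herb found beneath trees in open woodland.",  # a herb, but 'trees' matches
]
truth = [{"tree"}, {"herb"}, {"shrub"}, {"herb"}]

predictions = [predict_growth_forms(d) for d in descriptions]
for label in GROWTH_FORM_PATTERNS:
    p, r = precision_recall(predictions, truth, label)
    print(f"{label}: precision={p:.2f} recall={r:.2f}")
```

In this toy example the third description is missed entirely, which lowers recall for "shrub", and the fourth is wrongly tagged as a tree, which lowers precision for "tree". A classifier that reads the whole description in context is one way to reduce both kinds of error, which is consistent with the gains over regular expressions reported above.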

Publisher

Cold Spring Harbor Laboratory

