Potential of natural language processing for metadata extraction from environmental scientific publications

Published:2023-03-14 Issue:1 Volume:9 Page:155-168
ISSN:2199-398X
Container-title:SOIL
language:en
Short-container-title:SOIL

Author:

Blanchy Guillaume, Albrecht Lukas, Koestel John, Garré Sarah

Abstract

Summarizing information from large bodies of scientific literature is an essential but work-intensive task. This is especially true in environmental studies where multiple factors (e.g., soil, climate, vegetation) can contribute to the effects observed. Meta-analyses, studies that quantitatively summarize findings of a large body of literature, rely on manually curated databases built upon primary publications. However, given the increasing amount of literature, this manual work is likely to require more and more effort in the future. Natural language processing (NLP) facilitates this task, but it is not yet clear to what extent the extraction process is reliable or complete. In this work, we explore three NLP techniques that can help support this task: topic modeling, tailored regular expressions and the shortest dependency path method. We apply these techniques in a practical and reproducible workflow on two corpora of documents: the Open Tension-disk Infiltrometer Meta-database (OTIM) and the Meta corpus. The OTIM corpus contains the source publications of the entries of the OTIM database of near-saturated hydraulic conductivity from tension-disk infiltrometer measurements (https://github.com/climasoma/otim-db, last access: 1 March 2023). The Meta corpus consists of all primary studies from 36 selected meta-analyses on the impact of agricultural practices on sustainable water management in Europe. As a first step of our practical workflow, we identified different topics from the individual source publications of the Meta corpus using topic modeling. This enabled us to distinguish well-researched topics (e.g., conventional tillage, cover crops), where meta-analysis would be useful, from neglected topics (e.g., effect of irrigation on soil properties), showing potential knowledge gaps.
Then, we used tailored regular expressions to extract coordinates, soil texture, soil type, rainfall, disk diameter and tensions from the OTIM corpus to build a quantitative database. Depending on the variable, we retrieved between 56 % and 100 % of all relevant information (recall), with a precision between 83 % and 100 %. Finally, we extracted relationships between a set of drivers corresponding to different soil management practices or amendments (e.g., “biochar”, “zero tillage”) and target variables (e.g., “soil aggregate”, “hydraulic conductivity”, “crop yield”) from the abstracts of the Meta corpus source publications using the shortest dependency path between them. These relationships were further classified according to positive, negative or absent correlations between the driver and the target variable. This quickly provided an overview of the different driver–variable relationships and their abundance for an entire body of literature. Overall, we found that all three tested NLP techniques were able to support evidence synthesis tasks. While human supervision remains essential, NLP methods have the potential to support automated evidence synthesis which can be continuously updated as new publications become available.
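
The regular-expression step and its recall and precision scores can be illustrated with a minimal Python sketch. The pattern, the four toy sentences, the hand-labelled values and the helper names below are all assumptions made for illustration; they are not the expressions or data used by the authors.

```python
import re

# Illustrative pattern (an assumption, not the authors' expression):
# the words "diameter of", a decimal number, and a length unit.
DIAMETER_RE = re.compile(r"diameter\s+of\s+(\d+(?:\.\d+)?)\s*(mm|cm)\b", re.IGNORECASE)


def extract_diameters_cm(text):
    """Return the set of diameters matched in `text`, converted to centimetres."""
    values = set()
    for number, unit in DIAMETER_RE.findall(text):
        value = float(number)
        if unit.lower() == "mm":
            value /= 10.0
        values.add(value)
    return values


def precision_recall(extracted, expected):
    """Score extracted items against manually curated items (both are sets)."""
    true_positives = len(extracted & expected)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Toy corpus: (sentence, hand-labelled disk diameters in cm). Hypothetical data.
CORPUS = [
    ("Infiltration was measured with a tension-disk infiltrometer with a diameter of 20 cm.", {20.0}),
    ("We used a disc with a diameter of 45 mm at tensions of 1, 3 and 6 cm.", {4.5}),
    ("The 8 cm disk was placed on a thin sand layer.", {8.0}),
    ("Soil cores with a diameter of 5 cm were sampled next to the 20 cm diameter infiltrometer disk.", {20.0}),
]

# Pair each value with its document index so that equal values found in
# different documents are counted separately.
extracted = {(i, v) for i, (text, _) in enumerate(CORPUS) for v in extract_diameters_cm(text)}
expected = {(i, v) for i, (_, labels) in enumerate(CORPUS) for v in labels}

precision, recall = precision_recall(extracted, expected)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this toy corpus the pattern misses diameters phrased differently (third and fourth sentences), which lowers recall, and it picks up a soil-core diameter that is not a disk diameter, which lowers precision. This mirrors why such scores vary with the variable being extracted and why the abstract stresses human supervision.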

Funder

Horizon 2020

Publisher

Copernicus GmbH

Subject

Soil Science

Link

https://soil.copernicus.org/articles/9/155/2023/soil-9-155-2023.pdf

Reference: 24 articles.

1. Angeli, G., Johnson Premkumar, M. J., and Manning, C. D.: Leveraging Linguistic Structure For Open Domain Information Extraction, in: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Beijing, China, 344–354, https://doi.org/10.3115/v1/P15-1034, 2015.

2. Caracciolo, C., Stellato, A., Morshed, A., Johannsen, G., Rajbhandari, S., Jaques, Y., and Keizer, J.: The AGROVOC linked dataset, Semantic Web, 4, 341–348, 2013.

3. EJP SOIL – CLIMASOMA: CLIMASOMA – Final report Climate change adaptation through soil and crop management: Synthesis and ways forward, https://climasoma.curve.space/report (last access: 1 March 2023), 2022.

4. Furey, J., Davis, A., and Seiter-Moser, J.: Natural language indexing for pedoinformatics, Geoderma, 334, 49–54, https://doi.org/10.1016/j.geoderma.2018.07.050, 2019.

5. Haddaway, N. R., Callaghan, M. W., Collins, A. M., Lamb, W. F., Minx, J. C., Thomas, J., and John, D.: On the use of computer-assistance to facilitate systematic mapping, Campbell Systematic Reviews, 16, e1129, https://doi.org/10.1002/cl2.1129, 2020.

Cited by 1 article.

1. The soil knowledge library (KLIB) – a structured literature database on soil process research, SOIL, 2023-10-17.