Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package

Authors:

Alexander G. Hurley, Richard L. Peters, Christoforos Pappas, David N. Steger, Ingo Heinrich

Abstract

Ecological research, like all Earth System Sciences, is becoming increasingly data-rich. Tools for processing “big data” are continuously developed to meet the corresponding technical and logistical challenges. However, even at smaller scales, data sets may be challenging when best practices in data exploration, quality control and reproducibility are to be met. This can occur when conventional methods, such as generating and assessing diagnostic visualizations or tables, become infeasible due to time and practicality constraints. Interactive processing can alleviate this issue, and is increasingly utilized to ensure that large data sets are diligently handled. However, recent interactive tools rarely enable data manipulation, may not generate reproducible outputs, or are typically data- or domain-specific. We developed datacleanr, an interactive tool that facilitates best practices in data exploration, quality control (e.g., outlier assessment) and flexible processing for multiple tabular data types, including time series and georeferenced data. The package is open-source and based on the R programming language. A key functionality of datacleanr is the “reproducible recipe”: a translation of all interactive actions into R code, which can be integrated into existing analysis pipelines. This enables researchers experienced with script-based workflows to utilize the strengths of interactive processing without sacrificing their usual work style or functionalities from other (R) packages. We demonstrate the package’s utility by addressing two common issues during data analysis, namely (1) identifying problematic structures and artefacts in hierarchically nested data, and (2) preventing excessive loss of data from ‘coarse,’ code-based filtering of time series. Ultimately, with datacleanr we aim to improve researchers’ workflows and increase confidence in and reproducibility of their results.
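The contrast between ‘coarse,’ code-based filtering and a replayable record of individually selected observations can be sketched in a few lines. This is a language-neutral illustration written in Python, not datacleanr's API or its generated R code. The readings, the threshold, the flagged index and the helper `apply_recipe` are all hypothetical and chosen only to show the idea.

```python
# Illustrative sketch only: NOT datacleanr's API (datacleanr is an R package).
# All values and names below are hypothetical.

# Made-up time series; the value at index 3 is assumed to be a sensor artefact,
# while the values at indices 5 and 6 are assumed to be a genuine signal.
readings = [12.1, 12.4, 12.9, 48.0, 13.2, 17.8, 18.1, 13.0]

# Coarse, code-based filter: drop everything above one global threshold.
THRESHOLD = 15.0
coarse = [v for v in readings if v <= THRESHOLD]

# Targeted removal: indices assumed to come from an interactive selection
# (for example, clicking a point in a plot), stored so they can be replayed.
flagged = [3]


def apply_recipe(values, outlier_indices):
    """Replay recorded selections: drop only the flagged observations."""
    drop = set(outlier_indices)
    return [v for i, v in enumerate(values) if i not in drop]


targeted = apply_recipe(readings, flagged)

print("lost to threshold filter:", len(readings) - len(coarse))
print("lost to targeted removal:", len(readings) - len(targeted))
```

Because the selection is stored as data (`flagged`) and applied by a plain function, the same cleaning step can be rerun in a script without repeating the interactive session, which is the general idea behind a reproducible record of interactive actions.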

Funder

Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung

Helmholtz-Gemeinschaft

Publisher

Public Library of Science (PLoS)

Subject

Multidisciplinary

