Random forest and spatial cross-validation performance in predicting species abundance distributions

Published:2024-06-28 Issue:1 Volume:13
ISSN:2193-2697
Container-title:Environmental Systems Research
language:en
Short-container-title:Environ Syst Res

Author:

Mushagalusa Ciza Arsène, Fandohan Adandé Belarmain, Glèlè Kakaï Romain

Abstract

Random forests (RF) have been widely used to predict spatial variables. Several studies have shown that spatial cross-validation (CV) methods consistently cause RF to yield larger prediction errors than standard CV methods. This study examined the impact of species characteristics and data features on the performance of the standard RF and spatial CV approaches for predicting species abundance distribution. It compared the standard 5-fold CV, design-based validation, and three spatial CV methods: spatial buffering, environmental blocking, and spatial blocking. For design-based validation, validation samples were randomly selected without replacement. We evaluated their predictive performance (accuracy and discrimination metrics) using artificial species abundance data generated by a linear function of a constant term ($\beta_0$) and a random error term following a zero-mean Gaussian process with a covariance matrix determined by an exponential correlation function. The model was tuned over multiple simulations to consider different mean levels of species abundance, spatial autocorrelation variation, and species detection probability. We found that the standard RF had poor predictive performance when spatial autocorrelation was high and the species probability of detection was low. For random samples, design-based validation and standard K-fold CV were the most effective strategies for evaluating RF performance compared to spatial CV methods, even in the presence of high spatial autocorrelation and imperfect detection. For weakly or moderately clustered samples, they yielded good modelling efficiency but overestimated RF's predictive power, while for strongly clustered samples with high spatial autocorrelation they overestimated modelling efficiency, predictive power, and accuracy. Overall, the checkerboard pattern in the allocation of blocks to folds in blocked spatial CV was the most effective CV approach for clustered samples, whatever the degree of clustering, spatial autocorrelation, or species abundance class. For random or systematic samples with spatial autocorrelation, the checkerboard pattern was the best of the spatial CV methods, but it was less effective than the non-spatial CV approaches. Failing to take data features into account when validating models can lead to unrealistic predictions of species abundance and related parameters and, therefore, to incorrect interpretations of patterns and conclusions. Further research should explore the benefits of using blocked spatial K-fold CV with checkerboard assignment of blocks to folds for clustered samples with high spatial autocorrelation.
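As a rough illustration of the setup the abstract describes, the sketch below simulates a constant mean plus a zero-mean Gaussian process error with exponential covariance, then allocates spatial blocks to folds in a checkerboard pattern. It is not the authors' code. The function names (`simulate_abundance`, `checkerboard_folds`), the parameter values (`beta0`, `sigma2`, `range_par`, `block_size`), the unit-square study area, and the two-fold form of the checkerboard are all assumptions made for illustration. Detection probability and the RF model itself are left out.

```python
import numpy as np

def simulate_abundance(coords, beta0=2.0, sigma2=1.0, range_par=0.2, seed=0):
    """Simulate y = beta0 + e, with e ~ GP(0, sigma2 * exp(-d / range_par)).

    All parameter values are illustrative assumptions, not values from the study.
    """
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between sampling locations
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exponential correlation function scaled by the variance
    cov = sigma2 * np.exp(-dist / range_par)
    # Small jitter keeps the Cholesky factorisation numerically stable
    chol = np.linalg.cholesky(cov + 1e-8 * np.eye(len(coords)))
    return beta0 + chol @ rng.standard_normal(len(coords))

def checkerboard_folds(coords, block_size):
    """Assign each point to fold 0 or 1 by the parity of its block's row + column."""
    col = np.floor(coords[:, 0] / block_size).astype(int)
    row = np.floor(coords[:, 1] / block_size).astype(int)
    return (row + col) % 2

# Random sample of 200 locations on a hypothetical unit-square study area
rng = np.random.default_rng(42)
coords = rng.uniform(0.0, 1.0, size=(200, 2))
y = simulate_abundance(coords)
folds = checkerboard_folds(coords, block_size=0.25)

# Each fold serves once as the validation set; the other is the training set
for k in np.unique(folds):
    train_idx = np.where(folds != k)[0]
    test_idx = np.where(folds == k)[0]
    print(f"fold {k}: {len(train_idx)} training points, {len(test_idx)} validation points")
```

In this form, adjacent blocks always fall in different folds, so every validation block is surrounded by training blocks. A smaller `range_par` weakens spatial autocorrelation and a larger one strengthens it, which is the kind of variation the simulations in the abstract explore.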

Funder

Deutscher Akademischer Austauschdienst

International Development Research Centre

Styrelsen för Internationellt Utvecklingssamarbete

Publisher

Springer Science and Business Media LLC

Link

https://link.springer.com/content/pdf/10.1186/s40068-024-00352-9.pdf
