Random Forests in Count Data Modelling: An Analysis of the Influence of Data Features and Overdispersion on Regression Performance

Author:

Mushagalusa Ciza Arsène (1, 2), Fandohan Adandé Belarmain (1, 3), Glèlè Kakaï Romain (1)

Affiliation:

1. Laboratoire de Biomathématiques et d’Estimations Forestières, Faculty of Agronomic Sciences, University of Abomey-Calavi, 04 PB 1525, Cotonou, Benin

2. Faculty of Agriculture and Environmental Sciences, Université Evangélique en Afrique (UEA), P. O. Box: 3323, Bukavu, Democratic Republic of the Congo

3. Unité de Recherche en Foresterie et Conservation des Bioressources, Ecole de Foresterie Tropicale, Université Nationale d’Agriculture, BP 43, Kétou, Benin

Abstract

Machine learning algorithms, especially random forests (RFs), have become an integral part of modern scientific methodology and represent an efficient alternative to conventional parametric algorithms. This study aimed to assess the influence of data features and overdispersion on RF regression performance. We assessed the effect of the types of predictors (100, 75, 50, and 20% continuous, and 100% categorical), the number of predictors (p = 8, 16, and 24), and the sample size (N = 50, 250, and 1250) on RF parameter settings. We also compared RF performance to that of classical generalized linear models (Poisson, negative binomial, and zero-inflated Poisson) and of the linear model applied to log-transformed data. Two real datasets were analysed to demonstrate the usefulness of RF for modelling overdispersed data. Goodness-of-fit statistics such as root mean square error (RMSE) and bias were used to determine RF accuracy and validity. Results revealed that the number of variables randomly selected for each split, the proportion of samples used to train the model, the minimal number of samples within each terminal node, and RF regression performance are not influenced by the sample size or by the number and type of predictors. However, the ratio of observations to the number of predictors affects the stability of the best RF parameters. RF performs well for all types of covariates and different levels of dispersion. The magnitude of dispersion does not significantly influence RF predictive validity. In contrast, its predictive accuracy is significantly influenced by the magnitude of dispersion in the response variable, conditional on the explanatory variables. RF performed almost as well as the models of the classical Poisson family in the presence of overdispersion. Given RF’s advantages, it is an appropriate statistical alternative for count data.
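The two goodness-of-fit statistics named in the abstract, and the notion of overdispersion itself, can be illustrated with a minimal sketch using only the Python standard library. This is not the authors' code: the helper names (`rmse`, `bias`, `dispersion_index`, `poisson_draw`, `negbin_draw`), the mean of 4, the size parameter of 0.5, and the sample of 2000 draws are all assumptions chosen for illustration, and a constant prediction stands in for real model output.

```python
import math
import random
from statistics import mean, variance


def rmse(y_true, y_pred):
    """Root mean square error: square root of the mean squared prediction error."""
    return math.sqrt(mean((p - t) ** 2 for t, p in zip(y_true, y_pred)))


def bias(y_true, y_pred):
    """Mean signed error (prediction minus observation)."""
    return mean(p - t for t, p in zip(y_true, y_pred))


def dispersion_index(counts):
    """Sample variance over sample mean; values well above 1 suggest overdispersion."""
    return variance(counts) / mean(counts)


def poisson_draw(rng, lam):
    """One Poisson(lam) variate via Knuth's multiplication method (adequate for moderate lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def negbin_draw(rng, mu, size):
    """One negative binomial count as a gamma-Poisson mixture:
    mean mu, variance mu + mu**2 / size (smaller size means more overdispersion)."""
    lam = rng.gammavariate(size, mu / size)  # gamma with shape=size, scale=mu/size
    return poisson_draw(rng, lam)


# Illustrative (assumed) settings, not taken from the paper.
rng = random.Random(42)
mu = 4.0
n = 2000
poisson_counts = [poisson_draw(rng, mu) for _ in range(n)]
negbin_counts = [negbin_draw(rng, mu, size=0.5) for _ in range(n)]

# A constant prediction equal to the true mean stands in for model output.
pred = [mu] * n

print("dispersion index, Poisson:", dispersion_index(poisson_counts))
print("dispersion index, negative binomial:", dispersion_index(negbin_counts))
print("RMSE on Poisson counts:", rmse(poisson_counts, pred))
print("RMSE on negative binomial counts:", rmse(negbin_counts, pred))
print("bias on negative binomial counts:", bias(negbin_counts, pred))
```

In this toy setting the prediction is correct on average for both samples, so the bias stays near zero, while the RMSE grows with the spread of the response. That is one way to read the abstract's distinction between predictive validity (bias) and predictive accuracy (RMSE) under increasing dispersion.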

Funder

Carnegie Corporation of New York

Publisher

Hindawi Limited

Subject

Statistics and Probability

