Variable ranking and selection with random forest for unbalanced data

Published: 2022
Volume: 1
ISSN: 2634-4602
Container-title: Environmental Data Science
Language: en
Short-container-title: Environ. Data Science

Author:

Bradter Ute, Altringham John D., Kunin William E., Thom Tim J., O’Connell Jerome, Benton Tim G.

Abstract

When one or several classes are much less prevalent than another class (unbalanced data), class error rates and variable importances of the machine learning algorithm random forest can be biased, particularly when sample sizes are smaller, imbalance levels higher, and effect sizes of important variables smaller. Using simulated data varying in size, imbalance level, number of true variables, their effect sizes, and the strength of multicollinearity between covariates, we evaluated how eight versions of random forest ranked and selected true variables out of a large number of covariates despite class imbalance. The version that calculated variable importance based on the area under the curve (AUC) was least adversely affected by class imbalance. For the same number of true variables, effect sizes, and multicollinearity between covariates, the AUC variable importance still ranked true variables highly at the lower sample sizes and higher imbalance levels at which the other seven versions no longer achieved high ranks for true variables. Conversely, using the Hellinger distance to split trees or downsampling the majority class ranked true variables lower and more variably even at the larger sample sizes and lower imbalance levels at which the other algorithms still ranked true variables highly. In variable selection, a higher proportion of true variables was identified when covariates were ranked by AUC importances, and the proportion increased further when the AUC was used as the criterion in forward variable selection. In three case studies, known species–habitat relationships and their spatial scales were identified despite unbalanced data.
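The idea of ranking covariates by an AUC-based importance on unbalanced data can be illustrated with a minimal sketch. This is not the authors' implementation and does not reproduce any of the eight random forest versions they compared: it assumes scikit-learn, uses permutation importance scored by ROC AUC on a held-out split as a stand-in for an AUC variable importance, and the simulation settings (one true variable, 20 noise covariates, a logistic link, the function names) are hypothetical choices made only for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split


def simulate_unbalanced(n=2000, n_noise=20, effect=2.0, intercept=-3.0, seed=0):
    """Simulate a binary response with a rare positive class.

    Column 0 is the only true variable; the remaining columns are noise.
    All settings are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, 1 + n_noise))
    logit = intercept + effect * X[:, 0]   # negative intercept makes class 1 rare
    p = 1.0 / (1.0 + np.exp(-logit))
    y = rng.binomial(1, p)
    return X, y


def auc_importance_ranking(X, y, seed=0):
    """Rank covariates by the mean drop in ROC AUC when each one is permuted."""
    # Stratify so the rare class appears in both the training and test splits.
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    forest.fit(X_tr, y_tr)
    result = permutation_importance(
        forest, X_te, y_te, scoring="roc_auc", n_repeats=10, random_state=seed
    )
    ranking = np.argsort(result.importances_mean)[::-1]  # most important first
    return ranking, result.importances_mean


X, y = simulate_unbalanced()
ranking, importances = auc_importance_ranking(X, y)
print("minority class prevalence:", y.mean())
print("top five covariates by AUC importance:", ranking[:5])
```

Because the AUC is computed from the ranking of predicted class probabilities rather than from thresholded class labels, it does not collapse when the forest predicts the majority class for almost every observation, which is one plausible reason an AUC-based importance is less sensitive to imbalance than an error-rate-based one.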

Funder

Biotechnology and Biological Sciences Research Council

Nigel Bertram Charitable Trust

Publisher

Cambridge University Press (CUP)

References: 92 articles.

1. Classifying grass-dominated habitats from remotely sensed data: The influence of spectral resolution, acquisition time and the vegetation classification system on accuracy and thematic resolution

2. Evaluating effectiveness of down-sampling for stratified designs and unbalanced prevalence in Random Forest models of tree species distributions in Nevada

3. Schmid, H , Zbinden, N and Keller, V (2004) Überwachung der Bestandsentwicklung häufiger Brutvögel in der Schweiz, Schweizerische Vogelwarte, Sempach.

4. The Roles of Predation, Food and Agricultural Practice in Determining the Breeding Success of the Lapwing (Vanellus vanellus) on Upland Grasslands

Cited by 5 articles.

1. Air Quality Class Prediction Using Machine Learning Methods Based on Monitoring Data and Secondary Modeling;Atmosphere;2024-04-30

2. Disaster relief supply chain network planning under uncertainty;Annals of Operations Research;2024-04-08

3. Modeling Wildland Firefighters’ Assessments of Structure Defensibility;Fire;2023-12-17

4. Multimethod approach to advance provenance determination of fish in stocked systems;Canadian Journal of Fisheries and Aquatic Sciences;2023-09-01

5. Flood Defense Standard Estimation Using Machine Learning and Its Representation in Large‐Scale Flood Hazard Modeling;Water Resources Research;2023-05