Validating spatio-temporal environmental machine learning models: Simpson’s paradox and data splits

Published:2024-03-01 Issue:3 Volume:6 Page:031003
ISSN:2515-7620
Container-title:Environmental Research Communications
Short-container-title:Environ. Res. Commun.

Author:

Boser Anna

Abstract

Machine learning has revolutionized environmental sciences by estimating scarce environmental data, such as air quality, land cover type, wildlife population counts, and disease risk. However, current methods for validating these models often ignore the spatial or temporal structure commonly found in environmental data, leading to inaccurate evaluations of model quality. This paper outlines the problems that can arise from such validation methods and describes how to avoid erroneous assumptions about training data structure. In an example on air quality estimation, we show that a poor model with an r² of 0.09 can falsely appear to achieve an r² value of 0.73 by failing to account for Simpson’s paradox. This same model’s r² can further inflate to 0.82 when improperly splitting data. To ensure high-quality synthetic data for research in environmental science, justice, and health, researchers must use validation procedures that reflect the structure of their training data.
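The Simpson's paradox effect described in the abstract can be illustrated with a minimal sketch on synthetic data. This is not the paper's data or code: the station counts, noise levels, and the helper names `r_squared` and `pooled_and_within_r2` are all assumptions made for illustration. A hypothetical model that knows each monitoring station's long-run mean, but nothing about day-to-day variation, scores a high r² when all observations are pooled and a near-zero r² when evaluated within each station.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Squared Pearson correlation between observations and predictions."""
    r = np.corrcoef(y_true, y_pred)[0, 1]
    return r ** 2


def pooled_and_within_r2(y_true, y_pred, groups):
    """Return (pooled r^2, mean of per-group r^2) for grouped data."""
    pooled = r_squared(y_true, y_pred)
    within = [
        r_squared(y_true[groups == g], y_pred[groups == g])
        for g in np.unique(groups)
    ]
    return pooled, float(np.mean(within))


# Synthetic setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
n_stations, n_days = 20, 200

# Each station has its own long-run mean pollution level.
station_mean = rng.normal(10.0, 6.0, size=n_stations)
groups = np.repeat(np.arange(n_stations), n_days)

# Observations: station mean plus daily variation.
y_true = station_mean[groups] + rng.normal(0.0, 2.0, size=groups.size)

# Hypothetical model: reproduces the station mean, but its daily
# component is noise that is independent of the true daily variation.
y_pred = station_mean[groups] + rng.normal(0.0, 2.0, size=groups.size)

pooled_r2, within_r2 = pooled_and_within_r2(y_true, y_pred, groups)
print(f"pooled r^2:              {pooled_r2:.2f}")
print(f"mean within-station r^2: {within_r2:.2f}")
```

In this sketch the pooled score is driven almost entirely by differences between stations, so it says little about whether the model can track changes over time at any one location. Which of the two numbers is the relevant one depends on how the predictions will be used.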

Funder

National Science Foundation

Publisher

IOP Publishing

Link

https://iopscience.iop.org/article/10.1088/2515-7620/ad2e44/pdf

Cited by 1 article.

1. Spatial and Spatiotemporal Modeling of Intra-Urban Ultrafine Particles: A Comparison of Linear, Nonlinear, Regularized, and Machine Learning Methods;2024