Abstract
Machine learning has revolutionized environmental sciences by estimating scarce environmental data, such as air quality, land cover type, wildlife population counts, and disease risk. However, current methods for validating these models often ignore the spatial or temporal structure commonly found in environmental data, leading to inaccurate evaluations of model quality. This paper outlines the problems that can arise from such validation methods and describes how to avoid erroneous assumptions about training data structure. In an example on air quality estimation, we show that a poor model with an r² of 0.09 can falsely appear to achieve an r² of 0.73 by failing to account for Simpson’s paradox. This same model’s r² can further inflate to 0.82 when the data are split improperly. To ensure high-quality synthetic data for research in environmental science, justice, and health, researchers must use validation procedures that reflect the structure of their training data.
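The splitting pitfall the abstract describes can be illustrated with a minimal sketch (not the paper's code): when observations from the same monitoring site land in both the training and test folds, a model can score well simply by memorizing site-level baselines, whereas holding out entire sites exposes how little transferable signal it learned. The data-generating process, variable names, and model choice below are illustrative assumptions, not the authors' experiment.

```python
# Minimal sketch, assuming synthetic "air-quality-like" data with per-site structure.
# Contrasts a naive random K-fold split with a site-grouped split.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, GroupKFold, cross_val_score

rng = np.random.default_rng(0)

n_sites, n_per_site = 30, 50
site_ids = np.repeat(np.arange(n_sites), n_per_site)

# Each monitoring site has its own location and baseline concentration.
# The baseline is unrelated to location, so location carries no signal
# that transfers to unseen sites.
site_coords = rng.uniform(0.0, 100.0, size=(n_sites, 2))
site_mean = rng.normal(20.0, 8.0, size=n_sites)

# Features: jittered site coordinates. Target: site baseline plus noise.
X = site_coords[site_ids] + rng.normal(0.0, 0.01, size=(site_ids.size, 2))
y = site_mean[site_ids] + rng.normal(0.0, 2.0, size=site_ids.size)

model = RandomForestRegressor(n_estimators=200, random_state=0)

# Random split: the same sites appear in train and test folds, so the model
# effectively memorizes site baselines and r^2 looks optimistic.
r2_random = cross_val_score(
    model, X, y, cv=KFold(n_splits=5, shuffle=True, random_state=0), scoring="r2"
)

# Grouped split: whole sites are held out, mimicking prediction at unmonitored
# locations; only signal that generalizes across sites contributes to r^2.
r2_grouped = cross_val_score(
    model, X, y, groups=site_ids, cv=GroupKFold(n_splits=5), scoring="r2"
)

print(f"random-split mean r^2:  {r2_random.mean():.2f}")   # inflated
print(f"site-grouped mean r^2: {r2_grouped.mean():.2f}")   # near zero or negative
```

The grouped split is one way to make the validation reflect the structure of the training data; analogous time-blocked splits apply when the dependence is temporal rather than spatial.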
Funder
National Science Foundation