Evaluation of Sampling and Cross-Validation Tuning Strategies for Regional-Scale Machine Learning Classification

Published:2019-01-18 Issue:2 Volume:11 Page:185
ISSN:2072-4292
Container-title:Remote Sensing
language:en
Short-container-title:Remote Sensing

Author:

Christopher A. Ramezan, Timothy A. Warner, Aaron E. Maxwell

Abstract

High spatial resolution (1–5 m) remotely sensed datasets are increasingly being used to map land cover over large geographic areas using supervised machine learning algorithms. Although many studies have compared machine learning classification methods, sample selection methods for acquiring training and validation data and cross-validation techniques for tuning classifier parameters are rarely investigated, particularly on large, high spatial resolution datasets. This work, therefore, examines four sample selection methods (simple random, proportional stratified random, disproportional stratified random, and deliberative sampling) as well as three cross-validation tuning approaches (k-fold, leave-one-out, and Monte Carlo methods). In addition, the effect on accuracy of localizing sample selection to a small geographic subset of the entire area, an approach that is sometimes used to reduce the cost of training data collection, is investigated. These methods are investigated in the context of support vector machine (SVM) classification and geographic object-based image analysis (GEOBIA), using high spatial resolution National Agricultural Imagery Program (NAIP) orthoimagery and LIDAR-derived rasters covering a 2,609 km² regional-scale area in northeastern West Virginia, USA. Stratified-statistical-based sampling methods were found to generate the highest classification accuracy. Using a small number of training samples collected from only a subset of the study area provided a similar level of overall accuracy to a sample of equivalent size collected in a dispersed manner across the entire regional-scale dataset. There were minimal differences in accuracy among the different cross-validation tuning methods. The processing times for Monte Carlo and leave-one-out cross-validation were high, especially with large training sets. For this reason, k-fold cross-validation appears to be a good choice.
Classifications trained with samples collected deliberately (i.e., not randomly) were less accurate than classifiers trained from statistical-based samples. This may be due to the high positive spatial autocorrelation in the deliberative training set. Thus, if possible, samples for training should be selected randomly; deliberative samples should be avoided.
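
The statistical sampling designs and the three cross-validation schemes named in the abstract can be illustrated with a minimal, standard-library Python sketch. This is not the authors' implementation: the class names and counts are hypothetical, the disproportional design is assumed here to mean equal allocation per class, and all function names are invented for illustration. Deliberative sampling is not sketched because it depends on analyst judgment rather than a random draw.

```python
import random
from collections import Counter, defaultdict


def simple_random(labels, n, rng):
    """Draw n sample indices without regard to class."""
    return rng.sample(range(len(labels)), n)


def _by_class(labels):
    """Group sample indices by class label."""
    groups = defaultdict(list)
    for i, c in enumerate(labels):
        groups[c].append(i)
    return groups


def proportional_stratified(labels, n, rng):
    """Allocate n across classes in proportion to class frequency.
    Rounding means the total can differ slightly from n."""
    picked = []
    for idx in _by_class(labels).values():
        k = max(1, round(n * len(idx) / len(labels)))
        picked += rng.sample(idx, min(k, len(idx)))
    return picked


def equal_stratified(labels, n, rng):
    """Disproportional design, assumed here to be equal allocation per class."""
    groups = _by_class(labels)
    k = n // len(groups)
    picked = []
    for idx in groups.values():
        picked += rng.sample(idx, min(k, len(idx)))
    return picked


def kfold_splits(n, k):
    """Yield (train, test) index lists for k interleaved folds.
    Setting k == n gives leave-one-out cross-validation."""
    idx = list(range(n))
    for f in range(k):
        test = idx[f::k]
        held_out = set(test)
        yield [i for i in idx if i not in held_out], test


def monte_carlo_splits(n, reps, test_fraction, rng):
    """Yield reps independent random (train, test) splits."""
    idx = list(range(n))
    n_test = max(1, int(n * test_fraction))
    for _ in range(reps):
        shuffled = rng.sample(idx, n)
        yield shuffled[n_test:], shuffled[:n_test]


rng = random.Random(0)
# Hypothetical, imbalanced class labels for 1,000 image objects.
labels = ["forest"] * 700 + ["grass"] * 200 + ["water"] * 70 + ["barren"] * 30

srs = Counter(labels[i] for i in simple_random(labels, 100, rng))
prop = Counter(labels[i] for i in proportional_stratified(labels, 100, rng))
equal = Counter(labels[i] for i in equal_stratified(labels, 100, rng))
print("simple random:", dict(srs))
print("proportional stratified:", dict(prop))
print("equal stratified:", dict(equal))

# Model fits needed per candidate parameter setting with 100 training samples.
n_train = 100
fits = {
    "k-fold (k=10)": len(list(kfold_splits(n_train, 10))),
    "leave-one-out": len(list(kfold_splits(n_train, n_train))),
    "Monte Carlo (50 reps)": len(list(monte_carlo_splits(n_train, 50, 0.2, rng))),
}
print(fits)
```

The fit counts show why processing time diverges: k-fold needs a fixed number of model fits per parameter setting, leave-one-out needs one fit per training sample, and Monte Carlo needs as many fits as the number of repetitions chosen.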

Publisher

MDPI AG

Subject

General Earth and Planetary Sciences

Link

http://www.mdpi.com/2072-4292/11/2/185/pdf

Cited by 181 articles.

1. Hotter, drier climate influences tropical tree cover loss and promotes bracken fern dominance within arrested successional patches in Andean Cloud Forests;Biological Conservation;2024-09

2. Leveraging multi-omics and machine learning approaches in malting barley research: From farm cultivation to the final products;Current Plant Biology;2024-09

3. Accurate diagnosis of acute appendicitis in the emergency department: an artificial intelligence-based approach;Internal and Emergency Medicine;2024-08-21

4. SWIR based estimation of TIR emissivity of bare soil surfaces using deep conditional generative adversarial network in Landsat data;Plant and Soil;2024-08-06

5. Edge-protected IDW-based DEM detail enhancement and 3D terrain visualization;Computers & Graphics;2024-08