Coping with imbalanced data problem in digital mapping of soil classes

Author:

Sharififar Amin1, Sarmadian Fereydoon2

Affiliation:

1. The James Hutton Institute, Aberdeen, UK

2. Department of Soil Science, School of Agricultural Engineering and Technology, University of Tehran, Karaj, Iran

Abstract

An unsolved problem in the digital mapping of categorical soil variables and soil types is the imbalanced number of observations, which reduces accuracy and can cause the minority class (the class with significantly fewer observations than the other classes) to be lost from the final map. So far, synthetic over- and under-sampling techniques have been explored in soil science; however, more efficient approaches are needed that avoid the drawbacks of these techniques and guarantee that the minority classes are retained in the produced map. The approaches suggested in the present study for digital mapping of soil classes are ensemble gradient boosting, cost-sensitive learning, and one-class classification (OCC) of the minority class combined with multi-class classification. Specifically, extreme gradient boosting (XGB) as an ensemble gradient learner, a cost-sensitive decision tree (CSDT) within the C5.0 algorithm, and a one-class support vector machine combined with multi-class classification (OCCM) were investigated to map eight soil great groups with a naturally imbalanced frequency of observations in northwest Iran. A total of 453 profile data points were used for mapping the soil great groups of the study area. The data were split manually for each class separately, which resulted in an overall 70% of the data for calibration and 30% for validation. Calibration was bootstrapped (10 runs) to produce multiple maps for each model, and the 10 bootstraps were evaluated against the hold-out validation dataset. The average values of the accuracy measures, including Kappa (K), overall accuracy (OA), producer's accuracy (PA) and user's accuracy (UA), were explored. In addition, the results were compared with those of a previous study in the same area, in which resampling techniques were used to deal with imbalanced data for digital soil class mapping.
The findings show that all three suggested methods can deal well with the imbalanced classification problem, with OCCM showing the highest K (= 0.76) and OA (= 82%) in the validation stage. This model can also guarantee the retention of the minority classes in the final map. Comparison with the previous study's approach demonstrates that the three newly suggested methods can markedly increase both overall and individual class accuracy for mapping.
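The four accuracy measures named in the abstract can all be derived from a single confusion matrix of the hold-out validation data. The following is a minimal sketch of that calculation, not the authors' code: the function name `accuracy_measures` and the 3-class matrix are hypothetical and chosen only for illustration, and rows are assumed to hold the reference (observed) classes while columns hold the mapped (predicted) classes.

```python
import numpy as np

def accuracy_measures(cm):
    """Return OA, Kappa, PA and UA from a square confusion matrix.

    Assumption: cm[i, j] counts validation points of reference class i
    that the model assigned to class j.
    """
    cm = np.asarray(cm, dtype=float)
    n = cm.sum()
    ref_totals = cm.sum(axis=1)   # points per reference class (rows)
    map_totals = cm.sum(axis=0)   # points per mapped class (columns)
    correct = np.diag(cm)

    oa = correct.sum() / n                            # overall accuracy
    pe = (ref_totals * map_totals).sum() / n ** 2     # chance agreement
    kappa = (oa - pe) / (1.0 - pe)

    # A class that never appears in the map has map_totals == 0, so its
    # UA is undefined (nan) and its PA is 0: the "lost minority class".
    with np.errstate(divide="ignore", invalid="ignore"):
        pa = correct / ref_totals   # producer's accuracy (per class)
        ua = correct / map_totals   # user's accuracy (per class)
    return {"OA": oa, "K": kappa, "PA": pa, "UA": ua}

# Made-up counts for three classes; the third is the minority class.
cm = [[50, 5, 0],
      [4, 30, 1],
      [2, 0, 8]]
m = accuracy_measures(cm)
print(round(m["OA"], 2), round(m["K"], 3))
print(np.round(m["PA"], 2), np.round(m["UA"], 2))
```

Reporting PA and UA per class alongside OA and K matters for imbalanced data, because a model can reach a high OA while the minority class has a PA of zero. In a bootstrapped setup like the one described, these values would be computed once per run and then averaged.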

Publisher

Wiley

Subject

Soil Science

Cited by 3 articles.