The influence of training sample size on the accuracy of deep learning models for the prediction of soil properties with near-infrared spectroscopy data

Published:2020-11-17 Issue:2 Volume:6 Page:565-578
ISSN:2199-398X
Container-title:SOIL
language:en
Short-container-title:SOIL

Author:

Ng Wartini, Minasny Budiman, Mendes Wanderson de Sousa, Demattê José Alexandre Melo

Abstract

The number of samples used in the calibration data set affects the quality of predictive models of soil attributes generated with visible, near and shortwave infrared (VIS–NIR–SWIR) spectroscopy. Recently, the convolutional neural network (CNN) has been regarded as a highly accurate model for predicting soil properties on a large database. However, it has not yet been ascertained how large the sample size should be for a CNN model to be effective. This paper investigates the effect of the training sample size on the accuracy of deep learning and machine learning models. It aims to estimate how many calibration samples are needed for a CNN to improve the prediction of soil properties relative to conventional machine learning models. In addition, this paper looks at a way to interpret CNN models, which are commonly labelled as a black box. It is hypothesised that the performance of machine learning models will increase with an increasing number of training samples but will plateau at a certain number, while the performance of the CNN will keep improving. The performances of two machine learning models (partial least squares regression – PLSR; Cubist) are compared against the CNN model. A VIS–NIR–SWIR spectral library from Brazil, containing 4251 unique sites with averages of two to three samples per depth (a total of 12 044 samples), was divided into calibration (3188 sites) and validation (1063 sites) sets. Subsets of the calibration data set were then created to represent smaller calibration data sets of 125, 300, 500, 1000, 1500, 2000, 2500 and 2700 unique sites, equivalent to sample sizes of approximately 350, 840, 1400, 2800, 4200, 5600, 7000 and 7650, respectively. All three models (PLSR, Cubist and CNN) were generated for each sample size of the unique sites for the prediction of five soil properties, i.e. cation exchange capacity, organic carbon, sand, silt and clay content. The calibration subset sampling and modelling were repeated 10 times to provide a better representation of the model performances. Learning curves showed that the accuracy increased with an increasing number of training samples. At a lower number of samples (< 1000), PLSR and Cubist performed better than CNN. The CNN outperformed the PLSR and Cubist models at sample sizes of 1500 and 1800, respectively. It can be recommended that deep learning is most efficient for spectra modelling for sample sizes above 2000. The accuracy of the PLSR and Cubist models seems to reach a plateau above sample sizes of 4200 and 5000, respectively, while the accuracy of the CNN has not plateaued. A sensitivity analysis of the CNN model demonstrated its ability to determine important wavelength regions that affected the predictions of various soil attributes.
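
The sampling design described in the abstract (a site-level calibration/validation split, followed by repeated random subsets of calibration sites) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' code: the synthetic site table, the function names and the `evaluate` callback are all hypothetical, and the assumption of 2 to 4 samples per site is a stand-in for the real library. In practice `evaluate` would fit a model (PLSR, Cubist or CNN) on the chosen samples and return a validation metric; here it is replaced by `len` so the example simply reports how many samples each subset of sites yields.

```python
import random
from statistics import mean


def make_synthetic_sites(n_sites, seed=0):
    """Hypothetical stand-in for a spectral library: site id -> list of sample ids.
    Each site gets 2 to 4 samples (an assumption, not the real data)."""
    rng = random.Random(seed)
    sites = {}
    next_id = 0
    for site in range(n_sites):
        k = rng.randint(2, 4)
        sites[site] = list(range(next_id, next_id + k))
        next_id += k
    return sites


def split_sites(site_ids, n_validation, seed=0):
    """Split at site level so all samples from one site stay in the same set."""
    rng = random.Random(seed)
    shuffled = list(site_ids)
    rng.shuffle(shuffled)
    # Returns (calibration sites, validation sites).
    return shuffled[n_validation:], shuffled[:n_validation]


def learning_curve(sites, calibration_sites, subset_sizes, evaluate,
                   n_repeats=10, seed=0):
    """For each subset size, draw n_repeats random subsets of calibration sites,
    gather their samples, and average the score returned by `evaluate`."""
    rng = random.Random(seed)
    curve = {}
    for size in subset_sizes:
        scores = []
        for _ in range(n_repeats):
            chosen = rng.sample(calibration_sites, size)
            sample_ids = [s for site in chosen for s in sites[site]]
            scores.append(evaluate(sample_ids))
        curve[size] = mean(scores)
    return curve


sites = make_synthetic_sites(4251)
calibration, validation = split_sites(list(sites), n_validation=1063)
sizes = [125, 300, 500, 1000, 1500, 2000, 2500, 2700]

# Placeholder score: the number of samples in each subset of sites.
samples_per_subset = learning_curve(sites, calibration, sizes, evaluate=len)
for size, n_samples in samples_per_subset.items():
    print(size, "sites ->", round(n_samples), "samples on average")
```

Subsampling by site rather than by individual sample keeps all samples from one location in the same subset, which is why the abstract reports subset sizes both as numbers of sites and as approximate numbers of samples.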

Publisher

Copernicus GmbH

Subject

Soil Science

Link

https://soil.copernicus.org/articles/6/565/2020/soil-6-565-2020.pdf

References: 52 articles.

1. Acquarelli, J., van Laarhoven, T., Gerretzen, J., Tran, T. N., Buydens, L. M. C., and Marchiori, E.: Convolutional neural networks for vibrational spectroscopic data analysis, Anal. Chim. Acta, 954, 22–31, https://doi.org/10.1016/j.aca.2016.12.010, 2017.

2. Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., and Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems, Software available from tensorflow.org, available at: https://www.tensorflow.org/ (last access: 1 July 2019), 2015.

3. Barnes, R. J., Dhanoa, M. S., and Lister, S. J.: Standard Normal Variate Transformation and De-Trending of near-Infrared Diffuse Reflectance Spectra, Appl. Spectrosc., 43, 772–777, https://doi.org/10.1366/0003702894202201, 1989.

4. Bellinaso, H., Demattê, J. A. M., and Romeiro, S. A.: Soil Spectral Library and Its Use in Soil Classification, Rev. Bras. Cienc. Solo, 34, 861–870, https://doi.org/10.1590/S0100-06832010000300027, 2010.

5. Bendor, E. and Banin, A.: Near-Infrared Analysis as a Rapid Method to Simultaneously Evaluate Several Soil Properties, Soil Sci. Soc. Am. J., 59, 364–372, https://doi.org/10.2136/sssaj1995.03615995005900020014x, 1995.

Cited by 104 articles.

1. Accurate prediction of hyaluronic acid concentration under temperature perturbations using near-infrared spectroscopy and deep learning;Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy;2024-09

2. Hand-feel soil texture classes and particle-size distribution as predictors of soil water content at field capacity. Further insights into the sources of uncertainty;CATENA;2024-09

3. Review of deep learning-based methods for non-destructive evaluation of agricultural products;Biosystems Engineering;2024-09

4. An innovative variant based on generative adversarial network (GAN): Regression GAN combined with hyperspectral imaging to predict pesticide residue content of Hami melon;Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy;2024-09

5. Quantifying uncertainty in the prediction of soil properties using mid-infrared spectra;Geoderma;2024-08