Abstract
Hyperspectral imaging has emerged as a pivotal tool for classifying plant materials (seeds, leaves, and whole plants), pharmaceutical products, food items, and many other objects. This communication addresses two issues that appear to be overlooked or ignored in >99% of hyperspectral imaging studies: 1) the “small N, large P” problem, in which the number of spectral bands (explanatory variables, “P”) surpasses the number of observations (“N”), leading to potential model over-fitting, and 2) the absence of independent validation data in performance assessments of classification models. Based on simulations of randomly generated data, we illustrate the risks associated with these issues. We explore and discuss the consequences of over-fitting and the risk of misleadingly high accuracy that can result from having a large number of variables relative to observations. We highlight how these issues relate to radiometric repeatability (levels of stochastic noise). We propose a method in which a theoretical dataset is generated to mirror the structure of an actual dataset, with the classification of this theoretical dataset serving as a reference. By shedding light on these important and common experimental design issues, we aim to enhance methodological rigor and transparency in classifications of hyperspectral imaging data and to foster improved and effective applications across various science domains.
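The over-fitting risk described above can be demonstrated with a minimal simulation. The sketch below is not the authors' simulation; it is an illustrative, numpy-only example with hypothetical sizes (N = 30 observations, P = 200 bands). Because pure-noise features outnumber observations, a linear least-squares classifier fits random labels perfectly on the training data, while accuracy on fresh noise drawn from the same distribution stays near chance:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 30, 200  # hypothetical: fewer observations (N) than spectral bands (P)

X = rng.normal(size=(N, P))     # pure-noise "spectra" with no class signal
y = rng.integers(0, 2, size=N)  # random binary class labels

# With P > N the linear system is underdetermined, so a weight vector that
# reproduces the training labels exactly almost surely exists.
w, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)
train_acc = np.mean((X @ w > 0) == (y == 1))

# Independent "validation" set: fresh noise from the same distribution.
X_new = rng.normal(size=(N, P))
y_new = rng.integers(0, 2, size=N)
val_acc = np.mean((X_new @ w > 0) == (y_new == 1))

print(f"training accuracy:   {train_acc:.2f}")  # near 1.00: over-fit to noise
print(f"validation accuracy: {val_acc:.2f}")    # near 0.50: chance level
```

Training accuracy alone would suggest a near-perfect classifier; only the independent validation set reveals that no real signal was learned, which is the core argument for requiring independent validation data.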
Publisher
Cold Spring Harbor Laboratory