Affiliation:
1. Institute of Statistics and Mathematical Methods in Economics, Research Unit Computational Statistics, TU Wien, Vienna, Austria
Abstract
The performance of multivariate calibration models ŷ = f(x) for the prediction of a numerical property y from a set of x‐variables depends on the type of scaling of the x‐variables. Common scaling methods are autoscaling (dividing the centered x by its standard deviation s) and Pareto scaling (dividing the centered x by s^P with P = 0.5). The adjusted Pareto scaling presented here varies the exponent P between 0 (no scaling) and 1 (autoscaling) with the aim of obtaining an optimum prediction performance for ŷ. Related scaling methods based on the variable spread are range scaling and vast scaling, while level scaling is based on the location (central value) of the variable. These scaling methods and robust versions are compared for models created by partial least‐squares (PLS) regression. The applied strategy, repeated double cross validation (rdCV), evaluates the model performance for test set objects and considers its variability. Results with three data sets from chemistry show: (a) the efficacy of the different scaling methods depends on the data structure; (b) optimization of the Pareto exponent P is recommended; (c) range scaling or vast scaling may be better than adjusted Pareto scaling; (d) in general, a heuristic search for the best scaling method is advisable. Overall, the consideration of different variants of scaling allows for a flexible adjustment of the variable contributions to the calibration model.
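As an illustration of the scaling family described in the abstract, the following minimal sketch divides each centered x‐variable by s^P, so that P = 0 gives centering only, P = 0.5 gives Pareto scaling, and P = 1 gives autoscaling. The function name `adjusted_pareto_scale`, the use of NumPy, and the choice of the sample standard deviation (`ddof=1`) are assumptions made for this example; they are not taken from the paper, and the robust variants, the PLS modelling, and the rdCV evaluation are not shown.

```python
import numpy as np


def adjusted_pareto_scale(X, p=0.5):
    """Center each column of X and divide it by s**p.

    Hypothetical helper for illustration only:
    p = 0   -> centering only (no scaling)
    p = 0.5 -> Pareto scaling
    p = 1   -> autoscaling
    """
    X = np.asarray(X, dtype=float)
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must lie between 0 and 1")
    center = X.mean(axis=0)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    s = X.std(axis=0, ddof=1)
    # Note: constant columns (s = 0) are not handled in this sketch.
    return (X - center) / s**p
```

In a calibration workflow, P would be treated as a tuning parameter: the scaling is applied for several values of P between 0 and 1, and the value giving the best cross‐validated prediction performance is kept. To avoid leaking information, the center and s would normally be estimated on the calibration objects only and then applied to the test objects; this sketch computes them on the array it receives.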