Abstract
In developing partial least squares calibration models, selecting the number of latent variables used for their construction to minimize both model bias and model variance remains a challenge. Several metrics exist for incorporating these trade-offs, but the cost of model parsimony and the potential for underfitting on achievable prediction errors are difficult to anticipate. We propose a metric that penalizes growing model variance against decreasing bias as additional latent variables are added. The magnitude of the penalty is scaled by a user-defined parameter that is formulated to provide a constraint on the fractional increase in root mean square error of cross-validation (RMSECV) when selecting a parsimonious model over the conventional minimum RMSECV solution. We evaluate this approach for quantification of four organic functional groups using 238 laboratory standards and 750 complex atmospheric organic aerosol mixtures with mid-infrared spectroscopy. Parametric variation of this penalty demonstrates that the increase in prediction errors due to underfitting is bounded by the magnitude of the penalty for samples similar to the laboratory standards used for model training and validation. Imposing an ensemble of penalties corresponding to a 0-30% allowable increase in RMSECV through sum of ranking differences leads to the selection of a model that increases the actual RMSECV by up to 20% for laboratory standards but achieves an 85% reduction in the mean error in predicted concentrations for environmental mixtures. Partial least squares models developed with laboratory mixtures can provide useful predictions in complex environmental samples, but may benefit from protection against overfitting.
© 2015 The Authors. Journal of Chemometrics published by John Wiley & Sons Ltd.
A new metric for weighing model bias and variance is proposed for model selection in partial least squares regression. The metric is defined by a penalty parameter that specifies the permissible increase in the minimum achievable prediction error as parsimonious solutions are explored, and an ensemble of penalty values is considered within a consensus scoring framework. This approach is shown to be useful when extrapolating calibration models developed with laboratory standards to complex environmental mixtures.
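The constraint described in the abstract can be illustrated with a minimal sketch. This is not the authors' penalized metric, whose exact form is not given here; it only shows the bound the penalty parameter is said to provide, namely picking the fewest latent variables whose RMSECV stays within a fractional tolerance of the minimum. The function name `most_parsimonious`, the parameter `alpha`, and the RMSECV values are hypothetical and chosen for illustration.

```python
import numpy as np

def most_parsimonious(rmsecv, alpha):
    """Return the smallest number of latent variables (1-based) whose RMSECV
    is within a fractional tolerance `alpha` of the minimum RMSECV.

    Illustrative stand-in only: the paper's metric is a penalty on model
    variance, not this direct threshold rule.
    """
    rmsecv = np.asarray(rmsecv, dtype=float)
    k_min = int(np.argmin(rmsecv))            # conventional minimum-RMSECV solution
    limit = (1.0 + alpha) * rmsecv[k_min]     # allowable RMSECV under the tolerance
    # Only consider models no more complex than the minimum-RMSECV model.
    candidates = np.flatnonzero(rmsecv[: k_min + 1] <= limit)
    return int(candidates[0]) + 1

# Hypothetical RMSECV values for models with 1..8 latent variables.
rmsecv = [0.90, 0.55, 0.40, 0.34, 0.31, 0.30, 0.305, 0.32]

# An ensemble of tolerances spanning a 0-30% allowable increase in RMSECV.
selected = {alpha: most_parsimonious(rmsecv, alpha) for alpha in (0.0, 0.1, 0.2, 0.3)}
print(selected)
```

With a tolerance of zero the rule reduces to the conventional minimum-RMSECV choice; larger tolerances trade a bounded increase in cross-validation error for fewer latent variables. Combining the ensemble of choices through sum of ranking differences, as the abstract describes, is not shown.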
Takahama, S., & Dillner, A. M. (2015). Model selection for partial least squares calibration and implications for analysis of atmospheric organic aerosol samples with mid-infrared spectroscopy. Journal of Chemometrics, 29(12), 659–668. https://doi.org/10.1002/cem.2761