A comparison of various methods for multivariate regression with highly collinear variables
- ISSN: 16182510
- DOI: 10.1007/s10260-006-0025-5
ORIGINAL ARTICLE
Henk A. L. Kiers · Age K. Smilde
Accepted: 23 August 2006 / Published online: 27 October 2006
© Springer-Verlag 2006
Abstract Regression tends to give very unstable and unreliable regression
weights when predictors are highly collinear. Several methods have been
proposed to counter this problem. A subset of these do so by finding components
that summarize the information in the predictors and the criterion variables.
The present paper compares six such methods (two of which are almost completely
new) to ordinary regression. The six methods are partial least squares (PLS),
principal component regression (PCR), principal covariates regression, reduced
rank regression, and two variants of what is called power regression. The
comparison is mainly done by means of a series of simulation studies, in which
data are constructed in various ways, with different degrees of collinearity
and noise, and the methods are compared in terms of their capability of
recovering the population regression weights, as well as their prediction
quality for the complete population. It turns out that recovery of regression
weights in situations with collinearity is often very poor for all methods,
unless the regression weights lie in the subspace spanned by the first few
principal components of the predictor variables. In those cases, PLS and PCR
typically give the best recoveries of regression weights. The picture is
inconclusive, however, because, especially in the study with more real-life-like
simulated data, PLS and PCR gave the poorest recoveries of regression weights
in conditions with relatively low noise and collinearity. It seems that PLS and
PCR are particularly indicated in cases with strong collinearity, whereas in
other cases it is better to use ordinary regression. As far as prediction is
concerned, it suffers far less from collinearity than recovery of the
regression weights.
H. A. L. Kiers (✉)
Heymans Institute, University of Groningen, Groningen, The Netherlands
e-mail: H.A.L.Kiers@rug.nl
A. K. Smilde
Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam,
The Netherlands
Keywords Multivariate regression · PLS · Principal component regression ·
Principal covariate regression · Power regression · Multicollinearity
1 Introduction
Many problems in applied sciences can be cast in the framework of a regression
problem. Such a regression model is then used to relate a set of predictors
(independent variables) to a criterion variable (dependent variable). Examples
of such problems are abundant. In process analysis, on-line (near-) infrared
spectroscopy can be used to predict concentrations of compounds in a process
stream (Van Sprang 2002) or to predict octane numbers of gasoline (Kelly
et al. 1989). In chemical engineering, regression models are used to create
inferential sensors (Kresta et al. 1994). A completely different application is
strain improvement in biotechnology: regression models are used to relate
concentrations of metabolites (small chemical compounds) to productivity of a
biotechnological process (Van der Werf 2005).
A regression model can serve several purposes. In process analysis and
chemical engineering applications, the purpose is almost exclusively prediction.
Hence, the regression model generates a prediction rule, relating the ‘easy’
measurements (predictors) to the ‘complicated’ one (criterion), thereby avoiding
the necessity of having to measure the ‘complicated’ criterion (e.g. an octane
number). In other applications, e.g., the biotechnology example referred to
earlier, the focus is on understanding the relationship between the predictors
and the criterion variable. Hence, the regression weights are more important.
Ideally, these weights could be used to understand, e.g., the relative importance
of the metabolites for the productivity of a certain strain.
Ordinary regression aims at finding an optimal rule for predicting scores on a
criterion variable on the basis of scores on a number of predictor variables. The
prediction rule is obtained by analyzing data on a training sample, for which
scores on both the predictor variables and the criterion variable are available,
and finding that linear combination of variables that approximates most closely
the scores on the criterion variable. The regression weights then define the
prediction rule, which is meant to be useful in situations where it is desired to
estimate the unknown scores on a criterion variable, while only the scores on
the predictor variables are available. Obviously, the usefulness of a prediction
rule does not reside in its performance for the data on the basis of which it was
obtained, but in its performance on other (e.g., future) data. In other words, a
regression rule should primarily have good generalizability properties.
Moreover, for applications in which the regression weights are important, the weights
found by the model should reflect the true underlying phenomena.
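The distinction drawn above, between performance on the training sample, performance on other data, and recovery of the true weights, can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the simulation design used in this paper: the data generator `make_data`, its `collinearity` parameter (the correlation induced between any two predictors through one shared factor), the sample sizes, and the chosen population weights are all hypothetical, and the model is fitted without an intercept because the simulated variables have mean zero in the population.

```python
import numpy as np


def make_data(n, true_weights, collinearity, noise_sd, rng):
    """Simulate n cases (hypothetical design, for illustration only).

    Each predictor is a mix of one shared factor and a unique part, so that
    any two predictors correlate `collinearity` in the population.
    """
    p = true_weights.shape[0]
    common = rng.standard_normal((n, 1))   # factor shared by all predictors
    unique = rng.standard_normal((n, p))   # predictor-specific variation
    X = np.sqrt(collinearity) * common + np.sqrt(1.0 - collinearity) * unique
    y = X @ true_weights + noise_sd * rng.standard_normal(n)
    return X, y


def fit_ols(X, y):
    """Ordinary least squares weights (no intercept) for criterion y."""
    weights, *_ = np.linalg.lstsq(X, y, rcond=None)
    return weights


def mse(X, y, weights):
    """Mean squared prediction error of the linear rule X @ weights."""
    return float(np.mean((y - X @ weights) ** 2))


rng = np.random.default_rng(0)
true_weights = np.array([1.0, -1.0, 0.5, 0.0])  # assumed population weights

# Small training sample and a large sample of "other" data, same population.
X_train, y_train = make_data(30, true_weights, 0.99, 1.0, rng)
X_new, y_new = make_data(1000, true_weights, 0.99, 1.0, rng)

ols_weights = fit_ols(X_train, y_train)

print("estimated weights:    ", ols_weights)
print("training MSE:         ", mse(X_train, y_train, ols_weights))
print("new-data MSE:         ", mse(X_new, y_new, ols_weights))
print("weight recovery error:", float(np.sum((ols_weights - true_weights) ** 2)))
```

Rerunning the sketch with a lower `collinearity` value or a larger training sample gives a rough feel for how the three quantities respond. By construction, the training MSE of the fitted weights can never exceed that of the population weights on the same sample, which is why training fit alone says little about generalizability.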
It is well known that the prediction rule resulting from ordinary multiple
regression is rather prone to lack of generalizability, as will be explained in




