Least Angle Regression

by Bradley Efron, Trevor Hastie, Iain Johnstone, Robert Tibshirani
The Annals of Statistics (2004)

Abstract

The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method;


Available from arxiv.org

arXiv:math/0406456v2 [math.ST] 30 Jun 2004

The Annals of Statistics 2004, Vol. 32, No. 2, 407–451
DOI: 10.1214/009053604000000067
© Institute of Mathematical Statistics, 2004

LEAST ANGLE REGRESSION

By Bradley Efron¹, Trevor Hastie², Iain Johnstone³ and Robert Tibshirani⁴
Stanford University

The purpose of model selection algorithms such as All Subsets, Forward Selection and Backward Elimination is to choose a linear model on the basis of the same set of data to which the model will be applied. Typically we have available a large collection of possible covariates from which we hope to select a parsimonious set for the efficient prediction of a response variable. Least Angle Regression (LARS), a new model selection algorithm, is a useful and less greedy version of traditional forward selection methods. Three main properties are derived: (1) A simple modification of the LARS algorithm implements the Lasso, an attractive version of ordinary least squares that constrains the sum of the absolute regression coefficients; the LARS modification calculates all possible Lasso estimates for a given problem, using an order of magnitude less computer time than previous methods. (2) A different LARS modification efficiently implements Forward Stagewise linear regression, another promising new model selection method; this connection explains the similar numerical results previously observed for the Lasso and Stagewise, and helps us understand the properties of both methods, which are seen as constrained versions of the simpler LARS algorithm. (3) A simple approximation for the degrees of freedom of a LARS estimate is available, from which we derive a $C_p$ estimate of prediction error; this allows a principled choice among the range of possible LARS estimates. LARS and its variants are computationally efficient: the paper describes a publicly available algorithm that requires only the same order of magnitude of computational effort as ordinary least squares applied to the full set of covariates.

Received March 2002; revised January 2003.
¹Supported in part by NSF Grant DMS-00-72360 and NIH Grant 8R01-EB002784.
²Supported in part by NSF Grant DMS-02-04162 and NIH Grant R01-EB0011988-08.
³Supported in part by NSF Grant DMS-00-72661 and NIH Grant R01-EB001988-08.
⁴Supported in part by NSF Grant DMS-99-71405 and NIH Grant 2R01-CA72028.
AMS 2000 subject classification. 62J07.
Key words and phrases. Lasso, boosting, linear regression, coefficient paths, variable selection.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2004, Vol. 32, No. 2, 407–451. This reprint differs from the original in pagination and typographic detail.
1. Introduction. Automatic model-building algorithms are familiar, and sometimes notorious, in the linear model literature: Forward Selection, Backward Elimination, All Subsets regression and various combinations are used to automatically produce "good" linear models for predicting a response $y$ on the basis of some measured covariates $x_1, x_2, \ldots, x_m$. Goodness is often defined in terms of prediction accuracy, but parsimony is another important criterion: simpler models are preferred for the sake of scientific insight into the $x$–$y$ relationship. Two promising recent model-building algorithms, the Lasso and Forward Stagewise linear regression, will be discussed here, and motivated in terms of a computationally simpler method called Least Angle Regression.

Least Angle Regression (LARS) relates to the classic model-selection method known as Forward Selection, or "forward stepwise regression," described in Weisberg [(1980), Section 8.5]: given a collection of possible predictors, we select the one having largest absolute correlation with the response $y$, say $x_{j_1}$, and perform simple linear regression of $y$ on $x_{j_1}$. This leaves a residual vector orthogonal to $x_{j_1}$, now considered to be the response. We project the other predictors orthogonally to $x_{j_1}$ and repeat the selection process. After $k$ steps this results in a set of predictors $x_{j_1}, x_{j_2}, \ldots, x_{j_k}$ that are then used in the usual way to construct a $k$-parameter linear model. Forward Selection is an aggressive fitting technique that can be overly greedy, perhaps eliminating at the second step useful predictors that happen to be correlated with $x_{j_1}$.

Forward Stagewise, as described below, is a much more cautious version of Forward Selection, which may take thousands of tiny steps as it moves toward a final model.
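The Forward Selection procedure described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' S program: the function names (`forward_selection`, `remove_component`, `abs_correlation`) and the toy vectors are assumptions introduced here, and the sketch returns only the order in which predictors are chosen.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def remove_component(v, direction):
    """Project v orthogonally to `direction` (subtract its projection)."""
    coef = dot(v, direction) / dot(direction, direction)
    return [a - coef * d for a, d in zip(v, direction)]

def abs_correlation(x, residual):
    """|x'r| / ||x||; the factor 1/||r|| is shared by all candidates, so it is omitted."""
    norm = math.sqrt(dot(x, x))
    return 0.0 if norm < 1e-12 else abs(dot(x, residual)) / norm

def forward_selection(columns, y, k):
    """Return the indices of up to k predictors, in the order they are selected."""
    residual = center(y)
    work = {j: center(col) for j, col in enumerate(columns)}  # remaining candidates
    selected = []
    for _ in range(min(k, len(work))):
        scores = {j: abs_correlation(x, residual) for j, x in work.items()}
        best = max(scores, key=scores.get)
        if scores[best] == 0.0:
            break  # nothing left that correlates with the residual
        direction = work.pop(best)
        selected.append(best)
        # Regress the current response on the chosen predictor and keep the residual,
        # then project the other predictors orthogonally to it.
        residual = remove_component(residual, direction)
        work = {j: remove_component(x, direction) for j, x in work.items()}
    return selected

# Toy data (hypothetical): three mutually orthogonal, centered predictors.
xa = [-2, -1, 0, 1, 2]
xb = [2, -1, -2, -1, 2]
xc = [1, -2, 0, 2, -1]
y_toy = [5 * a + b + 0.1 * c for a, b, c in zip(xa, xb, xc)]
order = forward_selection([xc, xa, xb], y_toy, 2)
print(order)
```

With orthogonal predictors the greedy choice is harmless; the greediness discussed in the text shows up only when candidates are correlated with an already selected predictor, because the projection step then removes most of what they could have contributed.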
It turns out, and this was the original motivation for the LARS algorithm, that a simple formula allows Forward Stagewise to be implemented using fairly large steps, though not as large as a classic Forward Selection, greatly reducing the computational burden. The geometry of the algorithm, described in Section 2, suggests the name "Least Angle Regression." It then happens that this same geometry applies to another, seemingly quite different, selection method called the Lasso [Tibshirani (1996)]. The LARS–Lasso–Stagewise connection is conceptually as well as computationally useful. The Lasso is described next, in terms of the main example used in this paper.

Table 1 shows a small part of the data for our main example. Ten baseline variables, age, sex, body mass index, average blood pressure and six blood serum measurements, were obtained for each of $n = 442$ diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline. The statisticians were asked to construct a model that predicted response $y$ from covariates $x_1, x_2, \ldots, x_{10}$. Two hopes were evident here, that the model would produce accurate baseline predictions of response for future patients and that the form of the model
would suggest which covariates were important factors in disease progression.

The Lasso is a constrained version of ordinary least squares (OLS). Let $x_1, x_2, \ldots, x_m$ be $n$-vectors representing the covariates, $m = 10$ and $n = 442$ in the diabetes study, and let $y$ be the vector of responses for the $n$ cases. By location and scale transformations we can always assume that the covariates have been standardized to have mean 0 and unit length, and that the response has mean 0,

    $\sum_{i=1}^{n} y_i = 0, \quad \sum_{i=1}^{n} x_{ij} = 0, \quad \sum_{i=1}^{n} x_{ij}^2 = 1 \quad \text{for } j = 1, 2, \ldots, m.$    (1.1)

This is assumed to be the case in the theory which follows, except that numerical results are expressed in the original units of the diabetes example.

A candidate vector of regression coefficients $\hat\beta = (\hat\beta_1, \hat\beta_2, \ldots, \hat\beta_m)'$ gives prediction vector $\hat\mu$,

    $\hat\mu = \sum_{j=1}^{m} x_j \hat\beta_j = X\hat\beta \quad [X_{n \times m} = (x_1, x_2, \ldots, x_m)]$    (1.2)

with total squared error

    $S(\hat\beta) = \|y - \hat\mu\|^2 = \sum_{i=1}^{n} (y_i - \hat\mu_i)^2.$    (1.3)

Let $T(\hat\beta)$ be the absolute norm of $\hat\beta$,

    $T(\hat\beta) = \sum_{j=1}^{m} |\hat\beta_j|.$    (1.4)

Table 1. Diabetes study: 442 diabetes patients were measured on 10 baseline variables; a prediction model was desired for the response variable, a measure of disease progression one year after baseline.

         AGE  SEX  BMI   BP    Serum measurements                   Response
Patient  x1   x2   x3    x4    x5    x6     x7    x8    x9    x10   y
1        59   2    32.1  101   157   93.2   38    4     4.9   87    151
2        48   1    21.6  87    183   103.2  70    3     3.9   69    75
3        72   2    30.5  93    156   93.6   41    4     4.7   85    141
4        24   1    25.3  84    198   131.4  40    5     4.9   89    206
5        50   1    23.0  101   192   125.4  52    4     4.3   80    135
6        23   1    22.6  89    139   64.8   61    2     4.2   68    97
...
441      36   1    30.0  95    201   125.2  42    5     5.1   85    220
442      36   1    19.6  71    250   133.2  97    3     4.6   92    57
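As a concrete reading of (1.1) to (1.4), the sketch below standardizes two covariates and evaluates $S$ and $T$ for a candidate coefficient vector. It uses only the BMI, BP and response values of the first six patients listed in Table 1, so the numbers it prints say nothing about the full 442-patient study. The function names and the candidate `beta` are assumptions chosen for illustration.

```python
import math

def center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def standardize(column):
    """Center to mean 0 and scale to unit length, as required by (1.1)."""
    c = center(column)
    length = math.sqrt(sum(a * a for a in c))
    return [a / length for a in c]

def predict(columns, beta):
    """mu = X beta as in (1.2); X is stored as a list of column vectors x_j."""
    n = len(columns[0])
    return [sum(col[i] * b for col, b in zip(columns, beta)) for i in range(n)]

def squared_error(y, mu):
    """S(beta) = ||y - mu||^2, as in (1.3)."""
    return sum((a - m) ** 2 for a, m in zip(y, mu))

def abs_norm(beta):
    """T(beta) = sum_j |beta_j|, as in (1.4)."""
    return sum(abs(b) for b in beta)

# First six rows of Table 1: BMI (x3), BP (x4) and the response y.
bmi = [32.1, 21.6, 30.5, 25.3, 23.0, 22.6]
bp = [101, 87, 93, 84, 101, 89]
y_raw = [151, 75, 141, 206, 135, 97]

X = [standardize(bmi), standardize(bp)]
y = center(y_raw)
beta = [50.0, -10.0]  # arbitrary candidate coefficients, not a fitted model
mu = predict(X, beta)
print("S(beta) =", squared_error(y, mu))
print("T(beta) =", abs_norm(beta))
```

Storing $X$ column by column mirrors the paper's notation $X = (x_1, x_2, \ldots, x_m)$ and keeps each covariate available as a single vector.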
Fig. 1. Estimates of regression coefficients $\hat\beta_j$, $j = 1, 2, \ldots, 10$, for the diabetes study. (Left panel) Lasso estimates, as a function of $t = \sum_j |\hat\beta_j|$. The covariates enter the regression equation sequentially as $t$ increases, in order $j = 3, 9, 4, 7, \ldots, 1$. (Right panel) The same plot for Forward Stagewise Linear Regression. The two plots are nearly identical, but differ slightly for large $t$ as shown in the track of covariate 8.

The Lasso chooses $\hat\beta$ by minimizing $S(\hat\beta)$ subject to a bound $t$ on $T(\hat\beta)$,

    Lasso: minimize $S(\hat\beta)$ subject to $T(\hat\beta) \le t$.    (1.5)

Quadratic programming techniques can be used to solve (1.5) though we will present an easier method here, closely related to the "homotopy method" of Osborne, Presnell and Turlach (2000a).

The left panel of Figure 1 shows all Lasso solutions $\hat\beta(t)$ for the diabetes study, as $t$ increases from 0, where $\hat\beta = 0$, to $t = 3460.00$, where $\hat\beta$ equals the OLS regression vector, the constraint in (1.5) no longer binding. We see that the Lasso tends to shrink the OLS coefficients toward 0, more so for small values of $t$. Shrinkage often improves prediction accuracy, trading off decreased variance for increased bias as discussed in Hastie, Tibshirani and Friedman (2001).

The Lasso also has a parsimony property: for any given constraint value $t$, only a subset of the covariates have nonzero values of $\hat\beta_j$. At $t = 1000$, for example, only variables 3, 9, 4 and 7 enter the Lasso regression model (1.2). If this model provides adequate predictions, a crucial question considered in Section 4, the statisticians could report these four variables as the important ones.

Forward Stagewise Linear Regression, henceforth called Stagewise, is an iterative technique that begins with $\hat\mu = 0$ and builds up the regression
function in successive small steps. If $\hat\mu$ is the current Stagewise estimate, let $c(\hat\mu)$ be the vector of current correlations

    $\hat{c} = c(\hat\mu) = X'(y - \hat\mu),$    (1.6)

so that $\hat{c}_j$ is proportional to the correlation between covariate $x_j$ and the current residual vector. The next step of the Stagewise algorithm is taken in the direction of the greatest current correlation,

    $\hat{j} = \arg\max_j |\hat{c}_j| \quad \text{and} \quad \hat\mu \to \hat\mu + \varepsilon \cdot \operatorname{sign}(\hat{c}_{\hat{j}}) \cdot x_{\hat{j}},$    (1.7)

with $\varepsilon$ some small constant. "Small" is important here: the "big" choice $\varepsilon = |\hat{c}_{\hat{j}}|$ leads to the classic Forward Selection technique, which can be overly greedy, impulsively eliminating covariates which are correlated with $x_{\hat{j}}$. The Stagewise procedure is related to boosting and also to Friedman's MART algorithm [Friedman (2001)]; see Section 8, as well as Hastie, Tibshirani and Friedman [(2001), Chapter 10 and Algorithm 10.4].

The right panel of Figure 1 shows the coefficient plot for Stagewise applied to the diabetes data. The estimates were built up in 6000 Stagewise steps [making $\varepsilon$ in (1.7) small enough to conceal the "Etch-a-Sketch" staircase seen in Figure 2, Section 2]. The striking fact is the similarity between the Lasso and Stagewise estimates. Although their definitions look completely different, the results are nearly, but not exactly, identical.

The main point of this paper is that both Lasso and Stagewise are variants of a basic procedure called Least Angle Regression, abbreviated LARS (the "S" suggesting "Lasso" and "Stagewise"). Section 2 describes the LARS algorithm while Section 3 discusses modifications that turn LARS into Lasso or Stagewise, reducing the computational burden by at least an order of magnitude for either one. Sections 5 and 6 verify the connections stated in Section 3.
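The Stagewise update (1.6) and (1.7) is short enough to write out directly. The following is a minimal sketch in plain Python under the standardization (1.1); the function name `stagewise`, the bookkeeping of a coefficient vector `beta` alongside the fitted vector, and the two toy covariates are assumptions made for illustration and are not taken from the authors' software.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    length = math.sqrt(dot(v, v))
    return [a / length for a in v]

def stagewise(columns, y, eps, n_steps):
    """Forward Stagewise following (1.6)-(1.7).

    columns: list of standardized covariate vectors x_j
    y:       centered response vector
    eps:     the small step constant of (1.7)
    Returns (beta, mu): accumulated coefficients and the fitted vector.
    """
    beta = [0.0] * len(columns)
    mu = [0.0] * len(y)
    for _ in range(n_steps):
        residual = [a - b for a, b in zip(y, mu)]
        c = [dot(x, residual) for x in columns]          # (1.6): c = X'(y - mu)
        j = max(range(len(c)), key=lambda k: abs(c[k]))  # greatest current correlation
        if c[j] == 0.0:
            break  # residual is orthogonal to every covariate
        step = eps if c[j] > 0 else -eps                 # eps * sign(c_j)
        beta[j] += step
        mu = [u + step * a for u, a in zip(mu, columns[j])]  # (1.7)
    return beta, mu

# Toy example (hypothetical): two orthogonal, centered, unit-length covariates,
# with a response whose OLS coefficients are (3, -1).
x0 = unit([-2, -1, 0, 1, 2])
x1 = unit([2, -1, -2, -1, 2])
y = [3 * a - b for a, b in zip(x0, x1)]

early, _ = stagewise([x0, x1], y, eps=0.01, n_steps=100)
late, _ = stagewise([x0, x1], y, eps=0.01, n_steps=2000)
print("after 100 steps: ", early)
print("after 2000 steps:", late)
```

In this toy setting the early steps all go to the covariate with the larger correlation, so the second coefficient is still zero after 100 steps, while a long run settles within about one step size of the OLS coefficients. This is the same sequential entry of covariates, ending at the OLS fit, that the coefficient paths in Figure 1 display.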
Least Angle Regression is interesting in its own right, its simple structure lending itself to inferential analysis. Section 4 analyzes the "degrees of freedom" of a LARS regression estimate. This leads to a $C_p$ type statistic that suggests which estimate we should prefer among a collection of possibilities like those in Figure 1. A particularly simple $C_p$ approximation, requiring no additional computation beyond that for the $\hat\beta$ vectors, is available for LARS.

Section 7 briefly discusses computational questions. An efficient S program for all three methods, LARS, Lasso and Stagewise, is available. Section 8 elaborates on the connections with boosting.

2. The LARS algorithm. Least Angle Regression is a stylized version of the Stagewise procedure that uses a simple mathematical formula to accelerate the computations. Only $m$ steps are required for the full set of
