The Annals of Applied Statistics
2007, Vol. 1, No. 2, 302-332
DOI: 10.1214/07-AOAS131
© Institute of Mathematical Statistics, 2007

PATHWISE COORDINATE OPTIMIZATION

By Jerome Friedman,[1] Trevor Hastie,[2] Holger Höfling[3] and Robert Tibshirani[4]

Stanford University

We consider "one-at-a-time" coordinate-wise descent algorithms for a class of convex optimization problems. An algorithm of this kind has been proposed for the $L_1$-penalized regression (lasso) in the literature, but it seems to have been largely ignored. Indeed, it seems that coordinate-wise algorithms are not often used in convex optimization. We show that this algorithm is very competitive with the well-known LARS (or homotopy) procedure in large lasso problems, and that it can be applied to related methods such as the garotte and elastic net. It turns out that coordinate-wise descent does not work in the "fused lasso," however, so we derive a generalized algorithm that yields the solution in much less time than a standard convex optimizer. Finally, we generalize the procedure to the two-dimensional fused lasso, and demonstrate its performance on some image smoothing problems.

Received May 2007; revised August 2007.
[1] Supported in part by NSF Grant DMS-97-64431.
[2] Supported in part by NSF Grant DMS-05-50676 and NIH Grant 2R01 CA 72028-07.
[3] Supported by an Albion Walter Hewlett Stanford Graduate Fellowship.
[4] Supported in part by NSF Grant DMS-99-71405 and NIH Contract N01-HV-28183.
Key words and phrases. Coordinate descent, lasso, convex optimization.

1. Introduction. In this paper we consider statistical models that lead to convex optimization problems with inequality constraints. Typically, the optimization for these problems is carried out using a standard quadratic programming algorithm. The purpose of this paper is to explore "one-at-a-time" coordinate-wise descent algorithms for these problems. The equivalent of a coordinate descent algorithm has been proposed for the $L_1$-penalized regression (lasso) in the literature, but it is not commonly used. Moreover, coordinate-wise algorithms seem too simple, and they are not often used in convex optimization, perhaps because they only work in specialized problems. We ourselves never appreciated the value of coordinate descent methods for convex statistical problems before working on this paper.
In this paper we show that coordinate descent is very competitive with the well-known LARS (or homotopy) procedure in large lasso problems, can deliver a path of solutions efficiently, and can be applied to many other convex statistical problems such as the garotte and elastic net. We then go on to explore a nonseparable problem in which coordinate-wise descent does not work: the "fused lasso." We derive a generalized algorithm that yields the solution in much less time than a standard convex optimizer. Finally, we generalize the procedure to the two-dimensional fused lasso, and demonstrate its performance on some image smoothing problems.

A key point here: coordinate descent works so well in the class of problems that we consider because each coordinate minimization can be done quickly, and the relevant equations can be updated as we cycle through the variables. Furthermore, often the minimizers for many of the parameters don't change as we cycle through the variables, and hence, the iterations are very fast.

Consider, for example, the lasso for regularized regression [Tibshirani (1996)]. We have predictors $x_{ij}$, $j = 1,2,\dots,p$, and outcome values $y_i$ for the $i$th observation, for $i = 1,2,\dots,n$. Assume that the $x_{ij}$ are standardized so that $\sum_i x_{ij}/n = 0$ and $\sum_i x_{ij}^2 = 1$. The lasso solves

$$\min_{\beta}\ \frac{1}{2}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 \quad \text{subject to} \quad \sum_{j=1}^p |\beta_j| \le s. \tag{1}$$

The bound $s$ is a user-specified parameter, often chosen by a model selection procedure such as cross-validation. Equivalently, the solution to (1) also minimizes the "Lagrange" version of the problem

$$f(\beta) = \frac{1}{2}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^p |\beta_j|, \tag{2}$$

where $\lambda \ge 0$. There is a one-to-one correspondence between $\lambda$ and the bound $s$: if $\hat\beta(\lambda)$ minimizes (2), then it also solves (1) with $s = \sum_{j=1}^p |\hat\beta_j(\lambda)|$. In the signal processing literature, the lasso and $L_1$ penalization are known as "basis pursuit" [Chen et al. (1998)].

There are efficient algorithms for solving this problem for all values of $s$ or $\lambda$; see Efron et al. (2004) and the homotopy algorithm of Osborne et al. (2000). There is another, simpler algorithm for solving this problem for a fixed value of $\lambda$. It relies on solving a sequence of single-parameter problems, which are assumed to be simple to solve.
Fig. 1. The lasso problem with a single standardized predictor leads to soft thresholding. In each case the solid vertical line indicates the lasso estimate, and the broken line the least-squares estimate.

With a single predictor, the lasso solution is very simple, and is a soft-thresholded version [Donoho and Johnstone (1995)] of the least squares estimate $\tilde\beta$:

$$\hat\beta^{\mathrm{lasso}}(\lambda) = S(\tilde\beta, \lambda) \equiv \operatorname{sign}(\tilde\beta)\,(|\tilde\beta| - \lambda)_+ \tag{3}$$

$$= \begin{cases} \tilde\beta - \lambda, & \text{if } \tilde\beta > 0 \text{ and } \lambda < |\tilde\beta|,\\ \tilde\beta + \lambda, & \text{if } \tilde\beta < 0 \text{ and } \lambda < |\tilde\beta|,\\ 0, & \text{if } \lambda \ge |\tilde\beta|. \end{cases} \tag{4}$$

This simple expression arises because the convex optimization problem (2) reduces to a few special cases when there is a single predictor. Minimizing the criterion (2) with a single standardized $x$ and $\beta$ simplifies to the equivalent problem

$$\min_{\beta}\ \tfrac{1}{2}(\beta - \tilde\beta)^2 + \lambda|\beta|, \tag{5}$$

where $\tilde\beta = \sum_i x_i y_i$ is the simple least-squares coefficient. If $\beta > 0$, we can differentiate (5) to get

$$\frac{df}{d\beta} = \beta - \tilde\beta + \lambda = 0. \tag{6}$$

This leads to the solution $\beta = \tilde\beta - \lambda$ (left panel of Figure 1), as long as $\tilde\beta > 0$ and $\lambda < \tilde\beta$; otherwise $0$ is the minimizing solution (right panel). Similarly, if $\beta < 0$, the solution is $\beta = \tilde\beta + \lambda$ if $\lambda < -\tilde\beta$, and $0$ otherwise.
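Since the soft-thresholding operator $S(\cdot,\lambda)$ is used repeatedly in what follows, it may help to see it in code. The sketch below is in Python with NumPy, our choice of language for illustration (the paper itself presents no code); the function name is ours.

```python
import numpy as np

def soft_threshold(b, lam):
    """Soft-thresholding operator S(b, lam) = sign(b) * (|b| - lam)_+ of (3)-(4).

    Returns b - lam  if b > 0 and lam < |b|,
            b + lam  if b < 0 and lam < |b|,
            0        if lam >= |b|.
    Operates elementwise, so b may be a scalar or an array.
    """
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)
```

For a single standardized predictor $x$, the lasso estimate is then simply `soft_threshold(x @ y, lam)`, since $\tilde\beta = \sum_i x_i y_i$.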
With multiple predictors that are uncorrelated, it is easily seen that once again the lasso solutions are soft-thresholded versions of the individual least squares estimates. This is not the case for general (correlated) predictors. Consider instead a simple iterative algorithm that applies soft-thresholding with a "partial residual" as a response variable. We write (2) as

$$f(\beta) = \frac{1}{2}\sum_{i=1}^n \Big(y_i - \sum_{k\ne j} x_{ik}\tilde\beta_k - x_{ij}\beta_j\Big)^2 + \lambda \sum_{k\ne j} |\tilde\beta_k| + \lambda|\beta_j|, \tag{7}$$

where all the values of $\tilde\beta_k$ for $k \ne j$ are held fixed at values $\tilde\beta_k(\lambda)$. Minimizing w.r.t. $\beta_j$, we get

$$\tilde\beta_j(\lambda) \leftarrow S\Big(\sum_{i=1}^n x_{ij}\big(y_i - \tilde y_i^{(j)}\big),\ \lambda\Big), \tag{8}$$

where $\tilde y_i^{(j)} = \sum_{k\ne j} x_{ik}\tilde\beta_k(\lambda)$. This is simply the univariate regression coefficient of the partial residual $y_i - \tilde y_i^{(j)}$ on the (unit $L_2$-norm) $j$th variable; hence, this has the same form as the univariate version (3) above. The update (8) is repeated for $j = 1,2,\dots,p,1,2,\dots$ until convergence. An equivalent form of this update is

$$\tilde\beta_j(\lambda) \leftarrow S\Big(\tilde\beta_j(\lambda) + \sum_{i=1}^n x_{ij}(y_i - \tilde y_i),\ \lambda\Big), \qquad j = 1,2,\dots,p,1,2,\dots. \tag{9}$$

Starting with any values for $\beta_j$, for example, the univariate regression coefficients, it can be shown that the $\tilde\beta_j(\lambda)$ values converge to $\hat\beta^{\mathrm{lasso}}$.

Fig. 2. Diabetes data: iterates for each coefficient from algorithm (9). The algorithm converges to the lasso estimates shown on the right side of the plot.

Figure 2 shows an example, using the diabetes data from Efron et al. (2004). This data has 442 observations and 10 predictors. We applied algorithm (9) with $\lambda = 88$. It produces the iterates shown in the figure and converged after 14 steps to the lasso solution $\hat\beta^{\mathrm{lasso}}(88)$.

This approach provides a simple but fast algorithm for solving the lasso, especially useful for large $p$. It was proposed in the "shooting" procedure of Fu (1998) and rediscovered by Daubechies, Defrise and De Mol (2004). Application of the same idea to the elastic net procedure [Zou and Hastie (2005)] was proposed by Van der Kooij (2007).
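To make the iteration concrete, here is a minimal sketch of cyclic coordinate descent for a fixed $\lambda$, again in Python with NumPy. It assumes the columns of `X` are standardized as above (mean zero, unit $L_2$ norm) and that `y` is centered; the function and argument names are ours, not the paper's. The residual is updated in place after each coordinate move, which is what makes full sweeps cheap, and coordinates whose minimizers do not change cost almost nothing.

```python
import numpy as np

def lasso_cd(X, y, lam, beta0=None, max_cycles=1000, tol=1e-8):
    """Cyclic coordinate descent for the lasso at penalty lam, via update (9).

    Assumes columns of X have mean zero and unit L2 norm, and y is centered.
    beta0 allows a warm start from a previous solution along the path.
    """
    n, p = X.shape
    # start from the univariate regression coefficients unless warm-started
    beta = X.T @ y if beta0 is None else beta0.astype(float)
    r = y - X @ beta                      # current residual y_i - ytilde_i
    for _ in range(max_cycles):
        delta = 0.0
        for j in range(p):
            bj = beta[j] + X[:, j] @ r    # argument of S(., lam) in (9)
            new = np.sign(bj) * max(abs(bj) - lam, 0.0)   # soft-threshold
            if new != beta[j]:
                r += X[:, j] * (beta[j] - new)   # cheap residual update
                delta = max(delta, abs(new - beta[j]))
                beta[j] = new
        if delta < tol:                   # no coefficient moved: converged
            break
    return beta
```

A pathwise version would call `lasso_cd` over a decreasing grid of $\lambda$ values, passing each solution in as `beta0` for the next; this warm-start strategy is exactly the one developed in Section 2 below.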
2. Pathwise coordinatewise optimization algorithms. The procedure described in Section 1 is a successive coordinate-wise descent algorithm for minimizing the function

$$f(\beta) = \frac{1}{2}\sum_{i}\Big(y_i - \sum_j x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^p |\beta_j|.$$

The idea is to apply a coordinate-wise descent procedure for each value of the regularization parameter, varying the regularization parameter along a path. Each solution is used as a warm start for the next problem. This approach is attractive whenever the single-parameter problem is easy to solve.

Some of these algorithms have already been proposed in the literature, and we give appropriate references. Here are some other examples:

• The nonnegative garotte. This method, a precursor to the lasso, was proposed by Breiman (1995) and solves

$$\min_{c}\ \frac{1}{2}\sum_{i=1}^n \Big(y_i - \sum_{j=1}^p x_{ij} c_j \hat\beta_j\Big)^2 + \lambda\sum_{j=1}^p c_j \quad \text{subject to} \quad c_j \ge 0. \tag{10}$$

Here $\hat\beta_j$ are the usual least squares estimates (we assume that $p \le n$). Using partial residuals as in (7), one can show that the coordinate-wise update has the form

$$c_j \leftarrow \bigg(\frac{\tilde\beta_j \hat\beta_j - \lambda}{\hat\beta_j^2}\bigg)_+, \tag{11}$$

where $\tilde\beta_j = \sum_{i=1}^n x_{ij}\big(y_i - \tilde y_i^{(j)}\big)$ and $\tilde y_i^{(j)} = \sum_{k\ne j} x_{ik} c_k \hat\beta_k$.

• Least absolute deviation regression and LAD-lasso. Here the problem is

$$\min_{\beta}\ \sum_{i=1}^n \Big|y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big|. \tag{12}$$

Holding all but $\beta_j$ fixed, we can write this as

$$\sum_{i=1}^n |x_{ij}|\,\bigg|\frac{y_i - \tilde y_i^{(j)}}{x_{ij}} - \beta_j\bigg|. \tag{13}$$

This quantity is minimized over $\beta_j$ by a weighted median of the values $\big(y_i - \tilde y_i^{(j)}\big)/x_{ij}$. Hence, coordinate-wise minimization is just a repeated computation of medians. This approach is studied and refined by Li and Arce (2004). The same approach may be used in the LAD-lasso [Wang et al. (2006)]. Here we add an $L_1$ penalty to the least absolute deviation loss:

$$\min_{\beta}\ \sum_{i=1}^n \Big|y_i - \beta_0 - \sum_{j=1}^p x_{ij}\beta_j\Big| + \lambda\sum_{j=1}^p |\beta_j|. \tag{14}$$

This can be converted into an LAD problem (12) by augmenting the dataset with $p$ artificial observations. In detail, let $X$ denote the $n \times (p +$