A comparison of coordinate descent, generalized gradient (ISTA) and Nesterov's momentum method (FISTA) for the lasso regression problem and L1-regularized logistic regression

Aug 18, 2010

Coordinate descent: we used the glmnet Fortran package (Friedman, Hastie & Tibshirani).
ISTA, FISTA: we coded efficient algorithms in Fortran.

Problem setup:

  N observations, p variables, population correlation 0 or 0.5.
  Coefficients: for the Gaussian model, |beta_j| = exp[-0.5*(u*(j-1))^2] with u = sqrt(pi/20);
    sign(beta_1) = 1, sign(beta_2) = -1, sign(beta_3) = 1, etc. (alternating signs).
  Noise standard deviation sigma chosen so that SNR = sd(E(y))/sigma = 3.
  For the logistic model: 15 nonzero coefficients with alternating signs, |beta_j| = 15 - j + 1.
  20 values of the regularization parameter lambda, flmin = 0.01 (ratio of smallest to largest lambda).
  Convergence threshold = 1.0e-4 (maximum absolute change in the coefficients).
  Warm starting for all methods.

Methods in more detail:

  The generalized gradient and Nesterov methods used backtracking (see the fixed-step
  comparisons at the bottom of this document).

  grad1     = no momentum, step size t not reset to 1.0 at each step
  grad2     = no momentum, step size reset to 1.0 at each step
  grad_mom1 = momentum, step size not reset to 1.0 at each step
  grad_mom2 = momentum, step size reset to 1.0 at each step

  opt = 0 => 2nd order information used
  opt = 1 => 2nd order information not used
  (opt appears only in the logistic runs below.)

  A rough sketch of the four gradient variants is given after the conclusions below.

Results

  Time in seconds, averaged over 5 runs.
  The number in parentheses after each time is the number of iterations of the outer loop.
  * = fastest gradient method.

Gaussian

  p = 100, N = 10000, pop. corr = 0.0

    glmnet   grad1       grad2*      grad_mom1   grad_mom2
    0.14     0.21(80)    0.19(71)    0.26(90)    0.27(89)

  p = 100, N = 10000, pop. corr = 0.5

    glmnet   grad1       grad2       grad_mom1*  grad_mom2
    0.45     1.06(416)   1.05(249)   0.83(275)   1.56(290)

  p = 10000, N = 200, pop. corr = 0.0

    glmnet   grad1       grad2*      grad_mom1   grad_mom2
    0.14     1.66(382)   1.26(258)   1.94(443)   1.95(385)

  p = 10000, N = 200, pop. corr = 0.5

    glmnet   grad1       grad2*      grad_mom1   grad_mom2
    0.43     4.11(676)   2.49(222)   4.29(767)   5.00(660)

Logistic

  p = 100, N = 10000, pop. corr = 0.0

    opt   glmnet   grad2        grad_mom1     grad_mom2
    0     0.62     2.23(398)    1.72(281)     1.68(227)
    1     1.27     3.28(940)    1.51(393)*    1.52(393)

  p = 100, N = 10000, pop. corr = 0.5

    opt   glmnet   grad2        grad_mom1     grad_mom2
    0     0.67     3.37(310)    5.57(783)     4.65(343)
    1     1.84     4.11(976)    4.19(907)     2.84(503)*

  p = 10000, N = 200, pop. corr = 0.0

    opt   glmnet   grad2        grad_mom1     grad_mom2
    0     0.62     6.59(708)    10.4(1199)    7.54(769)
    1     0.70     6.00(1383)   5.64(1306)    4.47(989)*

  p = 10000, N = 200, pop. corr = 0.5

    opt   glmnet   grad2        grad_mom1     grad_mom2
    0     0.83     8.19(798)    28.4(3087)    15.0(1338)
    1     0.76     9.55(2080)   14.5(3173)    8.05(1544)*

CONCLUSIONS:

  1. Coordinate descent is the clear winner, especially when p >> N and especially for the
     logistic model.
  2. Momentum seems to speed up the generalized gradient method when the predictors are
     correlated.
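As a rough illustration of the four gradient variants above (not the Fortran code that was
actually timed), here is a minimal Python/NumPy sketch of the proximal gradient iteration for
the lasso with backtracking, a momentum switch, a step-reset switch, and warm starts over the
lambda path. All function names, the shrink factor, and the iteration caps are our own
illustrative choices.

import numpy as np

def soft_threshold(z, gamma):
    # Elementwise soft-thresholding: the proximal operator of gamma * ||.||_1.
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_prox_gradient(X, y, lam, beta0=None, momentum=False, reset_step=True,
                        shrink=0.5, tol=1e-4, max_iter=10000):
    # Minimize (1/(2N)) ||y - X beta||^2 + lam * ||beta||_1.
    # momentum=False   -> ISTA (grad1/grad2);  momentum=True -> FISTA (grad_mom1/grad_mom2)
    # reset_step=True  -> step size restarted at 1.0 every iteration (grad2/grad_mom2)
    # reset_step=False -> step size only ever shrinks (grad1/grad_mom1)
    N, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    theta = beta.copy()              # extrapolated point (equals beta when momentum is off)
    t, k = 1.0, 1.0                  # step size and momentum counter
    for _ in range(max_iter):
        if reset_step:
            t = 1.0
        r = X @ theta - y
        g = X.T @ r / N              # gradient of the smooth part at theta
        f_theta = 0.5 * (r @ r) / N
        while True:                  # backtracking: shrink t until the quadratic bound holds
            beta_new = soft_threshold(theta - t * g, t * lam)
            d = beta_new - theta
            r_new = X @ beta_new - y
            if 0.5 * (r_new @ r_new) / N <= f_theta + g @ d + (d @ d) / (2.0 * t):
                break
            t *= shrink
        if momentum:                 # Nesterov extrapolation
            k_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * k * k))
            theta = beta_new + ((k - 1.0) / k_new) * (beta_new - beta)
            k = k_new
        else:
            theta = beta_new
        if np.max(np.abs(beta_new - beta)) < tol:   # max abs change in the coefficients
            return beta_new
        beta = beta_new
    return beta

def lasso_path(X, y, n_lambda=20, flmin=0.01, **kwargs):
    # Warm-started path over a decreasing lambda sequence, as in the timings above.
    N, p = X.shape
    lam_max = np.max(np.abs(X.T @ y)) / N           # smallest lambda giving beta = 0
    lams = lam_max * np.logspace(0.0, np.log10(flmin), n_lambda)
    beta, path = np.zeros(p), []
    for lam in lams:
        beta = lasso_prox_gradient(X, y, lam, beta0=beta, **kwargs)
        path.append((lam, beta.copy()))
    return path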
Generalized gradient descent: backtracking vs. fixed step results

  20 lambda values, flmin = 0.01, warm starting.
  Convergence threshold = 1.0e-5.

Notes:

  (1) A threshold of 1.0e-4 does not work with a fixed step when the predictors are
      correlated; it works fine with backtracking.
  (2) For the fixed step, the step size was optimized separately for each problem; the
      optimal step size is the second value in parentheses.
  (3) The number of iterations of the outer loop is the first value in parentheses.

Time (sec.)

                                          backtracking                fixed step size
                                          no momentum   momentum      no momentum            momentum
  p = 100,   N = 10000, pop. corr = 0.0   0.278(109)    0.403(139)    0.209(80, 1.0)         0.278(108, 1.0)
  p = 100,   N = 10000, pop. corr = 0.5   2.84(1157)    2.23(799)     5.76(2476, 0.06)       5.26(2252, 0.04)
  p = 10000, N = 200,   pop. corr = 0.0   9.03(2144)    9.03(2112)    9.02(2180, 0.15)       9.06(2100, 0.12)
  p = 10000, N = 200,   pop. corr = 0.5   25.31(5717)   24.97(5562)   78.16(17890, 0.0005)   39.51(8884, 0.0012)

Conclusions:

  (1) Backtracking is competitive with the fixed OPTIMAL step size. In practice, any
      predetermined fixed step size would likely be quite conservative, leading to many
      more steps and much more time.
  (2) For the same number of steps, both methods take about the same amount of time, so
      the backtracking part of the algorithm accounts for only a small fraction of the
      overall computation.
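To make the "conservative" point concrete: the usual way to fix a step size in advance is
t = 1/L, where L = lambda_max(X'X)/N is a Lipschitz constant for the gradient of the
squared-error term. Below is a small Python sketch (again our own illustration, not the timed
Fortran code) of that choice together with plain fixed-step ISTA; because L reflects the
worst-case curvature, this safe step is typically smaller than a step tuned to a particular
problem.

import numpy as np

def safe_fixed_step(X):
    # Largest step size guaranteed safe without backtracking for
    # (1/(2N)) ||y - X beta||^2 + lam * ||beta||_1:  t = 1/L,
    # with L = lambda_max(X'X)/N the Lipschitz constant of the gradient.
    N = X.shape[0]
    L = np.linalg.norm(X, 2) ** 2 / N       # squared spectral norm of X
    return 1.0 / L

def fixed_step_ista(X, y, lam, t, beta0=None, tol=1e-5, max_iter=100000):
    # Plain fixed-step generalized gradient (no momentum, no backtracking).
    N, p = X.shape
    beta = np.zeros(p) if beta0 is None else beta0.copy()
    for _ in range(max_iter):
        z = beta - t * (X.T @ (X @ beta - y) / N)                      # gradient step
        beta_new = np.sign(z) * np.maximum(np.abs(z) - t * lam, 0.0)   # soft-threshold
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Example: beta = fixed_step_ista(X, y, lam, t=safe_fixed_step(X))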