Biostatistical Computing, PHC 6068

Ridge regression, Lasso and elastic net.

Zhiguang Huo (Caleb)

Wednesday October 4th, 2017

Outline

linear regression model

\[Y = X\beta + \varepsilon\]

Solution to linear regression model
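For reference, the ordinary least squares solution (standard result, assuming \(X^\top X\) is invertible):

\[\hat{\beta}^{OLS} = \arg \min_\beta \frac{1}{2}\|Y - X\beta\|_2^2 = (X^\top X)^{-1} X^\top Y\]

When predictors are highly collinear, \(X^\top X\) is nearly singular and this solution becomes unstable, which is the problem illustrated next.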

Problem with linear regression

Collinearity

n <- 100
set.seed(32611)
x1 <- rnorm(n,3)
x2 <- rnorm(n,5)
x3 <- x2 + rnorm(n, sd=0.1) ## x3 is nearly a copy of x2 (collinear)
cor(x2, x3)
## [1] 0.9957515
x <- data.frame(x1,x2,x3)
y <- 2*x1 + 3*x2 + 4*x3 + rnorm(n, sd = 3) ## true coefficients are 2, 3, 4
xyData <- cbind(y,x)
lmFit <- lm(y~x1 + x2 + x3, data=xyData)
summary(lmFit)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = xyData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2838  -1.7470   0.0118   1.8627   6.9231 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.5342     1.7695  -0.867   0.3881    
## x1            2.2746     0.3116   7.299 8.39e-11 ***
## x2            5.2712     3.1574   1.669   0.0983 .  
## x3            1.8343     3.1201   0.588   0.5580    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.9 on 96 degrees of freedom
## Multiple R-squared:  0.8679, Adjusted R-squared:  0.8638 
## F-statistic: 210.3 on 3 and 96 DF,  p-value: < 2.2e-16

How to test collinearity

A common diagnostic is the variance inflation factor (VIF); as a rule of thumb, a VIF much larger than 10 suggests serious collinearity.

library(car)
vif(lmFit)
##         x1         x2         x3 
##   1.014903 118.852908 119.013351
summary(lm(y~x1 + x2, data=xyData))
## 
## Call:
## lm(formula = y ~ x1 + x2, data = xyData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3139  -1.7410  -0.1042   1.8065   7.0280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.5612     1.7629  -0.886    0.378    
## x1            2.2572     0.3092   7.301 7.97e-11 ***
## x2            7.1196     0.2895  24.595  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.891 on 97 degrees of freedom
## Multiple R-squared:  0.8674, Adjusted R-squared:  0.8647 
## F-statistic: 317.4 on 2 and 97 DF,  p-value: < 2.2e-16

To solve these problems

Over the past several decades, regularization methods have provided better solutions to this problem.

Ridge regression

\[\hat{\beta} = \arg \min_\beta \frac{1}{2}\| Y - X\beta\|_2^2 + \lambda \| \beta\|^2_2\]

- \(\|a\|_2 = \sqrt{a_1^2 + \ldots + a_p^2}\)

Ridge regression solution

\[\hat{\beta} = \arg \min_\beta \frac{1}{2}\| Y - X\beta\|_2^2 + \lambda \| \beta\|^2_2\]
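Setting the gradient of this objective to zero (a derivation step added here for completeness) gives a closed-form solution:

\[-X^\top(Y - X\beta) + 2\lambda\beta = 0 \quad \Rightarrow \quad \hat{\beta}^{ridge} = (X^\top X + 2\lambda I)^{-1} X^\top Y\]

With the more common scaling \(\min_\beta \|Y - X\beta\|_2^2 + \lambda\|\beta\|_2^2\), the solution is \((X^\top X + \lambda I)^{-1} X^\top Y\). In either form, adding \(\lambda\) to the diagonal keeps the matrix invertible even when the predictors are collinear.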

library(MASS)
lm.ridge(y~x1 + x2 + x3, data=xyData, lambda = 10)
##                  x1        x2        x3 
## 0.9529628 2.0535563 3.4643002 3.2671640

Prostate cancer data

The data are from the book The Elements of Statistical Learning.

library(ElemStatLearn)
str(prostate)
## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
prostate$train <- NULL

Ridge regression: Prostate cancer data example

library(MASS)
lm_ridge <- lm.ridge(lpsa ~ ., data=prostate, lambda=0); lm_ridge
##                    lcavol      lweight          age         lbph 
##  0.181560862  0.564341279  0.622019787 -0.021248185  0.096712523 
##          svi          lcp      gleason        pgg45 
##  0.761673403 -0.106050939  0.049227933  0.004457512
lm(lpsa ~ ., data=prostate)
## 
## Call:
## lm(formula = lpsa ~ ., data = prostate)
## 
## Coefficients:
## (Intercept)       lcavol      lweight          age         lbph  
##    0.181561     0.564341     0.622020    -0.021248     0.096713  
##         svi          lcp      gleason        pgg45  
##    0.761673    -0.106051     0.049228     0.004458

Ridge regression: Prostate cancer data example 2

lm.ridge(lpsa ~ ., data=prostate, lambda=10)
##                    lcavol      lweight          age         lbph 
## -0.023822581  0.470383515  0.595477805 -0.015328827  0.082534382 
##          svi          lcp      gleason        pgg45 
##  0.663639476 -0.022092251  0.066864682  0.003190709
lm.ridge(lpsa ~ ., data=prostate, lambda=Inf)
##            lcavol  lweight      age     lbph      svi      lcp  gleason 
## 2.478387 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
##    pgg45 
## 0.000000
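As a side note (my addition, not in the original slides), lm.ridge also accepts a vector of lambda values, which makes it easy to look at the ridge trace and at MASS's built-in tuning suggestions:

library(MASS)
ridgeGrid <- lm.ridge(lpsa ~ ., data=prostate, lambda = seq(0, 50, by = 0.5)) ## a grid of lambda values
plot(ridgeGrid) ## ridge trace: how each coefficient shrinks as lambda grows
select(ridgeGrid) ## HKB, L-W, and GCV suggestions for lambda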

Summary for Ridge regression

lasso
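The lasso replaces the squared \(\ell_2\) penalty of ridge regression with an \(\ell_1\) penalty (standard definition, following the notation above):

\[\hat{\beta}^{lasso} = \arg \min_\beta \frac{1}{2}\| Y - X\beta\|_2^2 + \lambda \|\beta\|_1, \qquad \|\beta\|_1 = |\beta_1| + \ldots + |\beta_p|\]

Unlike ridge regression, the \(\ell_1\) penalty can set some coefficients exactly to zero, so the lasso performs variable selection.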

lasso example

library(lars)
## Loaded lars 1.2
x <- as.matrix(prostate[,1:8])
y <- prostate[,9]
lassoFit <- lars(x, y) ## lars: least angle regression; the default type fits the lasso path
coef(lassoFit, s=2, mode="lambda") ## get beta estimate when lambda = 2
##    lcavol   lweight       age      lbph       svi       lcp   gleason 
## 0.4506712 0.2910916 0.0000000 0.0000000 0.3475427 0.0000000 0.0000000 
##     pgg45 
## 0.0000000

Re-visit AIC and BIC

\(\|\beta\|_0 = k\), where \(k\) is the number of non-zero entries of \(\beta\).
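For reference (standard definitions, not from the original slide), with \(k\) non-zero coefficients and log-likelihood \(\log L\):

\[\text{AIC} = -2\log L + 2k, \qquad \text{BIC} = -2\log L + k\log n\]

Both criteria penalize model size through \(k = \|\beta\|_0\), so best-subset selection can be written as an \(\ell_0\)-penalized problem \(\min_\beta \frac{1}{2}\|Y - X\beta\|_2^2 + \lambda \|\beta\|_0\), which is combinatorially hard; the lasso replaces \(\|\beta\|_0\) with the convex \(\|\beta\|_1\).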

visualize lasso path

plot(lassoFit)

Intuition for lasso and ridge regression

The lasso constraint region \(\{\beta: \|\beta\|_1 \le t\}\) has corners on the coordinate axes, so the solution often lands on a corner and some coefficients are exactly zero; the ridge constraint region \(\{\beta: \|\beta\|_2 \le t\}\) is a smooth ball, so coefficients are shrunk toward zero but rarely exactly zero.

Figures from the book The Elements of Statistical Learning.

How to choose the tuning parameter
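One common approach is cross-validation. A minimal sketch (my addition, using cv.lars from the lars package loaded above, with x, y, and lassoFit from the lasso example):

library(lars)
set.seed(32611)
cvFit <- cv.lars(x, y, K = 10, type = "lasso", mode = "fraction") ## 10-fold CV over the lasso path
bestFraction <- cvFit$index[which.min(cvFit$cv)] ## |beta|_1 fraction minimizing the CV error
coef(lassoFit, s = bestFraction, mode = "fraction") ## coefficients at the selected tuning value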

lasso path

plot(lassoFit)

Elastic net

\[\hat{\beta}^{elastic} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1\]
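A minimal sketch (my addition, assuming the glmnet package is installed, and reusing x and y from the lasso example). Note that glmnet parameterizes the elastic net with a single \(\lambda\) and a mixing proportion \(\alpha\) between the \(\ell_1\) and \(\ell_2\) penalties rather than separate \(\lambda_1\) and \(\lambda_2\):

library(glmnet)
enetFit <- glmnet(x, y, alpha = 0.5) ## alpha = 0.5: equal mix of lasso and ridge penalties
coef(enetFit, s = 0.2) ## coefficient estimates at lambda = 0.2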

Group lasso

\[\min_{\beta=(\beta_{(1)},\dots,\beta_{(G)}) \in \mathbb{R}^p} \frac{1}{2} ||y-X\beta||_2^2 + \lambda \sum_{i=1}^G \sqrt{p_{(i)}} ||\beta_{(i)}||_2,\]
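A minimal sketch (my addition, assuming the grpreg package is installed; the grouping of the 8 prostate predictors below is artificial, purely to illustrate the interface):

library(grpreg)
group <- c(1, 1, 2, 2, 3, 3, 4, 4) ## pretend the 8 predictors form 4 groups
grpFit <- grpreg(x, y, group = group, penalty = "grLasso")
coef(grpFit, lambda = 0.1) ## coefficients within a group enter or leave the model together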

Fused lasso

\[\min_{\beta \in \mathbb{R}^p} \frac{1}{2} || y - \beta ||_2^2 + \lambda \sum_{i=1}^{p-1} |\beta_i - \beta_{i+1}|\]
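A minimal sketch (my addition, assuming the genlasso package is installed) of the 1d fused lasso applied to a noisy piecewise-constant signal:

library(genlasso)
set.seed(32611)
signal <- rep(c(0, 2, 1), each = 20) ## piecewise-constant truth
yNoisy <- signal + rnorm(length(signal), sd = 0.5)
flFit <- fusedlasso1d(yNoisy) ## solves the 1d fused lasso over a path of lambda values
betaHat <- coef(flFit, lambda = 1)$beta ## piecewise-constant estimate at lambda = 1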

Generalized lasso

Consider a general setting \[\min_{\beta \in \mathbb{R}^p} f(\beta) + \lambda ||D\beta||_1\] where \(f: \mathbb{R}^p \rightarrow \mathbb{R}\) is a smooth convex function and \(D \in \mathbb{R}^{m\times p}\) is a penalty matrix. When \(D=I\), the formulation reduces to the lasso regression problem.

When \[D= \left( \begin{array}{cccccc} -1 & 1 & 0 & \ldots & 0 & 0 \\ 0 & -1 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & -1 & 1 \end{array} \right),\] the penalty becomes the fused lasso penalty.
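For illustration (my addition), this first-difference penalty matrix \(D\) can be built in R with diff():

p <- 5
D <- diff(diag(p)) ## each row is (..., -1, 1, ...), so ||D beta||_1 = sum_i |beta_{i+1} - beta_i|
D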

Last slide

knitr::purl("lasso.Rmd", output = "lasso.R ", documentation = 2)
## 
## 
## processing file: lasso.Rmd
## output file: lasso.R
## [1] "lasso.R "