Zhiguang Huo (Caleb)
Wednesday October 4th, 2017
Linear regression model: \[Y = X\beta + \varepsilon\]
Maximum likelihood estimator (MLE): \[\hat{\beta} = \arg \max_\beta L(\beta; X, Y)\]
For the linear regression model, the least squares (LS) estimator is the same as the MLE \[\hat{\beta} = (X^\top X)^{-1} X^\top Y,\] assuming \(X^\top X\) is invertible.
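As a quick sanity check of this closed-form solution, the normal equations can be evaluated directly in R; the sketch below uses freshly simulated toy data (all object names are illustrative only).

set.seed(32611)
nToy <- 50
XDesign <- cbind(1, rnorm(nToy), rnorm(nToy))              # design matrix with an intercept column
yToy <- as.numeric(XDesign %*% c(1, 2, -1) + rnorm(nToy))  # true beta = (1, 2, -1)
solve(t(XDesign) %*% XDesign) %*% t(XDesign) %*% yToy      # (X^T X)^{-1} X^T Y
coef(lm(yToy ~ XDesign[, 2] + XDesign[, 3]))               # matches the closed-form estimate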
When \(n>p\), i.e., when the number of subjects is larger than the number of features (variables), the linear model works fine.
n <- 100
set.seed(32611)
x1 <- rnorm(n,3)
x2 <- rnorm(n,5)
x3 <- x2 + rnorm(n,sd=0.1) # x3 is nearly collinear with x2
cor(x2, x3)
## [1] 0.9957515
x <- data.frame(x1,x2,x3)
y <- 2*x1 + 3*x2 + 4*x3 + rnorm(n, sd = 3)
xyData <- cbind(y,x)
lmFit <- lm(y~x1 + x2 + x3, data=xyData)
summary(lmFit)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = xyData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2838 -1.7470 0.0118 1.8627 6.9231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5342 1.7695 -0.867 0.3881
## x1 2.2746 0.3116 7.299 8.39e-11 ***
## x2 5.2712 3.1574 1.669 0.0983 .
## x3 1.8343 3.1201 0.588 0.5580
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.9 on 96 degrees of freedom
## Multiple R-squared: 0.8679, Adjusted R-squared: 0.8638
## F-statistic: 210.3 on 3 and 96 DF, p-value: < 2.2e-16
library(car)
vif(lmFit)
## x1 x2 x3
## 1.014903 118.852908 119.013351
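The large values for x2 and x3 reflect their near-perfect correlation. As a check, the VIF of x3 can be computed by hand as \(1/(1-R^2)\), where \(R^2\) comes from regressing x3 on the other predictors; a minimal sketch:

r2_x3 <- summary(lm(x3 ~ x1 + x2, data = xyData))$r.squared
1 / (1 - r2_x3)   # approximately reproduces vif()'s value for x3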
summary(lm(y~x1 + x2, data=xyData))
##
## Call:
## lm(formula = y ~ x1 + x2, data = xyData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3139 -1.7410 -0.1042 1.8065 7.0280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5612 1.7629 -0.886 0.378
## x1 2.2572 0.3092 7.301 7.97e-11 ***
## x2 7.1196 0.2895 24.595 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.891 on 97 degrees of freedom
## Multiple R-squared: 0.8674, Adjusted R-squared: 0.8647
## F-statistic: 317.4 on 2 and 97 DF, p-value: < 2.2e-16
In the past several decades, regularization methods have provided better solutions to this problem (e.g., the collinearity seen above).
Ridge regression: \[\hat{\beta} = \arg \min_\beta \frac{1}{2}\| Y - X\beta\|_2^2 + \lambda \| \beta\|^2_2\]

- \(\|a\|_2 = \sqrt{a_1^2 + \ldots + a_p^2}\).
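Setting the gradient of this objective to zero gives a closed-form solution; since \(X^\top X + 2\lambda I\) is positive definite for any \(\lambda > 0\), no invertibility assumption on \(X^\top X\) is needed: \[\hat{\beta}^{ridge} = (X^\top X + 2\lambda I)^{-1} X^\top Y,\] commonly written as \((X^\top X + \lambda I)^{-1} X^\top Y\) after absorbing the factor of 2 into \(\lambda\).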
library(MASS)
lm.ridge(y~x1 + x2 + x3, data=xyData, lambda = 10)
## x1 x2 x3
## 0.9529628 2.0535563 3.4643002 3.2671640
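An alternative sketch of the same fit using the glmnet package (assuming it is installed): alpha = 0 gives a pure ridge penalty. The coefficients will not match lm.ridge exactly because glmnet standardizes the predictors and scales the objective differently.

library(glmnet)
xMat <- as.matrix(xyData[, c("x1", "x2", "x3")])
ridgeGlmnet <- glmnet(xMat, xyData$y, alpha = 0, lambda = 10)
coef(ridgeGlmnet)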
The data come from the book The Elements of Statistical Learning.
library(ElemStatLearn)
str(prostate)
## 'data.frame': 97 obs. of 10 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
prostate$train <- NULL
library(MASS)
lm_ridge <- lm.ridge(lpsa ~ ., data=prostate, lambda=0); lm_ridge
## lcavol lweight age lbph
## 0.181560862 0.564341279 0.622019787 -0.021248185 0.096712523
## svi lcp gleason pgg45
## 0.761673403 -0.106050939 0.049227933 0.004457512
lm(lpsa ~ ., data=prostate)
##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph
## 0.181561 0.564341 0.622020 -0.021248 0.096713
## svi lcp gleason pgg45
## 0.761673 -0.106051 0.049228 0.004458
lm.ridge(lpsa ~ ., data=prostate, lambda=10)
## lcavol lweight age lbph
## -0.023822581 0.470383515 0.595477805 -0.015328827 0.082534382
## svi lcp gleason pgg45
## 0.663639476 -0.022092251 0.066864682 0.003190709
lm.ridge(lpsa ~ ., data=prostate, lambda=Inf)
## lcavol lweight age lbph svi lcp gleason
## 2.478387 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## pgg45
## 0.000000
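In practice \(\lambda\) is chosen from the data rather than fixed in advance; a minimal sketch using the generalized cross-validation (GCV) values that lm.ridge returns over a grid of \(\lambda\):

library(MASS)
lambdaGrid <- seq(0, 50, by = 0.1)
ridgePath <- lm.ridge(lpsa ~ ., data = prostate, lambda = lambdaGrid)
select(ridgePath)                            # reports, among others, the GCV-minimizing lambda
ridgePath$lambda[which.min(ridgePath$GCV)]   # the same lambda, extracted by hand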
library(lars)
## Loaded lars 1.2
x <- as.matrix(prostate[,1:8])
y <- prostate[,9]
lassoFit <- lars(x, y) ## lar for least angle regression
coef(lassoFit, s=2, mode="lambda") ## get beta estimate when lambda = 2
## lcavol lweight age lbph svi lcp gleason
## 0.4506712 0.2910916 0.0000000 0.0000000 0.3475427 0.0000000 0.0000000
## pgg45
## 0.0000000
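The amount of shrinkage can also be chosen by cross-validation; a hedged sketch using cv.lars from the same package, indexing the solution path by the fraction of the final \(L_1\) norm rather than by \(\lambda\):

set.seed(32611)
cvFit <- cv.lars(x, y, K = 10, mode = "fraction", plot.it = FALSE)
sBest <- cvFit$index[which.min(cvFit$cv)]      # fraction minimizing the CV error
coef(lassoFit, s = sBest, mode = "fraction")   # lasso coefficients at that fraction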
\(\|\beta\|_0 = k\), where \(k\) is the number of non-zero entries of \(\beta\); for example, \(\beta = (0, 2, 0, -1)^\top\) has \(\|\beta\|_0 = 2\).
plot(lassoFit)
\[\hat{\beta}^{elastic} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1\]
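The elastic net is implemented in the glmnet package (assuming it is installed), which parameterizes the penalty with a single \(\lambda\) and a mixing weight alpha in \([0,1]\) (alpha = 1 is the lasso, alpha = 0 is ridge) instead of separate \(\lambda_1\) and \(\lambda_2\); a minimal sketch on the prostate matrices x and y defined above:

library(glmnet)
enetFit <- glmnet(x, y, alpha = 0.5, lambda = 0.5)
coef(enetFit)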
Group lasso (the features are partitioned into \(G\) groups \(\beta_{(1)}, \dots, \beta_{(G)}\), with \(p_{(i)}\) features in group \(i\)): \[\min_{\beta=(\beta_{(1)},\dots,\beta_{(G)}) \in \mathbb{R}^p} \frac{1}{2} ||y-X\beta||_2^2 + \lambda \sum_{i=1}^G \sqrt{p_{(i)}} ||\beta_{(i)}||_2\]
Fused lasso (signal approximation form, penalizing differences between neighboring coefficients): \[\min_{\beta \in \mathbb{R}^p} \frac{1}{2} || y - \beta ||_2^2 + \lambda \sum_{i=1}^{p-1} |\beta_i - \beta_{i+1}|\]
Consider a general setting \[\min_{\beta \in \mathbb{R}^p} f(\beta) + \lambda ||D\beta||_1\] where \(f: \mathbb{R}^p \rightarrow \mathbb{R}\) is a smooth convex function and \(D \in \mathbb{R}^{m\times p}\) is a penalty matrix. When \(D=I\), the formulation reduces to the lasso regression problem.
When \[D= \left( \begin{array}{cccccc} -1 & 1 & 0 & \ldots & 0 & 0 \\ 0 & -1 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & -1 & 1 \end{array} \right),\] the penalty becomes the fused lasso penalty.
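A small base-R sketch of this first-difference matrix for \(p = 5\), checking that \(\|D\beta\|_1\) equals the fused lasso penalty term for a toy coefficient vector:

pDim <- 5
D <- matrix(0, nrow = pDim - 1, ncol = pDim)
for (i in 1:(pDim - 1)) {
  D[i, i] <- -1       # -1 on the diagonal
  D[i, i + 1] <- 1    # +1 on the superdiagonal
}
betaToy <- c(1, 1, 3, 3, 2)
sum(abs(D %*% betaToy))     # ||D beta||_1
sum(abs(diff(betaToy)))     # sum over i of |beta_i - beta_{i+1}|, the same value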
knitr::purl("lasso.Rmd", output = "lasso.R ", documentation = 2)
##
##
## processing file: lasso.Rmd
##
## output file: lasso.R
## [1] "lasso.R "