Zhiguang Huo (Caleb)
Wednesday October 20th, 2021
\[Y = X\beta + \varepsilon\]
Least squares estimator (LS): \[\hat{\beta} = \arg \min_\beta \frac{1}{2}\| Y - X\beta\|_2^2\]
Maximum likelihood estimator (MLE): \[\hat{\beta} = \arg \max_\beta L(\beta; X, Y)\]
For the linear regression model, the LS estimator coincides with the MLE: \[\hat{\beta} = (X^\top X)^{-1} X^\top Y,\] assuming \(X^\top X\) is invertible.
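The simulation below uses predictors x1, x2, x3 whose generating code is not shown; here is a hedged reconstruction consistent with the collinearity diagnostics that follow (the seed and sd values are assumptions):

set.seed(32608)                  # arbitrary seed (assumption)
n <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, sd = 0.1)    # x3 is nearly collinear with x2
cor(x2, x3)                      # presumably the correlation printed below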
## [1] 0.9957515

x <- data.frame(x1,x2,x3)
y <- 2*x1 + 3*x2 + 4*x3 + rnorm(n, sd = 3)
xyData <- cbind(y,x)
lmFit <- lm(y~x1 + x2 + x3, data=xyData)
summary(lmFit)

## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = xyData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2838  -1.7470   0.0118   1.8627   6.9231 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.5342     1.7695  -0.867   0.3881    
## x1            2.2746     0.3116   7.299 8.39e-11 ***
## x2            5.2712     3.1574   1.669   0.0983 .  
## x3            1.8343     3.1201   0.588   0.5580    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.9 on 96 degrees of freedom
## Multiple R-squared:  0.8679, Adjusted R-squared:  0.8638 
## F-statistic: 210.3 on 3 and 96 DF,  p-value: < 2.2e-16

## Loading required package: carData

##         x1         x2         x3 
##   1.014903 118.852908 119.013351
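The variance inflation factors above and the reduced model summary that follows were presumably produced along these lines (a hedged sketch; the original code is not shown):

library(car)                               # provides vif(); loading it pulls in carData
vif(lmFit)                                 # VIF near 119 for x2 and x3 flags severe collinearity
lmFit2 <- lm(y ~ x1 + x2, data = xyData)   # drop x3, which is nearly collinear with x2
summary(lmFit2)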
## Call:
## lm(formula = y ~ x1 + x2, data = xyData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3139  -1.7410  -0.1042   1.8065   7.0280 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.5612     1.7629  -0.886    0.378    
## x1            2.2572     0.3092   7.301 7.97e-11 ***
## x2            7.1196     0.2895  24.595  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.891 on 97 degrees of freedom
## Multiple R-squared:  0.8674, Adjusted R-squared:  0.8647 
## F-statistic: 317.4 on 2 and 97 DF,  p-value: < 2.2e-16

Regularization methods provide better solutions to this collinearity problem.
\[\hat{\beta} = \arg \min_\beta \frac{1}{2}\| Y - X\beta\|_2^2 + \lambda \| \beta\|^2_2\]

- \(\|a\|_2 = \sqrt{a_1^2 + \ldots + a_p^2}\).
\[\hat{\beta} = \arg \min_\beta \frac{1}{2}\| Y - X\beta\|_2^2 + \lambda \| \beta\|^2_2\]
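Like the LS estimator, the ridge estimator has a closed-form solution under this parameterization: \[\hat{\beta}^{ridge} = (X^\top X + 2\lambda I)^{-1} X^\top Y.\] The shrunken estimates printed below can be obtained with, e.g., MASS::lm.ridge; a hedged sketch (the \(\lambda\) that produced the printed values is not shown, so \(\lambda = 2\) here is an assumption):

library(MASS)
ridgeFit <- lm.ridge(y ~ x1 + x2 + x3, data = xyData, lambda = 2)
coef(ridgeFit)   # intercept plus shrunken slope estimates; exact values depend on lambda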
## (Intercept)          x1          x2          x3 
##   0.9529628   2.0535563   3.4643002   3.2671640

The data is from the book The Elements of Statistical Learning.
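The prostate cancer data analyzed below can be loaded from, e.g., the ElemStatLearn companion package (an assumption about the source; any copy of the dataset works):

library(ElemStatLearn)   # companion data package for the book
data(prostate)
str(prostate)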
## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...

##  (Intercept)       lcavol      lweight          age         lbph          svi 
##  0.181560862  0.564341279  0.622019787 -0.021248185  0.096712523  0.761673403 
##          lcp      gleason        pgg45 
## -0.106050939  0.049227933  0.004457512

## 
## Call:
## lm(formula = lpsa ~ ., data = prostate)
## 
## Coefficients:
## (Intercept)       lcavol      lweight          age         lbph          svi  
##    0.181561     0.564341     0.622020    -0.021248     0.096713     0.761673  
##         lcp      gleason        pgg45  
##   -0.106051     0.049228     0.004458

##  (Intercept)       lcavol      lweight          age         lbph          svi 
## -0.023822581  0.470383515  0.595477805 -0.015328827  0.082534382  0.663639476 
##          lcp      gleason        pgg45 
## -0.022092251  0.066864682  0.003190709

## (Intercept)   lcavol  lweight      age     lbph      svi      lcp  gleason 
## 2.478387 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 
##    pgg45 
## 0.000000

library(lars)
## Loaded lars 1.2

x <- as.matrix(prostate[,1:8])
y <- prostate[,9]
lassoFit <- lars(x, y) ## lar for least angle regression
coef(lassoFit, s=2, mode="lambda") ## get beta estimate when lambda = 2

##    lcavol   lweight       age      lbph       svi       lcp   gleason     pgg45 
## 0.4506712 0.2910916 0.0000000 0.0000000 0.3475427 0.0000000 0.0000000 0.0000000

library(lars)
x <- as.matrix(prostate[,1:8])
y <- prostate[,9]
lassoFit <- lars(x, y) ## lar for least angle regression
coef(lassoFit, s=0.5, mode="fraction") ## get beta estimate when the L1 bound mu = 0.5*||beta_LS||_1

##       lcavol      lweight          age         lbph          svi          lcp 
## 0.4746756454 0.4143707391 0.0000000000 0.0000000000 0.4499667198 0.0000000000 
##      gleason        pgg45 
## 0.0000000000 0.0001502949

\[\hat{\beta}^{IC} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_0\]
From the book The Elements of Statistical Learning.
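A standard way to compare these penalties (a worked special case, not from the original notes): assume the columns of \(X\) are orthonormal, so \(X^\top X = I\) and \(\hat{\beta}^{LS} = X^\top y\). Then each penalized problem decouples coordinate-wise, and with the penalties \(\lambda\|\beta\|_2^2\), \(\lambda\|\beta\|_1\), and \(\lambda\|\beta\|_0\),

\[\hat{\beta}^{ridge}_j = \frac{\hat{\beta}^{LS}_j}{1 + 2\lambda}, \qquad \hat{\beta}^{lasso}_j = \mathrm{sign}(\hat{\beta}^{LS}_j)\big(|\hat{\beta}^{LS}_j| - \lambda\big)_+, \qquad \hat{\beta}^{IC}_j = \hat{\beta}^{LS}_j \, 1\big(|\hat{\beta}^{LS}_j| > \sqrt{2\lambda}\big).\]

Ridge shrinks every coefficient proportionally, the lasso soft-thresholds, and the \(\ell_0\) penalty hard-thresholds.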
\[\hat{\beta}^{elastic} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1\]
\[\hat{\beta}^{elastic} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y - X\beta\|_2^2 + \lambda ( (1 - \alpha ) \|\beta\|_2^2 + \alpha \|\beta\|_1)\]
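The solution paths plotted below are presumably produced by glmnet fits along these lines (a hedged sketch; the \(\alpha\) values are assumptions, and glmnet scales its penalty slightly differently from the formulas above):

library(glmnet)
# x, y: the prostate predictors and lpsa response constructed earlier
fit.lasso <- glmnet(x, y, alpha = 1)    # lasso
fit.ridge <- glmnet(x, y, alpha = 0)    # ridge
fit.elnet <- glmnet(x, y, alpha = 0.5)  # elastic net (equal mix)

In practice \(\lambda\) is usually chosen by cross-validation, e.g. cv.glmnet(x, y, alpha = 1).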
## Loading required package: Matrix
## Loaded glmnet 4.1

par(mfrow=c(2,2))
# For plotting options, type '?plot.glmnet' in R console
plot(fit.lasso, xvar="lambda", main = "LASSO")
plot(fit.ridge, xvar="lambda", main = "Ridge")
plot(fit.elnet, xvar="lambda", main = "Elastic Net")

\[\min_{\beta=(\beta_{(1)},\dots,\beta_{(G)}) \in \mathbb{R}^p} \frac{1}{2} ||y-X\beta||_2^2 + \lambda \sum_{i=1}^G \sqrt{p_{(i)}} ||\beta_{(i)}||_2,\]
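This is the group lasso, where the coefficients are partitioned into \(G\) groups, \(p_{(i)}\) is the size of group \(i\), and whole groups are selected in or out together. A hedged sketch using the grpreg package (the grouping of the eight prostate predictors is purely illustrative):

library(grpreg)
group <- rep(1:4, each = 2)                          # hypothetical grouping of the 8 predictors
grpFit <- grpreg(x, y, group = group, penalty = "grLasso")
coef(grpFit, lambda = 0.05)                          # group lasso coefficients at one lambda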
\[\min_{\beta \in \mathbb{R}^p} \frac{1}{2} || y - \beta ||_2^2 + \lambda \sum_{i=1}^{p-1} |\beta_i - \beta_{i+1}|\]
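This is the 1D fused lasso (fused lasso signal approximator), which penalizes differences between neighboring coefficients and so favors piecewise-constant fits. A hedged sketch using the genlasso package (the signal below is a made-up example):

library(genlasso)
set.seed(32608)
theta <- rep(c(0, 3, 1), each = 20)       # true piecewise-constant signal
yNoisy <- theta + rnorm(60, sd = 0.5)
flFit <- fusedlasso1d(yNoisy)             # path of solutions over lambda
betaHat <- coef(flFit, lambda = 1)$beta   # fitted piecewise-constant estimate at lambda = 1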
Consider a general setting \[\min_{\beta \in \mathbb{R}^p} f(\beta) + \lambda ||D\beta||_1\] where \(f: \mathbb{R}^p \rightarrow \mathbb{R}\) is a smooth convex function and \(D \in \mathbb{R}^{m\times p}\) is a penalty matrix.
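For example, the 1D fused lasso above is the special case \(f(\beta) = \frac{1}{2}\|y - \beta\|_2^2\) with \(D\) the first-difference matrix

\[D = \begin{pmatrix} 1 & -1 & & & \\ & 1 & -1 & & \\ & & \ddots & \ddots & \\ & & & 1 & -1 \end{pmatrix} \in \mathbb{R}^{(p-1)\times p},\]

so that \(\|D\beta\|_1 = \sum_{i=1}^{p-1} |\beta_i - \beta_{i+1}|\).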