Zhiguang Huo (Caleb)
Wednesday October 4th, 2017
Linear regression model: \[Y = X\beta + \varepsilon\]
Maximum likelihood estimator (MLE): \[\hat{\beta} = \arg \max_\beta L(\beta; X, Y)\]
For the linear regression model, the least squares (LS) estimator is the same as the MLE \[\hat{\beta} = (X^\top X)^{-1} X^\top Y,\] assuming \(X^\top X\) is invertible.
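As a quick sanity check of this closed-form solution, the normal equations can be evaluated directly in R; the sketch below uses freshly simulated toy data (all object names are illustrative only).

set.seed(32611)
nToy <- 50
XDesign <- cbind(1, rnorm(nToy), rnorm(nToy))              # design matrix with an intercept column
yToy <- as.numeric(XDesign %*% c(1, 2, -1) + rnorm(nToy))  # true beta = (1, 2, -1)
solve(t(XDesign) %*% XDesign) %*% t(XDesign) %*% yToy      # (X^T X)^{-1} X^T Y
coef(lm(yToy ~ XDesign[, 2] + XDesign[, 3]))               # matches the closed-form estimate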
When \(n>p\), i.e., when the number of subjects is larger than the number of features (variables), the linear model works fine.
n <- 100
set.seed(32611)
x1 <- rnorm(n,3)
x2 <- rnorm(n,5)
x3 <- x2 + rnorm(n,sd=0.1) # x3 is nearly collinear with x2
cor(x2, x3)
## [1] 0.9957515
x <- data.frame(x1,x2,x3)
y <- 2*x1 + 3*x2 + 4*x3 + rnorm(n, sd = 3)
xyData <- cbind(y,x)
lmFit <- lm(y~x1 + x2 + x3, data=xyData)
summary(lmFit)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = xyData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2838 -1.7470 0.0118 1.8627 6.9231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5342 1.7695 -0.867 0.3881
## x1 2.2746 0.3116 7.299 8.39e-11 ***
## x2 5.2712 3.1574 1.669 0.0983 .
## x3 1.8343 3.1201 0.588 0.5580
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.9 on 96 degrees of freedom
## Multiple R-squared: 0.8679, Adjusted R-squared: 0.8638
## F-statistic: 210.3 on 3 and 96 DF, p-value: < 2.2e-16
library(car)
vif(lmFit)
## x1 x2 x3
## 1.014903 118.852908 119.013351
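The large values for x2 and x3 reflect their near-perfect correlation. As a check, the VIF of x3 can be computed by hand as \(1/(1-R^2)\), where \(R^2\) comes from regressing x3 on the other predictors; a minimal sketch:

r2_x3 <- summary(lm(x3 ~ x1 + x2, data = xyData))$r.squared
1 / (1 - r2_x3)   # approximately reproduces vif()'s value for x3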
summary(lm(y~x1 + x2, data=xyData))
##
## Call:
## lm(formula = y ~ x1 + x2, data = xyData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3139 -1.7410 -0.1042 1.8065 7.0280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5612 1.7629 -0.886 0.378
## x1 2.2572 0.3092 7.301 7.97e-11 ***
## x2 7.1196 0.2895 24.595 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.891 on 97 degrees of freedom
## Multiple R-squared: 0.8674, Adjusted R-squared: 0.8647
## F-statistic: 317.4 on 2 and 97 DF, p-value: < 2.2e-16
In the past several decades, regularization methods have provided better solutions to this problem (e.g., the collinearity seen above).
Ridge regression: \[\hat{\beta} = \arg \min_\beta \frac{1}{2}\| Y - X\beta\|_2^2 + \lambda \| \beta\|^2_2\]

- \(\|a\|_2 = \sqrt{a_1^2 + \ldots + a_p^2}\).
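Setting the gradient of this objective to zero gives a closed-form solution; since \(X^\top X + 2\lambda I\) is positive definite for any \(\lambda > 0\), no invertibility assumption on \(X^\top X\) is needed: \[\hat{\beta}^{ridge} = (X^\top X + 2\lambda I)^{-1} X^\top Y,\] commonly written as \((X^\top X + \lambda I)^{-1} X^\top Y\) after absorbing the factor of 2 into \(\lambda\).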
library(MASS)
lm.ridge(y~x1 + x2 + x3, data=xyData, lambda = 10)
## x1 x2 x3
## 0.9529628 2.0535563 3.4643002 3.2671640
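An alternative sketch of the same fit using the glmnet package (assuming it is installed): alpha = 0 gives a pure ridge penalty. The coefficients will not match lm.ridge exactly because glmnet standardizes the predictors and scales the objective differently.

library(glmnet)
xMat <- as.matrix(xyData[, c("x1", "x2", "x3")])
ridgeGlmnet <- glmnet(xMat, xyData$y, alpha = 0, lambda = 10)
coef(ridgeGlmnet)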
The data come from the book The Elements of Statistical Learning.
library(ElemStatLearn)
str(prostate)
## 'data.frame': 97 obs. of 10 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
prostate$train <- NULL
library(MASS)
lm_ridge <- lm.ridge(lpsa ~ ., data=prostate, lambda=0); lm_ridge
## lcavol lweight age lbph
## 0.181560862 0.564341279 0.622019787 -0.021248185 0.096712523
## svi lcp gleason pgg45
## 0.761673403 -0.106050939 0.049227933 0.004457512
lm(lpsa ~ ., data=prostate)
##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph
## 0.181561 0.564341 0.622020 -0.021248 0.096713
## svi lcp gleason pgg45
## 0.761673 -0.106051 0.049228 0.004458
lm.ridge(lpsa ~ ., data=prostate, lambda=10)
## lcavol lweight age lbph
## -0.023822581 0.470383515 0.595477805 -0.015328827 0.082534382
## svi lcp gleason pgg45
## 0.663639476 -0.022092251 0.066864682 0.003190709
lm.ridge(lpsa ~ ., data=prostate, lambda=Inf)
## lcavol lweight age lbph svi lcp gleason
## 2.478387 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## pgg45
## 0.000000
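In practice \(\lambda\) is chosen from the data rather than fixed in advance; a minimal sketch using the generalized cross-validation (GCV) values that lm.ridge returns over a grid of \(\lambda\):

library(MASS)
lambdaGrid <- seq(0, 50, by = 0.1)
ridgePath <- lm.ridge(lpsa ~ ., data = prostate, lambda = lambdaGrid)
select(ridgePath)                            # reports, among others, the GCV-minimizing lambda
ridgePath$lambda[which.min(ridgePath$GCV)]   # the same lambda, extracted by hand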
library(lars)
## Loaded lars 1.2
x <- as.matrix(prostate[,1:8])
y <- prostate[,9]
lassoFit <- lars(x, y) ## lar for least angle regression
coef(lassoFit, s=2, mode="lambda") ## get beta estimate when lambda = 2
## lcavol lweight age lbph svi lcp gleason
## 0.4506712 0.2910916 0.0000000 0.0000000 0.3475427 0.0000000 0.0000000
## pgg45
## 0.0000000
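The amount of shrinkage can also be chosen by cross-validation; a hedged sketch using cv.lars from the same package, indexing the solution path by the fraction of the final \(L_1\) norm rather than by \(\lambda\):

set.seed(32611)
cvFit <- cv.lars(x, y, K = 10, mode = "fraction", plot.it = FALSE)
sBest <- cvFit$index[which.min(cvFit$cv)]      # fraction minimizing the CV error
coef(lassoFit, s = sBest, mode = "fraction")   # lasso coefficients at that fraction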
\(\|\beta\|_0 = k\), where \(k\) is the number of non-zero entries of \(\beta\); for example, \(\beta = (0, 2, 0, -1)^\top\) has \(\|\beta\|_0 = 2\).
plot(lassoFit)
\[\hat{\beta}^{elastic} = \arg\min_{\beta \in \mathbb{R}^p} \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1\]
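The elastic net is implemented in the glmnet package (assuming it is installed), which parameterizes the penalty with a single \(\lambda\) and a mixing weight alpha in \([0,1]\) (alpha = 1 is the lasso, alpha = 0 is ridge) instead of separate \(\lambda_1\) and \(\lambda_2\); a minimal sketch on the prostate matrices x and y defined above:

library(glmnet)
enetFit <- glmnet(x, y, alpha = 0.5, lambda = 0.5)
coef(enetFit)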
Group lasso (the features are partitioned into \(G\) groups \(\beta_{(1)}, \dots, \beta_{(G)}\), with \(p_{(i)}\) features in group \(i\)): \[\min_{\beta=(\beta_{(1)},\dots,\beta_{(G)}) \in \mathbb{R}^p} \frac{1}{2} ||y-X\beta||_2^2 + \lambda \sum_{i=1}^G \sqrt{p_{(i)}} ||\beta_{(i)}||_2\]
Fused lasso (signal approximation form, penalizing differences between neighboring coefficients): \[\min_{\beta \in \mathbb{R}^p} \frac{1}{2} || y - \beta ||_2^2 + \lambda \sum_{i=1}^{p-1} |\beta_i - \beta_{i+1}|\]
Consider a general setting \[\min_{\beta \in \mathbb{R}^p} f(\beta) + \lambda ||D\beta||_1\] where \(f: \mathbb{R}^p \rightarrow \mathbb{R}\) is a smooth convex function and \(D \in \mathbb{R}^{m\times p}\) is a penalty matrix. When \(D=I\), the formulation reduces to the lasso regression problem.
When \[D= \left( \begin{array}{cccccc} -1 & 1 & 0 & \ldots & 0 & 0 \\ 0 & -1 & 1 & \ldots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \ldots & -1 & 1 \end{array} \right),\] the penalty becomes the fused lasso penalty.
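A small base-R sketch of this first-difference matrix for \(p = 5\), checking that \(\|D\beta\|_1\) equals the fused lasso penalty term for a toy coefficient vector:

pDim <- 5
D <- matrix(0, nrow = pDim - 1, ncol = pDim)
for (i in 1:(pDim - 1)) {
  D[i, i] <- -1       # -1 on the diagonal
  D[i, i + 1] <- 1    # +1 on the superdiagonal
}
betaToy <- c(1, 1, 3, 3, 2)
sum(abs(D %*% betaToy))     # ||D beta||_1
sum(abs(diff(betaToy)))     # sum over i of |beta_i - beta_{i+1}|, the same value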
knitr::purl("lasso.Rmd", output = "lasso.R ", documentation = 2)
##
##
## processing file: lasso.Rmd
##
## output file: lasso.R
## [1] "lasso.R "