Zhiguang Huo (Caleb)
Wednesday October 31st, 2018
\[Y = X\beta + \varepsilon\]
Maximum likelihood estimator (MLE): \[\hat{\beta} = \arg \max_\beta L(\beta; X, Y)\]
For the linear regression model, the LS estimator is the same as the MLE \[\hat{\beta} = (X^\top X)^{-1} X^\top Y\] assuming \(X^\top X\) is invertible.
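The output below (a correlation very close to 1) shows that two of the simulated predictors are nearly collinear. The chunk that generated x1, x2, and x3 is not shown above; a minimal sketch of how such data might be simulated (the seed, sample size, and noise levels are assumptions, with n = 100 chosen to match the 96 residual degrees of freedom reported below):

set.seed(32608) ## assumed seed
n <- 100 ## assumed sample size (96 residual df + 4 estimated coefficients)
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x2 + rnorm(n, sd = 0.1) ## x3 is a noisy copy of x2, so the two are nearly collinear
cor(x2, x3)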
## [1] 0.9957515
x <- data.frame(x1, x2, x3)
y <- 2*x1 + 3*x2 + 4*x3 + rnorm(n, sd = 3) ## true coefficients are 2, 3, 4
xyData <- cbind(y, x)
lmFit <- lm(y ~ x1 + x2 + x3, data = xyData) ## OLS fit with all three predictors
summary(lmFit)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = xyData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.2838 -1.7470 0.0118 1.8627 6.9231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5342 1.7695 -0.867 0.3881
## x1 2.2746 0.3116 7.299 8.39e-11 ***
## x2 5.2712 3.1574 1.669 0.0983 .
## x3 1.8343 3.1201 0.588 0.5580
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.9 on 96 degrees of freedom
## Multiple R-squared: 0.8679, Adjusted R-squared: 0.8638
## F-statistic: 210.3 on 3 and 96 DF, p-value: < 2.2e-16
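The loading message and output below report variance inflation factors for lmFit; a minimal sketch, assuming the vif() function from the car package was used:

library(car) ## assumption: VIFs computed with car::vif
vif(lmFit) ## VIFs near 120 for x2 and x3 flag severe collinearity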
## Loading required package: carData
## x1 x2 x3
## 1.014903 118.852908 119.013351
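Given the collinearity, the summary below refits the model without x3; a minimal sketch of the call that produces it (the object name lmFit2 is an assumption):

lmFit2 <- lm(y ~ x1 + x2, data = xyData) ## drop x3, one of the two nearly collinear predictors
summary(lmFit2)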
##
## Call:
## lm(formula = y ~ x1 + x2, data = xyData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3139 -1.7410 -0.1042 1.8065 7.0280
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.5612 1.7629 -0.886 0.378
## x1 2.2572 0.3092 7.301 7.97e-11 ***
## x2 7.1196 0.2895 24.595 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.891 on 97 degrees of freedom
## Multiple R-squared: 0.8674, Adjusted R-squared: 0.8647
## F-statistic: 317.4 on 2 and 97 DF, p-value: < 2.2e-16
Over the past several decades, regularization methods have provided better solutions to this problem.
\[\hat{\beta} = \arg \min_\beta \frac{1}{2}\| Y - X\beta\|_2^2 + \lambda \| \beta\|^2_2\]
- \(\|a\|_2 = \sqrt{a_1^2 + \ldots + a_p^2}\).
\[\hat{\beta} = \arg \min_\beta \frac{1}{2}\| Y - X\beta\|_2^2 + \lambda \| \beta\|^2_2\]
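The coefficients below presumably come from a ridge fit of the same simulated data; note how the estimates for the collinear x2 and x3 are pulled toward each other compared with the OLS fit above. A minimal sketch using MASS::lm.ridge, which solves the penalized problem in closed form, \(\hat{\beta}^{ridge} = (X^\top X + \lambda I)^{-1} X^\top Y\), after internally centering and scaling the predictors (the lambda value is an assumption):

library(MASS)
ridgeFit <- lm.ridge(y ~ x1 + x2 + x3, data = xyData, lambda = 5) ## lambda = 5 is an assumed value
coef(ridgeFit) ## first entry is the intercept, followed by x1, x2, x3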
## (Intercept) x1 x2 x3
## 0.9529628 2.0535563 3.4643002 3.2671640
The data come from the book The Elements of Statistical Learning.
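The structure output below matches the prostate dataset that ships with the ElemStatLearn R package; a minimal sketch of loading and inspecting it (the package source is an assumption):

library(ElemStatLearn) ## assumption: the prostate data come from the ElemStatLearn package
data(prostate)
str(prostate)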
## 'data.frame': 97 obs. of 10 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
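The coefficient vector and model print below come from an ordinary least squares fit of lpsa on the eight predictors; a minimal sketch (dropping the train indicator first is an assumption, made because train does not appear among the fitted coefficients):

prostate <- subset(prostate, select = -train) ## assumption: keep lpsa and the 8 predictors only
lmFit <- lm(lpsa ~ ., data = prostate) ## OLS: regress lpsa on all remaining predictors
coef(lmFit)
lmFit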
## (Intercept) lcavol lweight age lbph
## 0.181560862 0.564341279 0.622019787 -0.021248185 0.096712523
## svi lcp gleason pgg45
## 0.761673403 -0.106050939 0.049227933 0.004457512
##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph
## 0.181561 0.564341 0.622020 -0.021248 0.096713
## svi lcp gleason pgg45
## 0.761673 -0.106051 0.049228 0.004458
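The shrunken coefficients below presumably come from a ridge fit of the same prostate model; a minimal sketch using MASS::lm.ridge (the lambda value is an assumption):

library(MASS)
prostateRidge <- lm.ridge(lpsa ~ ., data = prostate, lambda = 10) ## lambda = 10 is an assumed value
coef(prostateRidge) ## estimates are shrunk toward zero relative to the OLS fit above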
## (Intercept) lcavol lweight age lbph
## -0.023822581 0.470383515 0.595477805 -0.015328827 0.082534382
## svi lcp gleason pgg45
## 0.663639476 -0.022092251 0.066864682 0.003190709
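The output below shows a lasso fit whose penalty is so large that every coefficient is shrunk exactly to zero, leaving only the intercept (the sample mean of lpsa, about 2.478). A minimal sketch of one way to produce such a fit, assuming glmnet and an arbitrary large lambda (glmnet itself prints coefficients as a sparse matrix rather than a named vector):

library(glmnet) ## assumption: lasso via glmnet
x <- as.matrix(prostate[,1:8])
y <- prostate[,9]
bigLambdaFit <- glmnet(x, y, alpha = 1, lambda = 5) ## lambda = 5 is an assumed, deliberately large value
coef(bigLambdaFit) ## all 8 coefficients are 0; the intercept equals mean(y)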
## (Intercept) lcavol lweight age lbph svi lcp gleason
## 2.478387 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## pgg45
## 0.000000
library(lars)
## Loaded lars 1.2
x <- as.matrix(prostate[,1:8])
y <- prostate[,9]
lassoFit <- lars(x, y) ## lar for least angle regression
coef(lassoFit, s=2, mode="lambda") ## get beta estimate when lambda = 2.
## lcavol lweight age lbph svi lcp gleason
## 0.4506712 0.2910916 0.0000000 0.0000000 0.3475427 0.0000000 0.0000000
## pgg45
## 0.0000000
library(lars)
x <- as.matrix(prostate[,1:8])
y <- prostate[,9]
lassoFit <- lars(x, y) ## lar for least angle regression
coef(lassoFit, s=0.5, mode="fraction") ## get beta estimate when ||beta||_1 is constrained to 0.5*||beta_LS||_1
## lcavol lweight age lbph svi
## 0.4746756454 0.4143707391 0.0000000000 0.0000000000 0.4499667198
## lcp gleason pgg45
## 0.0000000000 0.0000000000 0.0001502949
\[\hat{\beta}^{IC} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\beta\|_0\]
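Because the \(\|\beta\|_0\) penalty makes this a combinatorial problem, it is typically approximated by best subset or stepwise search under an information criterion. A minimal sketch using base R's step() on the prostate OLS fit above (k = 2 corresponds to AIC; k = log(n) would give BIC):

stepFit <- step(lmFit, direction = "both", k = 2, trace = 0) ## stepwise AIC as a surrogate for the L0 penalty
coef(stepFit)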
\[\hat{\beta}^{elastic} = \arg\min_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1\]
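A minimal sketch of an elastic net fit on the prostate data using glmnet, reusing the x and y matrices defined for the lars fits above; note that glmnet parameterizes the penalty as a single \(\lambda\) with a mixing weight \(\alpha\) between the ridge and lasso terms, rather than separate \(\lambda_1\) and \(\lambda_2\) (the alpha and lambda values are assumptions):

library(glmnet)
enetFit <- glmnet(x, y, alpha = 0.5, lambda = 0.1) ## alpha = 0.5, lambda = 0.1 are assumed values
coef(enetFit)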
\[\min_{\beta=(\beta_{(1)},\dots,\beta_{(G)}) \in \mathbb{R}^p} \frac{1}{2} ||y-X\beta||_2^2 + \lambda \sum_{i=1}^G \sqrt{p_{(i)}} ||\beta_{(i)}||_2,\]
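A minimal sketch of a group lasso fit on the same x and y, assuming the grpreg package and a purely illustrative grouping of the eight predictors:

library(grpreg) ## assumption: group lasso via grpreg
grp <- c(1, 1, 1, 2, 2, 2, 3, 3) ## illustrative groups: predictors 1-3, 4-6, 7-8
grpFit <- grpreg(x, y, group = grp, penalty = "grLasso")
coef(grpFit, lambda = 0.1) ## lambda = 0.1 is an assumed value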
\[\min_{\beta \in \mathbb{R}^p} \frac{1}{2} || y - \beta ||_2^2 + \lambda \sum_{i=1}^{p-1} |\beta_i - \beta_{i+1}|\]
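A minimal sketch of the 1d fused lasso (signal approximation) using the genlasso package on a simulated piecewise-constant signal (the signal, noise, and lambda are assumptions):

library(genlasso) ## assumption: fused lasso via genlasso
set.seed(32608)
theta <- rep(c(0, 3, 1), each = 20) ## piecewise-constant true signal
ySignal <- theta + rnorm(length(theta))
flFit <- fusedlasso1d(ySignal) ## solution path over lambda
coef(flFit, lambda = 2)$beta ## fitted signal at an assumed lambda = 2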
Consider a general setting \[\min_{\beta \in \mathbb{R}^p} f(\beta) + \lambda ||D\beta||_1\] where \(f: \mathbb{R}^p \rightarrow \mathbb{R}\) is a smooth convex function and \(D \in \mathbb{R}^{m\times p}\) is a penalty matrix.
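A minimal sketch of the generalized lasso using genlasso, taking f to be the squared-error loss and D the first-difference matrix, which recovers the 1d fused lasso applied to ySignal above (the lambda value is again an assumption):

library(genlasso)
p <- length(ySignal)
D <- diff(diag(p)) ## (p-1) x p first-difference penalty matrix
glFit <- genlasso(y = ySignal, D = D) ## no X supplied: identity design (signal approximation)
coef(glFit, lambda = 2)$beta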