Zhiguang Huo (Caleb)
Monday October 22, 2018
In mathematics, “optimization” or “mathematical programming” refers to the selection of a best element (with regard to some criterion) from some set of available alternatives.
A typical optimization problem consists of maximizing or minimizing a real function (the objective function) by systematically choosing input values subject to certain constraints.
“Convex programming” studies the case when the objective function is convex (minimization) or concave (maximization) and the constraint set is convex.
Convex function properties:
Suppose our objective function is \[f(x) = e^x + x^4\] What is the minimum value of \(f(x)\), and what is the corresponding \(x\)?
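The output below can be reproduced with base R's one-dimensional optimizer optimize(); a minimal sketch (the original code chunk is not shown, and the search interval is an assumption):

f <- function(x) exp(x) + x^4 ## objective function
optimize(f, interval = c(-10, 10)) ## $minimum is the minimizer x; $objective is f at that x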
## $minimum
## [1] -0.5282468
##
## $objective
## [1] 0.6675038
Gradient descent: choose an initial \(x^{(0)} \in \mathbb{R}\), then repeat: \[x^{(k)} = x^{(k - 1)} - t\times f'(x^{(k - 1)}), \quad k = 1,2,3,\ldots\]
This update can be derived from a quadratic approximation of \(f\) around \(x^{(k-1)}\): \[f(y) \approx f(x^{(k - 1)}) + f'(x^{(k - 1)}) \times (y - x^{(k - 1)}) + \frac{1}{2t} (y - x^{(k - 1)})^2\]
Minimizing this quadratic approximation over \(y\): \[x^{(k)} = \arg\min_y f(y)\]
Setting the derivative of the approximation (with respect to \(y\)) to zero at \(y = x^{(k)}\): \[f'(x^{(k - 1)}) + \frac{1}{t} (x^{(k)} - x^{(k - 1)}) = 0 \Leftrightarrow x^{(k)} = x^{(k - 1)} - t\times f'(x^{(k - 1)})\]
This is exactly the gradient descent update.
f <- function(x) exp(x) + x^4 ## original function
g <- function(x) exp(x) + 4 * x^3 ## gradient function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function
x <- x0 <- 0.8; x_old <- 0; t <- 0.1; k <- 0; error <- 1e-6
trace <- x
points(x0, f(x0), col=1, pch=1)
while(abs(x - x_old) > error){
k <- k + 1
x_old <- x
x <- x_old - t*g(x_old)
trace <- c(trace, x) ## collecting results
points(x, f(x), col=k, pch=19)
}
## [1] 0.80000000 0.37264591 0.20678990 0.08028037 -0.02828567
## [6] -0.12548768 -0.21290391 -0.28986708 -0.35496119 -0.40719158
## [11] -0.44673749 -0.47504573 -0.49435026 -0.50702281 -0.51511484
## [16] -0.52018513 -0.52332289 -0.52524940 -0.52642641 -0.52714331
## [21] -0.52757914 -0.52784381 -0.52800441 -0.52810183 -0.52816091
## [26] -0.52819673 -0.52821844 -0.52823161 -0.52823959 -0.52824443
## [31] -0.52824736 -0.52824914 -0.52825021 -0.52825087
f <- function(x) x^2 ## original function
g <- function(x) 2*x ## gradient function
curve(f, from = -10, to = 10, lwd=2) ## visualize the objective function
x <- x0 <- 1; x_old <- 0; t <- 1.1; k <- 0; error <- 1e-6
trace <- x
points(x0, f(x0), col=1, pch=1)
while(abs(x - x_old) > error & k < 30){
k <- k + 1
x_old <- x
x <- x_old - t*g(x_old)
trace <- c(trace, x) ## collecting results
points(x, f(x), col=k, pch=19)
segments(x0=x, y0 = f(x), x1 = x_old, y1 = f(x_old))
}
## [1] 1.000000 -1.200000 1.440000 -1.728000 2.073600
## [6] -2.488320 2.985984 -3.583181 4.299817 -5.159780
## [11] 6.191736 -7.430084 8.916100 -10.699321 12.839185
## [16] -15.407022 18.488426 -22.186111 26.623333 -31.948000
## [21] 38.337600 -46.005120 55.206144 -66.247373 79.496847
## [26] -95.396217 114.475460 -137.370552 164.844662 -197.813595
## [31] 237.376314
f <- function(x) x^2 ## original function
g <- function(x) 2*x ## gradient function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function
x <- x0 <- 1; x_old <- 0; t <- 0.2; k <- 0; error <- 1e-6
trace <- x
points(x0, f(x0), col=1, pch=1)
while(abs(x - x_old) > error & k < 30){
k <- k + 1
x_old <- x
x <- x_old - t*g(x_old)
trace <- c(trace, x) ## collecting results
points(x, f(x), col=k, pch=19)
segments(x0=x, y0 = f(x), x1 = x_old, y1 = f(x_old))
}
## [1] 1.000000e+00 6.000000e-01 3.600000e-01 2.160000e-01 1.296000e-01
## [6] 7.776000e-02 4.665600e-02 2.799360e-02 1.679616e-02 1.007770e-02
## [11] 6.046618e-03 3.627971e-03 2.176782e-03 1.306069e-03 7.836416e-04
## [16] 4.701850e-04 2.821110e-04 1.692666e-04 1.015600e-04 6.093597e-05
## [21] 3.656158e-05 2.193695e-05 1.316217e-05 7.897302e-06 4.738381e-06
## [26] 2.843029e-06 1.705817e-06 1.023490e-06
f <- function(x) x^2 ## original function
g <- function(x) 2*x ## gradient function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function
x <- x0 <- 1; x_old <- 0; t <- 0.01; k <- 0; error <- 1e-6
trace <- x
points(x0, f(x0), col=1, pch=1)
while(abs(x - x_old) > error & k < 30){
k <- k + 1
x_old <- x
x <- x_old - t*g(x_old)
trace <- c(trace, x) ## collecting results
points(x, f(x), col=k, pch=19)
segments(x0=x, y0 = f(x), x1 = x_old, y1 = f(x_old))
}
## [1] 1.0000000 0.9800000 0.9604000 0.9411920 0.9223682 0.9039208 0.8858424
## [8] 0.8681255 0.8507630 0.8337478 0.8170728 0.8007314 0.7847167 0.7690224
## [15] 0.7536419 0.7385691 0.7237977 0.7093218 0.6951353 0.6812326 0.6676080
## [22] 0.6542558 0.6411707 0.6283473 0.6157803 0.6034647 0.5913954 0.5795675
## [29] 0.5679762 0.5566167 0.5454843
Linear approximation (first-order Taylor expansion) at the current point \(x\): \[f(x + t\Delta x) \approx f(x) + tf'(x)\Delta x\]
Decrease the slope of the approximation by a factor \(\alpha\):
\[f(x) + t\alpha f'(x)\Delta x\]
A sufficient condition for a proposed \(t\) to guarantee descent (a decrease in the objective function): \[f(x + t\Delta x) < f(x) + t\alpha f'(x)\Delta x\]
Plugging in \(\Delta x = -f'(x)\): \[f(x - tf'(x)) < f(x) - t\alpha (f'(x))^2\]
Note: re-initialize \(t = 1\) before each backtracking line search.
f <- function(x) exp(x) + x^4 ## original function
g <- function(x) exp(x) + 4 * x^3 ## gradient function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function
x <- x0 <- 1; x_old <- 0; t0 <- 1; k <- 0; error <- 1e-6; beta = 0.8; alpha <- 0.4
trace <- x
points(x0, f(x0), col=1, pch=1)
while(abs(x - x_old) > error & k < 30){
k <- k + 1
x_old <- x
## backtracking
t <- t0
while(f(x_old - t*g(x_old)) > f(x_old) - alpha * t * g(x_old)^2){
t <- t * beta
}
x <- x_old - t*g(x_old)
trace <- c(trace, x) ## collecting results
points(x, f(x), col=k, pch=19)
segments(x0=x, y0 = f(x), x1 = x_old, y1 = f(x_old))
}
## [1] 1.00000000 0.09828748 -0.61024238 -0.53353118 -0.52803658 -0.52825877
## [7] -0.52825165 -0.52825188
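The running data example is the prostate dataset from the ElemStatLearn package; the structure output below can be reproduced with something like the following (a sketch, since the original chunk is not shown):

library(ElemStatLearn) ## provides the prostate data
str(prostate[, c("svi", "lcavol")]) ## binary outcome svi and predictor lcavol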
## 'data.frame': 97 obs. of 2 variables:
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcavol: num -0.58 -0.994 -0.511 -1.204 0.751 ...
\[\log \frac{E(Y|x)}{1 - E(Y|x)} = \beta_0 + x\beta_1\]
glm_binomial_logit <- glm(svi ~ lcavol, data = prostate, family = binomial())
summary(glm_binomial_logit)
##
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.79924 -0.48354 -0.21025 -0.04274 2.32135
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.0296 1.0429 -4.823 1.42e-06 ***
## lcavol 1.9798 0.4543 4.358 1.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 101.35 on 96 degrees of freedom
## Residual deviance: 64.14 on 95 degrees of freedom
## AIC: 68.14
##
## Number of Fisher Scoring iterations: 6
Density function (Bernoulli, with \(p(x) = E(Y|x)\)): \[f(y) = p(x)^y(1 - p(x))^{1 - y}\]
Likelihood function: \[L(\beta_0, \beta_1) = \prod_{i = 1}^n p(x_i)^{y_i}(1 - p(x_i))^{1 - y_i}\]
Logistic regression with logit link: \[\log \frac{p(x)}{1 - p(x)} = \beta_0 + x\beta_1\]
Log-likelihood function:
\(\begin{aligned} l(\beta_0, \beta_1) &= \log\prod_{i = 1}^n p(x_i)^{y_i}(1 - p(x_i))^{(1 - y_i)} \\ & = \sum_{i=1}^n y_i \log p(x_i) + (1 - y_i) \log (1 - p(x_i)) \\ & = \sum_{i=1}^n \log (1 - p(x_i)) + \sum_{i=1}^n y_i \log \frac{p(x_i)}{1 - p(x_i)} \\ & = \sum_{i=1}^n -\log(1 + \exp(\beta_0 + x_i \beta_1)) + \sum_{i=1}^n y_i (\beta_0 + x_i\beta_1)\\ \end{aligned}\)
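As a sketch (not part of the original notes), the log-likelihood derived above can be coded directly and maximized numerically, then compared with the glm() estimates:

library(ElemStatLearn)
x <- prostate$lcavol
y <- prostate$svi
negloglik <- function(beta){
  eta <- beta[1] + x * beta[2] ## linear predictor beta0 + x * beta1
  sum(log(1 + exp(eta))) - sum(y * eta) ## negative of the log likelihood above
}
fit <- optim(c(0, 0), negloglik, method = "BFGS") ## minimize the negative log likelihood
fit$par ## should be close to the glm() estimates (about -5.03 and 1.98)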
For unconstrained, smooth univariate convex optimization \[\min f(x)\] where \(f\) is convex and twice differentiable, \(x \in \mathbb{R}\), \(f(x) \in \mathbb{R}\). For gradient descent, start with an initial value \(x^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until convergence: \[x^{(k)} = x^{(k - 1)} - t_k f'(x^{(k - 1)})\]
For Newton's method, start with an initial value \(x^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until convergence: \[x^{(k)} = x^{(k - 1)} - (f''(x^{(k - 1)}))^{-1} f'(x^{(k - 1)})\] where \(f''(x^{(k - 1)})\) is the second derivative of \(f\) at \(x^{(k - 1)}\); in higher dimensions (e.g. \(x \in \mathbb{R}^p\)) it generalizes to the Hessian matrix.
For a gradient descent step at \(x\), we minimize the quadratic approximation \[f(y) \approx f(x) + f'(x)(y - x) + \frac{1}{2t}(y - x)^2\] over \(y\), which yields the update \(x^{(k)} = x^{(k - 1)} - tf'(x^{(k-1)})\).
Newton's method uses a better quadratic approximation: \[f(y) \approx f(x) + f'(x)(y - x) + \frac{1}{2}f''(x)(y - x)^2\] Minimizing over \(y\) yields \(x^{(k)} = x^{(k - 1)} - (f''(x^{(k-1)}))^{-1}f'(x^{(k-1)})\).
f <- function(x) exp(x) + x^4 ## original function
g <- function(x) exp(x) + 4 * x^3 ## gradient function
h <- function(x) exp(x) + 12 * x^2 ## Hessian function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function
x <- x0 <- 0.8; x_old <- 0; t <- 0.1; k <- 0; error <- 1e-6
trace <- x
points(x0, f(x0), col=1, pch=1)
while(abs(x - x_old) > error){
k <- k + 1
x_old <- x
x <- x_old - 1/h(x_old)*g(x_old)
trace <- c(trace, x) ## collecting results
points(x, f(x), col=k, pch=19)
}
lines(trace, f(trace), lty = 2)
## [1] 0.8000000 0.3685707 -0.1665553 -0.8686493 -0.6362012 -0.5432407
## [7] -0.5285880 -0.5282520 -0.5282519
f <- function(x) exp(x) + x^4 ## original function
par(mfrow=c(1,2))
## trace_grad and trace_newton hold the iterates saved from the gradient
## descent and Newton's method runs above
title_grad <- paste("gradient descent, nstep =", length(trace_grad))
curve(f, from = -1, to = 1, lwd=2, main = title_grad)
points(trace_grad, f(trace_grad), col=seq_along(trace_grad), pch=19)
lines(trace_grad, f(trace_grad), lty = 2)
title_newton <- paste("Newton's method, nstep =", length(trace_newton))
curve(f, from = -1, to = 1, lwd=2, main = title_newton)
points(trace_newton, f(trace_newton), col=seq_along(trace_newton), pch=19)
lines(trace_newton, f(trace_newton), lty = 2)
Note that pure Newton's method uses \(t = 1\).
f <- function(x) exp(x) + x^4 ## original function
g <- function(x) exp(x) + 4 * x^3 ## gradient function
h <- function(x) exp(x) + 12 * x^2 ## Hessian function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function
x <- x0 <- 1; x_old <- 0; t0 <- 1; k <- 0; error <- 1e-6; beta = 0.8; alpha <- 0.4
trace <- x
points(x0, f(x0), col=1, pch=1)
while(abs(x - x_old) > error & k < 30){
k <- k + 1
x_old <- x
## backtracking
t <- t0
while(f(x_old - t*g(x_old)/h(x_old)) > f(x_old) - alpha * t * g(x_old)^2/h(x_old)){
t <- t * beta
}
x <- x_old - t*g(x_old)/h(x_old)
trace <- c(trace, x) ## collecting results
points(x, f(x), col=k, pch=19)
segments(x0=x, y0 = f(x), x1 = x_old, y1 = f(x_old))
}
## [1] 1.00000000 0.54354171 0.09465801 -0.63631402 -0.54326938 -0.52858926
## [7] -0.52825205 -0.52825187
library(ElemStatLearn)
glm_binomial_logit <- glm(svi ~ lcavol, data = prostate, family = binomial())
summary(glm_binomial_logit)
##
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.79924 -0.48354 -0.21025 -0.04274 2.32135
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.0296 1.0429 -4.823 1.42e-06 ***
## lcavol 1.9798 0.4543 4.358 1.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 101.35 on 96 degrees of freedom
## Residual deviance: 64.14 on 95 degrees of freedom
## AIC: 68.14
##
## Number of Fisher Scoring iterations: 6
Method | Gradient descent | Newton's method |
---|---|---|
Order | First-order method | Second-order method |
Criterion | smooth \(f\) | doubly smooth (twice differentiable) \(f\) |
Convergence (# iterations) | Slow | Fast |
Iteration cost | cheap (compute gradient) | moderate to expensive (compute Hessian) |
What if \(\beta\) is not a scalar but a vector (i.e. \(\beta \in \mathbb{R}^p\))?
How do we calculate the gradient (first derivative) in the multivariate case?
How do we calculate the Hessian matrix in the multivariate case?
Suppose \(f: \mathbb{R}^p \to \mathbb{R}\) is differentiable. Then the gradient is \[\nabla_\beta f(\beta) = \frac{\partial f(\beta)}{\partial \beta} = (\frac{\partial f(\beta)}{\partial \beta_1}, \frac{\partial f(\beta)}{\partial \beta_2}, \ldots, \frac{\partial f(\beta)}{\partial \beta_p})^\top \in \mathbb{R}^p\]
The Hessian matrix is obtained by applying the gradient operator to each component of the gradient: \[\Delta_\beta f(\beta) = \nabla^2_\beta f(\beta) = \nabla_\beta\frac{\partial f(\beta)}{\partial \beta} = (\nabla_\beta \frac{\partial f(\beta)}{\partial \beta_1}, \nabla_\beta \frac{\partial f(\beta)}{\partial \beta_2}, \ldots, \nabla_\beta \frac{\partial f(\beta)}{\partial \beta_p})^\top \]
\[\Delta_\beta f(\beta) = \begin{pmatrix} \frac{\partial^2 f(\beta)}{\partial \beta_1^2} & \frac{\partial^2 f(\beta)}{\partial \beta_1 \partial \beta_2} & \ldots & \frac{\partial^2 f(\beta)}{\partial \beta_1 \partial\beta_p}\\ \frac{\partial^2 f(\beta)}{\partial \beta_2 \partial \beta_1 } & \frac{\partial^2 f(\beta)}{\partial \beta_2^2} & \ldots & \frac{\partial^2 f(\beta)}{\partial \beta_2 \partial \beta_p} \\ \ldots &\ldots &\ldots &\ldots\\ \frac{\partial^2 f(\beta)}{\partial \beta_p \partial \beta_1} & \frac{\partial^2 f(\beta)}{\partial \beta_p \partial \beta_2} & \ldots & \frac{\partial^2 f(\beta)}{\partial\beta_p^2} \end{pmatrix} \]
\[f(x,y) = 4x^2 + y^2 + 2xy - x - y\]
Gradient
\[\nabla f = (\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y})^\top = \binom{8x + 2y - 1}{2y +2x - 1} \in \mathbb{R}^2\]
Hessian
\[\Delta f = \begin{pmatrix} 8 & 2 \\ 2 & 2 \end{pmatrix} \in \mathbb{R}^{2\times 2}\]
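As a quick numerical sanity check (not in the original notes), the gradient above can be verified by central finite differences; num_grad below is a hypothetical helper:

f <- function(v) 4*v[1]^2 + v[2]^2 + 2*v[1]*v[2] - v[1] - v[2]
num_grad <- function(f, v, eps = 1e-6){
  sapply(seq_along(v), function(j){
    e <- replace(numeric(length(v)), j, eps)
    (f(v + e) - f(v - e)) / (2 * eps) ## central difference in coordinate j
  })
}
v0 <- c(1, -2)
num_grad(f, v0) ## analytic gradient: c(8*1 + 2*(-2) - 1, 2*(-2) + 2*1 - 1) = c(3, -3)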
Gradient descent: choose an initial \(\beta^{(0)} \in \mathbb{R}^p\), then repeat: \[\beta^{(k)} = \beta^{(k - 1)} - t\times \nabla_\beta f(\beta^{(k - 1)}), \quad k = 1,2,3,\ldots\]
f <- function(x,y){
4 * x^2 + y^2 + 2 * x * y - x - y
}
g <- function(x,y){
c(8*x + 2*y - 1, 2*y +2*x - 1)
}
xx <- yy <- seq(-20, 20, len=1000)
zz <- outer(xx, yy, f)
# Contour plots
contour(xx,yy,zz, xlim = c(-10,10), ylim = c(-20,20), nlevels = 10, levels = seq(1,100,length.out = 10), main="Contour Plot")
x <- x0 <- 8; x_old <- 0;
y <- y0 <- -10; y_old <- 0;
curXY <- c(x,y)
preXY <- c(x_old, y_old)
t <- 0.1; k <- 0; error <- 1e-6
trace <- list(curXY)
points(x0,y0, col=1, pch=1)
l2n <- function(avec){
sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){
deltaVec <- avec - bvec
sumVec <- avec + bvec
l2n(deltaVec)/l2n(sumVec)
}
while(diffChange(curXY, preXY) > error){
k <- k + 1
preXY <- curXY
curXY <- preXY - t*g(preXY[1], preXY[2])
trace <- c(trace, list(curXY)) ## collecting results
points(curXY[1], curXY[2], col=k, pch=19)
segments(curXY[1], curXY[2], preXY[1], preXY[2])
}
## [1] 97
f <- function(x,y){
4 * x^2 + y^2 + 2 * x * y - x - y
}
g <- function(x,y){
c(8*x + 2*y - 1, 2*y +2*x - 1)
}
B <- 1000
x <- x0 <- 0.8; x_old <- 0;
y <- y0 <- -0.10; y_old <- 0;
plot(x, y, xlim=c(-B,B), ylim=c(-B,B)) ## set up the plotting region
#contour(x,y,z, xlim = c(-100,100), ylim = c(-200,200), nlevels = 10, levels = seq(1,10000,length.out = 10), main="Contour Plot")
curXY <- c(x,y)
preXY <- c(x_old, y_old)
t <- 0.8; k <- 0; error <- 1e-6
trace <- list(curXY)
points(x0,y0, col=1, pch=1)
l2n <- function(avec){
sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){
deltaVec <- avec - bvec
sumVec <- avec + bvec
l2n(deltaVec)/l2n(sumVec)
}
## repeat the following chunk of code to see divergence of gradient descent
k <- k + 1
preXY <- curXY
curXY <- preXY - t*g(preXY[1], preXY[2])
points(curXY[1], curXY[2], col=k, pch=19)
segments(curXY[1], curXY[2], preXY[1], preXY[2])
Linear approximation (first-order Taylor expansion) at the current point \(x\): \[f(x + t\Delta x) \approx f(x) + t\nabla f(x)^\top\Delta x\]
Decrease the slope of the approximation by a factor \(\alpha\):
\[f(x) + t\alpha \nabla f(x)^\top\Delta x\]
A sufficient condition for a proposed \(t\) to guarantee descent (a decrease in the objective function): \[f(x + t\Delta x) < f(x) + t\alpha \nabla f(x)^\top\Delta x\]
Plugging in \(\Delta x = -\nabla f(x)\): \[f(x - t\nabla f(x)) < f(x) - \alpha t \|\nabla f(x)\|_2^2\]
f0 <- function(x,y){
4 * x^2 + y^2 + 2 * x * y - x - y
}
f <- function(avec){
x <- avec[1]
y <- avec[2]
4 * x^2 + y^2 + 2 * x * y - x - y
}
g <- function(avec){
x <- avec[1]
y <- avec[2]
c(8*x + 2*y - 1, 2*y +2*x - 1)
}
x <- y <- seq(-20, 20, len=1000)
z <- outer(x, y, f0)
# Contour plots
contour(x,y,z, xlim = c(-10,10), ylim = c(-20,20), nlevels = 10, levels = seq(1,100,length.out = 10), main="Contour Plot")
x <- x0 <- 8; x_old <- 0;
y <- y0 <- -10; y_old <- 0;
curXY <- c(x,y)
preXY <- c(x_old, y_old)
t0 <- 1; k <- 0; error <- 1e-6
alpha = 1/3
beta = 1/2
trace <- list(curXY)
points(x0,y0, col=1, pch=1)
l2n <- function(avec){
sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){
deltaVec <- avec - bvec
sumVec <- avec + bvec
l2n(deltaVec)/l2n(sumVec)
}
while(diffChange(curXY, preXY) > error){
k <- k + 1
preXY <- curXY
## backtracking
t <- t0
while(f(preXY - t*g(preXY)) > f(preXY) - alpha * t * l2n(g(preXY))^2){
t <- t * beta
}
curXY <- preXY - t*g(preXY)
trace <- c(trace, list(curXY)) ## collecting results
points(curXY[1], curXY[2], col=k, pch=19)
segments(curXY[1], curXY[2], preXY[1], preXY[2])
}
## [1] 25
library(ElemStatLearn)
glm_binomial_logit <- glm(svi ~ lcavol, data = prostate, family = binomial())
summary(glm_binomial_logit)
##
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.79924 -0.48354 -0.21025 -0.04274 2.32135
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.0296 1.0429 -4.823 1.42e-06 ***
## lcavol 1.9798 0.4543 4.358 1.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 101.35 on 96 degrees of freedom
## Residual deviance: 64.14 on 95 degrees of freedom
## AIC: 68.14
##
## Number of Fisher Scoring iterations: 6
For unconstrained, smooth multivariate convex optimization \[\min f(\beta)\] where \(f\) is convex and twice differentiable, \(\beta \in \mathbb{R}^p\), \(f(\beta) \in \mathbb{R}\). For gradient descent, start with an initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until convergence: \[\beta^{(k)} = \beta^{(k - 1)} - t_k \nabla f(\beta^{(k - 1)})\]
For Newton's method, start with an initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until convergence: \[\beta^{(k)} = \beta^{(k - 1)} - (\Delta f(\beta^{(k - 1)}))^{-1} \nabla f(\beta^{(k - 1)})\] where \(\Delta f(\beta^{(k - 1)}) \in \mathbb{R}^{p\times p}\) is the Hessian matrix of \(f\) at \(\beta^{(k - 1)}\).
f0 <- function(x,y){
4 * x^2 + y^2 + 2 * x * y - x - y
}
f <- function(avec){
x <- avec[1]
y <- avec[2]
4 * x^2 + y^2 + 2 * x * y - x - y
}
g <- function(avec){
x <- avec[1]
y <- avec[2]
c(8*x + 2*y - 1, 2*y +2*x - 1)
}
h <- function(avec){
x <- avec[1]
y <- avec[2]
res <- matrix(c(8,2,2,2),2,2) ## Hessian function
return(res)
}
x <- y <- seq(-20, 20, len=1000)
z <- outer(x, y, f0)
# Contour plots
contour(x,y,z, xlim = c(-10,10), ylim = c(-20,20), nlevels = 10, levels = seq(1,100,length.out = 10), main="Contour Plot")
x <- x0 <- 8; x_old <- 0;
y <- y0 <- -10; y_old <- 0;
curXY <- c(x,y)
preXY <- c(x_old, y_old)
t0 <- 1; k <- 0; error <- 1e-6
trace <- list(curXY)
points(x0,y0, col=1, pch=1)
l2n <- function(avec){
sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){
deltaVec <- avec - bvec
sumVec <- avec + bvec
l2n(deltaVec)/l2n(sumVec)
}
while(diffChange(curXY, preXY) > error){
k <- k + 1
preXY <- curXY
curXY <- as.vector(preXY - solve(h(preXY)) %*% g(preXY)) ## Newton step
trace <- c(trace, list(curXY)) ## collecting results
points(curXY[1], curXY[2], col=k, pch=19)
segments(curXY[1], curXY[2], preXY[1], preXY[2])
}
## [1] 2
\[f(x,y) = \exp(xy) + y^2 + x^4\]
What are the values of \(x\) and \(y\) that minimize \(f(x,y)\)?
Note that pure Newton's method uses \(t = 1\).
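A sketch of one possible solution (not the original worked answer), applying pure Newton's method with a hand-derived gradient and Hessian; the starting point and iteration cap are arbitrary choices:

f <- function(v) exp(v[1]*v[2]) + v[2]^2 + v[1]^4 ## objective
g <- function(v) c(v[2]*exp(v[1]*v[2]) + 4*v[1]^3, ## df/dx
                   v[1]*exp(v[1]*v[2]) + 2*v[2]) ## df/dy
h <- function(v){ ## Hessian
  e <- exp(v[1]*v[2])
  matrix(c(v[2]^2*e + 12*v[1]^2, (1 + v[1]*v[2])*e,
           (1 + v[1]*v[2])*e, v[1]^2*e + 2), 2, 2)
}
curXY <- c(0.5, 0.5); preXY <- c(Inf, Inf); error <- 1e-6; k <- 0
while(sqrt(sum((curXY - preXY)^2)) > error && k < 50){
  k <- k + 1
  preXY <- curXY
  curXY <- as.vector(preXY - solve(h(preXY)) %*% g(preXY)) ## pure Newton step (t = 1)
}
curXY; f(curXY) ## note: (0, 0) is a stationary point but a saddle, so the starting point matters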
library(ElemStatLearn)
glm_binomial_logit <- glm(svi ~ lcavol, data = prostate, family = binomial())
summary(glm_binomial_logit)
##
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.79924 -0.48354 -0.21025 -0.04274 2.32135
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.0296 1.0429 -4.823 1.42e-06 ***
## lcavol 1.9798 0.4543 4.358 1.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 101.35 on 96 degrees of freedom
## Residual deviance: 64.14 on 95 degrees of freedom
## AIC: 68.14
##
## Number of Fisher Scoring iterations: 6
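A sketch of one possible solution (not the original code): Newton's method for the logistic regression of svi on lcavol, using the gradient \(X^\top(y - p)\) and Hessian \(-X^\top W X\) of the log-likelihood, checked against the glm() output above:

library(ElemStatLearn)
X <- cbind(1, prostate$lcavol) ## design matrix with intercept column
y <- prostate$svi
beta <- c(0, 0); beta_old <- c(Inf, Inf); error <- 1e-6; k <- 0
while(sqrt(sum((beta - beta_old)^2)) > error && k < 50){
  k <- k + 1
  beta_old <- beta
  p <- as.vector(1 / (1 + exp(-X %*% beta))) ## fitted probabilities
  W <- diag(p * (1 - p)) ## diagonal weight matrix
  beta <- as.vector(beta + solve(t(X) %*% W %*% X) %*% t(X) %*% (y - p)) ## Newton step
}
beta ## should be close to the glm() estimates (about -5.03 and 1.98)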
\[ \min_{\beta \in \mathbb{R}^p} f(\beta) = \min_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y - X\beta\|_2^2\] where \(y \in \mathbb{R}^n\) and \(X \in \mathbb{R}^{n \times p}\) with columns \(X_1\) (intercept), \(X_2\), \(\ldots\), \(X_p\).
Minimizing over \(\beta_j\) while fixing all \(\beta_i\), \(i \ne j\): \[0 = \frac{\partial f(\beta)}{\partial \beta_j} = X_j^\top (X\beta - y) = X_j^\top (X_j\beta_j + X_{-j}\beta_{-j} - y)\]
We can solve for the coordinate descent updating rule:
\[\beta_j = \frac{X_j^\top (y - X_{-j}\beta_{-j})}{X_j^\top X_j}\]
library(ElemStatLearn)
y <- prostate$lcavol
x0 <- prostate[,-match(c("lcavol", "train"), colnames(prostate))]
x <- cbind(1, as.matrix(x0))
beta <- rep(0,ncol(x))
beta_old <- rnorm(ncol(x))
error <- 1e-6
k <- 0
while(diffChange(beta, beta_old) > error){
k <- k + 1
beta_old <- beta
for(j in 1:length(beta)){
xj <- x[,j]
xj_else <- x[,-j]
beta_j_else <- as.matrix(beta[-j])
beta[j] <- t(xj) %*% (y - xj_else %*% beta_j_else) / sum(xj^2)
}
}
print(beta)
## [1] -2.419610178 -0.025954949 0.022114840 -0.093134994 -0.152983715
## [6] 0.367195579 0.197095411 -0.007114941 0.565822653
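For comparison, the reference output below comes from an ordinary least squares fit; presumably the chunk was simply (a sketch of the call shown in the output):

lm(lcavol ~ . - train, data = prostate) ## least squares fit of the same model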
##
## Call:
## lm(formula = lcavol ~ . - train, data = prostate)
##
## Coefficients:
## (Intercept) lweight age lbph svi
## -2.420613 -0.025925 0.022113 -0.093138 -0.152966
## lcp gleason pgg45 lpsa
## 0.367180 0.197249 -0.007117 0.565826