Zhiguang Huo (Caleb)
Monday Nov 27, 2017
Gradient descent: choose an initial \(\beta^{(0)} \in \mathbb{R}\) and repeat: \[\beta^{(k)} = \beta^{(k - 1)} - t\times f'(\beta^{(k - 1)}), \quad k = 1,2,3,\ldots\]
Interpretation: around the current iterate \(\beta^{(k - 1)}\), approximate \(f\) by the quadratic \[f(y) \approx f(\beta^{(k - 1)}) + f'(\beta^{(k - 1)}) \times (y - \beta^{(k - 1)}) + \frac{1}{2t} (y - \beta^{(k - 1)})^2\]
Minimizing the quadratic approximation over \(y\): \[\beta^{(k)} = \arg\min_y f(y)\]
Setting the derivative of the approximation with respect to \(y\) to zero at \(y = \beta^{(k)}\) gives \[f'(\beta^{(k - 1)}) + \frac{1}{t} (\beta^{(k)} - \beta^{(k - 1)}) = 0 \Leftrightarrow \beta^{(k)} = \beta^{(k - 1)} - t\times f'(\beta^{(k - 1)}),\]
which is exactly the gradient descent update.
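As a toy illustration (not from the original notes; the function \(f(\beta) = (\beta - 3)^2\), the step size, and the starting value are arbitrary choices), the univariate update can be run directly:
fprime <- function(b) 2 * (b - 3)   ## derivative of f(b) = (b - 3)^2
b <- 0                              ## initial value beta^(0)
t <- 0.1                            ## step size
for(k in 1:100){
  b <- b - t * fprime(b)            ## gradient descent update
}
b                                   ## approximately 3, the minimizer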
For unconstrained, smooth univariate convex optimization \[\min f(\beta)\] where \(f\) is convex and twice differentiable, with \(\beta \in \mathbb{R}\) and \(f(\beta) \in \mathbb{R}\): for gradient descent, start with an initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until convergence \[\beta^{(k)} = \beta^{(k - 1)} - t_k f'(\beta^{(k - 1)})\]
For Newton’s method, start with an initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until convergence \[\beta^{(k)} = \beta^{(k - 1)} - (f''(\beta^{(k - 1)}))^{-1} f'(\beta^{(k - 1)})\] where \(f''(\beta^{(k - 1)})\) is the second derivative of \(f\) at \(\beta^{(k - 1)}\). In higher dimensions (e.g. \(\beta \in \mathbb{R}^p\)), the second derivative is referred to as the Hessian matrix.
For a gradient descent step at \(\beta\), we minimize the quadratic approximation \[f(y) \approx f(\beta) + f'(\beta)(y - \beta) + \frac{1}{2t}(y - \beta)^2\] over \(y\), which yields the update \(\beta^{(k)} = \beta^{(k - 1)} - tf'(\beta^{(k-1)})\).
Newton’s method uses a better quadratic approximation: \[f(y) \approx f(\beta) + f'(\beta)(y - \beta) + \frac{1}{2}f''(\beta)(y - \beta)^2\] Minimizing over \(y\) yields \(\beta^{(k)} = \beta^{(k - 1)} - (f''(\beta^{(k-1)}))^{-1}f'(\beta^{(k-1)})\).
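A toy illustration of the univariate Newton update (again not from the original notes) on the non-quadratic convex function \(f(\beta) = e^\beta + \beta^2\), whose minimizer solves \(e^\beta + 2\beta = 0\); the starting value and tolerance are arbitrary choices:
fprime <- function(b) exp(b) + 2 * b       ## f'(b) for f(b) = exp(b) + b^2
fdoubleprime <- function(b) exp(b) + 2     ## f''(b)
b <- 1                                     ## initial value
for(k in 1:20){
  b_new <- b - fprime(b) / fdoubleprime(b) ## Newton update
  if(abs(b_new - b) < 1e-10) break
  b <- b_new
}
b                                          ## approximately -0.3517, where f'(b) = 0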
What if \(\beta\) is not a scalar but a vector (i.e. \(\beta \in \mathbb{R}^p\))?
How do we calculate the gradient (first derivative) in the multivariate case?
How do we calculate the Hessian matrix (second derivative) in the multivariate case?
For \(f: \mathbb{R}^p \rightarrow \mathbb{R}\) with \(\beta = (\beta_1, \ldots, \beta_p)^\top\), the gradient is \[\nabla_\beta f(\beta) = \frac{\partial f(\beta)}{\partial \beta} = \left(\frac{\partial f(\beta)}{\partial \beta_1}, \frac{\partial f(\beta)}{\partial \beta_2}, \ldots, \frac{\partial f(\beta)}{\partial \beta_p}\right)^\top \in \mathbb{R}^p\]
f <- function(x, y){
  4 * x^2 + y^2 + 2 * x * y - x - y
}

x <- y <- seq(-20, 20, len = 1000)
z <- outer(x, y, f)

# Contour plots
contour(x, y, z, xlim = c(-10, 10), ylim = c(-20, 20), nlevels = 10,
        levels = seq(1, 100, length.out = 10), main = "Contour Plot")
Recall the gradient \[\nabla_\beta f(\beta) = \frac{\partial f(\beta)}{\partial \beta} = \left(\frac{\partial f(\beta)}{\partial \beta_1}, \frac{\partial f(\beta)}{\partial \beta_2}, \ldots, \frac{\partial f(\beta)}{\partial \beta_p}\right)^\top \in \mathbb{R}^p\]
The Hessian matrix is the gradient of the gradient: \[\Delta_\beta f(\beta) = \nabla^2_\beta f(\beta) = \nabla_\beta\frac{\partial f(\beta)}{\partial \beta} = \left(\nabla_\beta \frac{\partial f(\beta)}{\partial \beta_1}, \nabla_\beta \frac{\partial f(\beta)}{\partial \beta_2}, \ldots, \nabla_\beta \frac{\partial f(\beta)}{\partial \beta_p}\right)^\top \]
Written out entry by entry, \[\Delta_\beta f(\beta) = \begin{pmatrix} \frac{\partial^2 f(\beta)}{\partial \beta_1^2} & \frac{\partial^2 f(\beta)}{\partial \beta_1 \partial \beta_2} & \ldots & \frac{\partial^2 f(\beta)}{\partial \beta_1 \partial\beta_p}\\ \frac{\partial^2 f(\beta)}{\partial \beta_2 \partial \beta_1 } & \frac{\partial^2 f(\beta)}{\partial \beta_2^2} & \ldots & \frac{\partial^2 f(\beta)}{\partial \beta_2 \partial \beta_p} \\ \vdots &\vdots &\ddots &\vdots\\ \frac{\partial^2 f(\beta)}{\partial \beta_p \partial \beta_1} & \frac{\partial^2 f(\beta)}{\partial \beta_p \partial \beta_2} & \ldots & \frac{\partial^2 f(\beta)}{\partial\beta_p^2} \end{pmatrix} \in \mathbb{R}^{p \times p}\]
\[f(x,y) = 4x^2 + y^2 + 2xy - x - y\]
Gradient:
\[\nabla f = \left(\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}\right)^\top = \begin{pmatrix} 8x + 2y - 1 \\ 2x + 2y - 1 \end{pmatrix} \in \mathbb{R}^2\]
Hessian:
\[\Delta f = \begin{pmatrix} 8 & 2 \\ 2 & 2 \end{pmatrix} \in \mathbb{R}^{2\times 2}\]
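As a quick sanity check (a sketch, not part of the original notes; the test point and the variable names eps, gradAnalytic, gradNumeric are arbitrary), the analytic gradient can be compared against central finite differences:
f <- function(x, y) 4 * x^2 + y^2 + 2 * x * y - x - y
eps <- 1e-6
x0 <- 1; y0 <- 2                              ## an arbitrary test point
gradAnalytic <- c(8 * x0 + 2 * y0 - 1, 2 * x0 + 2 * y0 - 1)
gradNumeric <- c((f(x0 + eps, y0) - f(x0 - eps, y0)) / (2 * eps),
                 (f(x0, y0 + eps) - f(x0, y0 - eps)) / (2 * eps))
rbind(gradAnalytic, gradNumeric)              ## the two rows should agree up to rounding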
Gradient descent: choose an initial \(\beta^{(0)} \in \mathbb{R}^p\) and repeat: \[\beta^{(k)} = \beta^{(k - 1)} - t\times \nabla_\beta f(\beta^{(k - 1)}), \quad k = 1,2,3,\ldots\]
f <- function(x, y){   ## objective function
  4 * x^2 + y^2 + 2 * x * y - x - y
}
g <- function(x, y){   ## gradient function
  c(8*x + 2*y - 1, 2*y + 2*x - 1)
}

x <- y <- seq(-20, 20, len = 1000)
z <- outer(x, y, f)
# Contour plots
contour(x, y, z, xlim = c(-10, 10), ylim = c(-20, 20), nlevels = 10,
        levels = seq(1, 100, length.out = 10), main = "Contour Plot")

x <- x0 <- 8; x_old <- 0
y <- y0 <- -10; y_old <- 0
curXY <- c(x, y)
preXY <- c(x_old, y_old)
t <- 0.1; k <- 0; error <- 1e-6   ## step size, iteration counter, tolerance
trace <- list(curXY)
points(x0, y0, col = 1, pch = 1)

l2n <- function(avec){   ## L2 norm
  sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){   ## relative change between iterates
  deltaVec <- avec - bvec
  sumVec <- avec + bvec
  l2n(deltaVec)/l2n(sumVec)
}

while(diffChange(curXY, preXY) > error){
  k <- k + 1
  preXY <- curXY
  curXY <- preXY - t*g(preXY[1], preXY[2])   ## gradient descent update
  trace <- c(trace, list(curXY)) ## collecting results
  points(curXY[1], curXY[2], col = k, pch = 19)
  segments(curXY[1], curXY[2], preXY[1], preXY[2])
}
print(k)
## [1] 97
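As a check (not in the original notes), the final iterate can be compared with the analytic minimizer, obtained by setting the gradient to zero (8x + 2y = 1 and 2x + 2y = 1, i.e. (x, y) = (0, 0.5)):
curXY                                        ## final gradient descent iterate
solve(matrix(c(8, 2, 2, 2), 2, 2), c(1, 1))  ## analytic minimizer (0, 0.5)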
f <- function(x, y){   ## objective function
  4 * x^2 + y^2 + 2 * x * y - x - y
}
g <- function(x, y){   ## gradient function
  c(8*x + 2*y - 1, 2*y + 2*x - 1)
}

B <- 1000
plot(x, xlim = c(-B, B), ylim = c(-B, B))
#contour(x,y,z, xlim = c(-100,100), ylim = c(-200,200), nlevels = 10, levels = seq(1,10000,length.out = 10), main="Contour Plot")

x <- x0 <- 0.8; x_old <- 0
y <- y0 <- -0.10; y_old <- 0
curXY <- c(x, y)
preXY <- c(x_old, y_old)
t <- 0.8; k <- 0; error <- 1e-6   ## note the large step size t
trace <- list(curXY)
points(x0, y0, col = 1, pch = 1)

l2n <- function(avec){   ## L2 norm
  sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){   ## relative change between iterates
  deltaVec <- avec - bvec
  sumVec <- avec + bvec
  l2n(deltaVec)/l2n(sumVec)
}

## repeat the following chunk of code to see the divergence of gradient descent
k <- k + 1
preXY <- curXY
curXY <- preXY - t*g(preXY[1], preXY[2])   ## gradient descent update
points(curXY[1], curXY[2], col = k, pch = 19)
segments(curXY[1], curXY[2], preXY[1], preXY[2])
f0 <- function(x, y){   ## objective function of two arguments (for the contour plot)
  4 * x^2 + y^2 + 2 * x * y - x - y
}
f <- function(avec){    ## objective function of a vector argument
  x <- avec[1]
  y <- avec[2]
  4 * x^2 + y^2 + 2 * x * y - x - y
}
g <- function(avec){    ## gradient function
  x <- avec[1]
  y <- avec[2]
  c(8*x + 2*y - 1, 2*y + 2*x - 1)
}

x <- y <- seq(-20, 20, len = 1000)
z <- outer(x, y, f0)
# Contour plots
contour(x, y, z, xlim = c(-10, 10), ylim = c(-20, 20), nlevels = 10,
        levels = seq(1, 100, length.out = 10), main = "Contour Plot")

x <- x0 <- 8; x_old <- 0
y <- y0 <- -10; y_old <- 0
curXY <- c(x, y)
preXY <- c(x_old, y_old)
t0 <- 1; k <- 0; error <- 1e-6   ## initial step size, iteration counter, tolerance
alpha <- 1/3   ## backtracking sufficient-decrease parameter
beta <- 1/2    ## backtracking shrinkage factor
trace <- list(curXY)
points(x0, y0, col = 1, pch = 1)

l2n <- function(avec){   ## L2 norm
  sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){   ## relative change between iterates
  deltaVec <- avec - bvec
  sumVec <- avec + bvec
  l2n(deltaVec)/l2n(sumVec)
}

while(diffChange(curXY, preXY) > error){
  k <- k + 1
  preXY <- curXY
  ## backtracking line search: shrink t until the sufficient-decrease condition holds
  t <- t0
  while(f(preXY - t*g(preXY)) > f(preXY) - alpha * t * l2n(g(preXY))^2){
    t <- t * beta
  }
  curXY <- preXY - t*g(preXY)   ## gradient descent update with the selected step size
  trace <- c(trace, list(curXY)) ## collecting results
  points(curXY[1], curXY[2], col = k, pch = 19)
  segments(curXY[1], curXY[2], preXY[1], preXY[2])
}
print(k)
## [1] 25
library(ElemStatLearn)
glm_binomial_logit <- glm(svi ~ lcavol, data = prostate, family = binomial())
summary(glm_binomial_logit)
##
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.79924 -0.48354 -0.21025 -0.04274 2.32135
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.0296 1.0429 -4.823 1.42e-06 ***
## lcavol 1.9798 0.4543 4.358 1.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 101.35 on 96 degrees of freedom
## Residual deviance: 64.14 on 95 degrees of freedom
## AIC: 68.14
##
## Number of Fisher Scoring iterations: 6
For unconstrained, smooth multivariate convex optimization \[\min f(\beta)\] where \(f\) is convex and twice differentiable, with \(\beta \in \mathbb{R}^p\) and \(f(\beta) \in \mathbb{R}\): for gradient descent, start with an initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until convergence \[\beta^{(k)} = \beta^{(k - 1)} - t_k \nabla f(\beta^{(k - 1)})\]
For Newton’s method, start with an initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until convergence \[\beta^{(k)} = \beta^{(k - 1)} - (\Delta f(\beta^{(k - 1)}))^{-1} \nabla f(\beta^{(k - 1)})\] where \(\Delta f(\beta^{(k - 1)}) \in \mathbb{R}^{p\times p}\) is the Hessian matrix of \(f\) at \(\beta^{(k - 1)}\).
f0 <- function(x, y){   ## objective function of two arguments (for the contour plot)
  4 * x^2 + y^2 + 2 * x * y - x - y
}
f <- function(avec){    ## objective function of a vector argument
  x <- avec[1]
  y <- avec[2]
  4 * x^2 + y^2 + 2 * x * y - x - y
}
g <- function(avec){    ## gradient function
  x <- avec[1]
  y <- avec[2]
  c(8*x + 2*y - 1, 2*y + 2*x - 1)
}
h <- function(avec){    ## Hessian function (constant, since f is quadratic)
  x <- avec[1]
  y <- avec[2]
  res <- matrix(c(8, 2, 2, 2), 2, 2)
  return(res)
}

x <- y <- seq(-20, 20, len = 1000)
z <- outer(x, y, f0)
# Contour plots
contour(x, y, z, xlim = c(-10, 10), ylim = c(-20, 20), nlevels = 10,
        levels = seq(1, 100, length.out = 10), main = "Contour Plot")

x <- x0 <- 8; x_old <- 0
y <- y0 <- -10; y_old <- 0
curXY <- c(x, y)
preXY <- c(x_old, y_old)
t0 <- 1; k <- 0; error <- 1e-6
trace <- list(curXY)
points(x0, y0, col = 1, pch = 1)

l2n <- function(avec){   ## L2 norm
  sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){   ## relative change between iterates
  deltaVec <- avec - bvec
  sumVec <- avec + bvec
  l2n(deltaVec)/l2n(sumVec)
}

while(diffChange(curXY, preXY) > error){
  k <- k + 1
  preXY <- curXY
  curXY <- preXY - solve(h(preXY)) %*% g(preXY)   ## Newton update
  trace <- c(trace, list(curXY)) ## collecting results
  points(curXY[1], curXY[2], col = k, pch = 19)
  segments(curXY[1], curXY[2], preXY[1], preXY[2])
}
k ## why is k only 2?
## [1] 2
Because \(f\) is quadratic, Newton’s quadratic approximation is exact: the first iteration jumps directly to the minimizer, and the second iteration produces no further change, so the loop stops.
\[f(x,y) = \exp(xy) + y^2 + x^4\]
What values of \(x\) and \(y\) minimize \(f(x,y)\)?
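One way to check an answer numerically (a sketch, not part of the original notes; the names fxy, gxy, the starting value c(1, -1), and the use of optim with BFGS are all illustrative choices) is to hand the analytic gradient to a general-purpose optimizer. Note that this \(f\) is not convex everywhere (its Hessian at the origin is indefinite), so the result depends on the starting value and should be checked against the first-order condition:
fxy <- function(avec){
  x <- avec[1]; y <- avec[2]
  exp(x * y) + y^2 + x^4
}
gxy <- function(avec){
  x <- avec[1]; y <- avec[2]
  c(y * exp(x * y) + 4 * x^3,   ## partial derivative with respect to x
    x * exp(x * y) + 2 * y)     ## partial derivative with respect to y
}
res <- optim(c(1, -1), fxy, gxy, method = "BFGS")
res$par        ## a candidate minimizer
gxy(res$par)   ## should be close to c(0, 0)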
Here is an example of Newton’s method for root finding, generated from the polynomial \(z^4 - 1\), so that the four roots of the function are at 1, -1, \(i\) and \(-i\).
Note: points of the same color belong to the region that leads to the same root. Reference: https://www.chiark.greenend.org.uk/~sgtatham/newton/
Note that pure Newton’s method uses step size \(t = 1\).
library(ElemStatLearn)
glm_binomial_logit <- glm(svi ~ lcavol, data = prostate, family = binomial())
summary(glm_binomial_logit)
##
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.79924 -0.48354 -0.21025 -0.04274 2.32135
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.0296 1.0429 -4.823 1.42e-06 ***
## lcavol 1.9798 0.4543 4.358 1.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 101.35 on 96 degrees of freedom
## Residual deviance: 64.14 on 95 degrees of freedom
## AIC: 68.14
##
## Number of Fisher Scoring iterations: 6
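To connect the glm output with Newton’s method, here is a minimal Newton-Raphson (iteratively reweighted least squares) sketch for the same svi ~ lcavol model. It is not the original notes’ code: it assumes the prostate data from ElemStatLearn is loaded as above, and the variable names X, yvec, betaHat, the starting value, and the tolerance are arbitrary choices.
X <- cbind(1, prostate$lcavol)              ## design matrix: intercept + lcavol
yvec <- prostate$svi
betaHat <- c(0, 0)                          ## starting value
for(k in 1:100){
  p <- as.vector(1 / (1 + exp(-X %*% betaHat)))   ## fitted probabilities
  W <- p * (1 - p)                                ## IRLS weights
  grad <- t(X) %*% (yvec - p)                     ## gradient of the log-likelihood
  hess <- -t(X) %*% (W * X)                       ## Hessian of the log-likelihood
  betaNew <- betaHat - solve(hess) %*% grad       ## Newton update
  if(sqrt(sum((betaNew - betaHat)^2)) < 1e-8) break
  betaHat <- betaNew
}
betaHat   ## should be close to coef(glm_binomial_logit), about (-5.03, 1.98)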
\[ \min_{\beta \in \mathbb{R}^p} f(\beta) = \min_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y - X\beta\|_2^2\] where \(y \in \mathbb{R}^n\) and \(X \in \mathbb{R}^{n \times p}\) with columns \(X_1\) (intercept), \(X_2\), \(\ldots\), \(X_p\).
Minimizing over \(\beta_j\) while fixing all \(\beta_i\), \(i \ne j\): \[0 = \frac{\partial f(\beta)}{\partial \beta_j} = X_j^\top (X\beta - y) = X_j^\top (X_j\beta_j + X_{-j}\beta_{-j} - y)\]
Solving for \(\beta_j\) gives the coordinate descent update:
\[\beta_j = \frac{X_j^\top (y - X_{-j}\beta_{-j})}{X_j^\top X_j}\]
library(ElemStatLearn)
y <- prostate$lcavol
x0 <- prostate[, -match(c("lcavol", "train"), colnames(prostate))]
x <- cbind(1, as.matrix(x0))   ## design matrix with an intercept column

beta <- rep(0, ncol(x))        ## initial value
beta_old <- rnorm(ncol(x))
error <- 1e-6
k <- 0
## diffChange() is the relative-change function defined earlier
while(diffChange(beta, beta_old) > error){
  k <- k + 1
  beta_old <- beta
  for(j in 1:length(beta)){
    xj <- x[, j]                      ## j-th column
    xj_else <- x[, -j]                ## all other columns
    beta_j_else <- as.matrix(beta[-j])
    beta[j] <- t(xj) %*% (y - xj_else %*% beta_j_else) / sum(xj^2)   ## coordinate update
  }
}
print(beta)
## [1] -2.419610178 -0.025954949 0.022114840 -0.093134994 -0.152983715
## [6] 0.367195579 0.197095411 -0.007114941 0.565822653
## compare with lm
lm(lcavol ~ . - train, data=prostate)
##
## Call:
## lm(formula = lcavol ~ . - train, data = prostate)
##
## Coefficients:
## (Intercept) lweight age lbph svi
## -2.420613 -0.025925 0.022113 -0.093138 -0.152966
## lcp gleason pgg45 lpsa
## 0.367180 0.197249 -0.007117 0.565826
\[\arg \min_\beta f(\beta) = \arg \min_\beta \frac{1}{2}(y - \beta)^2 + \lambda |\beta|\]
Setting the derivative (the subgradient at \(\beta = 0\)) to zero, \[0 = \frac{\partial f(\beta)}{\partial \beta} = \beta - y + \lambda \frac{\partial |\beta|}{\partial \beta},\] gives the solution \[\beta = S_\lambda(y) = \begin{cases} y - \lambda \text{ if } y > \lambda\\ y + \lambda \text{ if } y < - \lambda\\ 0 \text{ if } -\lambda \le y \le \lambda\\ \end{cases}\]
\(S_\lambda(y)\) is called the soft thresholding operator.
The hard thresholding operator is defined as \[H_\lambda(y) = \begin{cases} y \text{ if } y > \lambda\\ y \text{ if } y < - \lambda\\ 0 \text{ if } -\lambda \le y \le \lambda\\ \end{cases}\]
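A small sketch of the two operators in R (not from the original notes; the function names softThreshold and hardThreshold and the grid yGrid are illustrative):
softThreshold <- function(y, lambda){
  sign(y) * pmax(abs(y) - lambda, 0)   ## y - lambda, y + lambda, or 0
}
hardThreshold <- function(y, lambda){
  y * (abs(y) > lambda)                ## keep y if |y| > lambda, otherwise 0
}
yGrid <- seq(-3, 3, by = 0.5)
cbind(y = yGrid, soft = softThreshold(yGrid, 1), hard = hardThreshold(yGrid, 1))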
Given the general problem \[\min_x f(x) \quad \text{subject to } h_i(x) \le 0, \; i = 1,\ldots,m; \quad l_j(x) = 0, \; j = 1,\ldots,r,\] with Lagrangian
\[L(x) = f(x) + \sum_{i=1}^m u_i h_i(x) + \sum_{j=1}^r v_j l_j(x), \]
the Karush-Kuhn-Tucker (KKT) conditions are: stationarity, \(0 \in \partial f(x) + \sum_{i=1}^m u_i \partial h_i(x) + \sum_{j=1}^r v_j \partial l_j(x)\); complementary slackness, \(u_i h_i(x) = 0\) for all \(i\); primal feasibility, \(h_i(x) \le 0\) and \(l_j(x) = 0\) for all \(i, j\); and dual feasibility, \(u_i \ge 0\) for all \(i\).
For the example \[\min_x -\sum_{i=1}^n \log(\alpha_i + x_i) \quad \text{subject to } x \ge 0, \; 1^\top x = 1,\] the KKT conditions give \(x_i = \max \{0, 1/v - \alpha_i \}\), where \(v\) is the multiplier of the equality constraint.
We also need \(x\) to be feasible, \(1^\top x = 1\), so \[\sum_{i=1}^n \max\{0, 1/v - \alpha_i\} = 1,\] which is a univariate equation in \(v\) and can be solved by binary search.
n <- 10
alpha <- runif(n)
x0 <- runif(n)
x1 <- x0/sum(x0)    ## feasible starting value on the simplex

f <- function(x){
  -sum(log(alpha + x))
}
grad <- function(x){
  -1/(alpha + x)
}

## constraints in the form ui %*% x >= ci
ui1 <- diag(n)      ## x_i >= 0
ui2 <- rep(1, n)    ## sum(x) >= 1 - epsilon
ui3 <- rep(-1, n)   ## sum(x) <= 1 + epsilon
ui <- rbind(ui1, ui2, ui3)
ci1 <- rep(0, n)
ci2 <- 1 - 1e-12
ci3 <- -1 - 1e-12
ci <- c(ci1, ci2, ci3)

constrOptim(x1, f, grad, ui, ci)
## $par
## [1] 0.03854230 0.05224637 0.19736554 0.08935701 0.07468487 0.03603157
## [7] 0.22017346 0.02620023 0.17276648 0.09263218
##
## $value
## [1] 3.937191
##
## $counts
## [1] 0
##
## $convergence
## [1] 0
##
## $message
## NULL
##
## $outer.iterations
## [1] 1
##
## $barrier.value
## [1] 0.0003081647
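For comparison, here is a sketch (not from the original notes) of solving the same problem via the KKT condition above: binary search for \(v\) such that \(\sum_i \max\{0, 1/v - \alpha_i\} = 1\). It assumes alpha is still in the workspace from the chunk above; the names lhs, xKKT and the bracket endpoints are arbitrary choices.
lhs <- function(v) sum(pmax(0, 1/v - alpha)) - 1   ## decreasing in v
lo <- 1e-8; hi <- 1e8                              ## bracket: lhs(lo) > 0 > lhs(hi)
while(hi - lo > 1e-10){
  mid <- (lo + hi) / 2
  if(lhs(mid) > 0) lo <- mid else hi <- mid
}
v <- (lo + hi) / 2
xKKT <- pmax(0, 1/v - alpha)   ## x_i = max{0, 1/v - alpha_i}
xKKT                           ## should be close to the $par component above
sum(xKKT)                      ## should be (numerically) 1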
knitr::purl("multiOptimization.rmd", output = "multiOptimization.R", documentation = 2)
##
## processing file: multiOptimization.rmd
## output file: multiOptimization.R
## [1] "multiOptimization.R "