Zhiguang Huo (Caleb)
Monday Nov 27, 2017
Gradient decent: choose initial \(\beta^{(0)} \in \mathbb{R}\), repeat: \[\beta^{(k)} = \beta^{(k - 1)} - t\times f'(\beta^{(k - 1)}), k = 1,2,3,\ldots\]
\[f(y) \approx f(\beta^{(k - 1)}) + f'(\beta^{(k - 1)}) \times (y - \beta^{(k - 1)}) + \frac{1}{2t} (y - \beta^{(k - 1)})^2\]
Minimizing the quadratic approximation. \[\beta^{(k)} = \arg_y \min f(y)\]
Set \(f'(\beta^{(k)}) = 0 \Leftrightarrow f'(\beta^{(k - 1)}) + \frac{1}{t} (\beta^{(k)} - \beta^{(k - 1)}) = 0 \Leftrightarrow \beta^{(k)} = \beta^{(k - 1)} - t\times f'(\beta^{(k - 1)})\)
This is exactly the same as the gradient decent procedure.
For unconstrained, smooth univariate convex optimization \[\min f(\beta)\] where \(f\) is convex, twice differentiable, \(\beta \in \mathbb{R}\), \(f \in \mathbb{R}\). For gradient descent, start with initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until converge \[\beta^{(k)} = \beta^{(k - 1)} - t_k f'(\beta^{(k - 1)})\]
For Newton’s method, start with initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until converge \[\beta^{(k)} = \beta^{(k - 1)} - (f''(\beta^{(k - 1)}))^{-1} f'(\beta^{(k - 1)})\] Where \(f''(\beta^{(k - 1)})\) is the second derivative of \(f\) at \(\beta^{(k - 1)}\). It is also referred as Hessian matrix for higher dimension (e.g. \(\beta \in \mathbb{R}^p\))
For gradient decent step at \(\beta\), we minimize the quadratic approximation \[f(y) \approx f(\beta) + f'(\beta)(y - \beta) + \frac{1}{2t}(y - \beta)^2\] over \(y\), which yield the update \(\beta^{(k)} = \beta^{(k - 1)} - tf'(\beta^{(k-1)})\)
Newton’s method uses a better quadratic approximation: \[f(y) \approx f(\beta) + f'(\beta)(y - \beta) + \frac{1}{2}f''(\beta)(y - \beta)^2\] minimizing over \(y\) yield \(\beta^{(k)} = \beta^{(k - 1)} - (f''(\beta^{(k-1)}))^{-1}f'(\beta^{(k-1)})\)
What if \(\beta\) is not a scalar but instead a vector (i.e. \(\beta \in \mathbb{R}^p\))?
How to calculate the gradient (derivative) for multivariate case?
How to calculate the Hessian matrix for multivariate case?
Then \[\nabla_\beta f(\beta) = \frac{\partial f(\beta)}{\partial \beta} = (\frac{\partial f(\beta)}{\partial \beta_1}, \frac{\partial f(\beta)}{\partial \beta_2}, \ldots, \frac{\partial f(\beta)}{\partial \beta_p})^\top \in \mathbb{R}^p\]
f <- function(x,y){
  4 * x^2 + y^2 + 2 * x * y - x - y 
}
x <- y <- seq(-20, 20, len=1000)
z <- outer(x, y, f)
# Contour plots
contour(x,y,z, xlim = c(-10,10), ylim = c(-20,20), nlevels = 10, levels = seq(1,100,length.out = 10), main="Contour Plot")Then \[\nabla_\beta f(\beta) = \frac{\partial f(\beta)}{\partial \beta} = (\frac{\partial f(\beta)}{\partial \beta_1}, \frac{\partial f(\beta)}{\partial \beta_2}, \ldots, \frac{\partial f(\beta)}{\partial \beta_p})^\top \in \mathbb{R}^p\]
\[\Delta_\beta f(\beta) = \nabla^2_\beta f(\beta) = \nabla_\beta\frac{\partial f(\beta)}{\partial \beta} = (\nabla_\beta \frac{\partial f(\beta)}{\partial \beta_1}, \nabla_\beta \frac{\partial f(\beta)}{\partial \beta_2}, \ldots, \nabla_\beta \frac{\partial f(\beta)}{\partial \beta_p})^\top \]
\[\Delta_\beta f(\beta) = \begin{pmatrix} \frac{\partial^2 f(\beta)}{\partial \beta_1^2} & \frac{\partial^2 f(\beta)}{\partial \beta_1 \partial \beta_2} & \ldots & \frac{\partial^2 f(\beta)}{\partial \beta_1 \partial\beta_p}\\ \frac{\partial^2 f(\beta)}{\partial \beta_2 \partial \beta_1 } & \frac{\partial^2 f(\beta)}{\partial \beta_2^2} & \ldots & \frac{\partial^2 f(\beta)}{\partial \beta_2 \partial \beta_p} \\ \ldots &\ldots &\ldots &\ldots\\ \frac{\partial^2 f(\beta)}{\partial \beta_p \partial \beta_1} & \frac{\partial^2 f(\beta)}{\partial \beta_p \partial \beta_2} & \ldots & \frac{\partial^2 f(\beta)}{\partial\beta_p^2} \end{pmatrix} \]
\[f(x,y) = 4x^2 + y^2 + 2xy - x - y\]
Gradient
\[\nabla f = (\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y})^\top = \binom{8x + 2y - 1}{2y +2x - 1} \in \mathbb{R}^2\]
\[\Delta f = \begin{pmatrix} 8 & 2 \\ 2 & 2 \end{pmatrix} \in \mathbb{R}^{2\times 2}\]
Gradient decent: choose initial \(\beta^{(0)} \in \mathbb{R}^p\), repeat: \[\beta^{(k)} = \beta^{(k - 1)} - t\times \nabla_\beta f(\beta^{(k - 1)}), k = 1,2,3,\ldots\]
f <- function(x,y){
  4 * x^2 + y^2 + 2 * x * y - x - y 
}
g <- function(x,y){
  c(8*x + 2*y - 1, 2*y +2*x - 1)
}
x <- y <- seq(-20, 20, len=1000)
z <- outer(x, y, f)
# Contour plots
contour(x,y,z, xlim = c(-10,10), ylim = c(-20,20), nlevels = 10, levels = seq(1,100,length.out = 10), main="Contour Plot")
x <- x0 <- 8; x_old <- 0; 
y <- y0 <- -10; y_old <- 0; 
curXY <- c(x,y)
preXY <- c(x_old, y_old)
t <- 0.1; k <- 0; error <- 1e-6
trace <- list(curXY)
points(x0,y0, col=1, pch=1)
l2n <- function(avec){
  sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){
  deltaVec <- avec - bvec
  sumVec <- avec + bvec
  l2n(deltaVec)/l2n(sumVec)
}
while(diffChange(curXY, preXY) > error){
  k <- k + 1
  preXY <- curXY
  curXY <- preXY - t*g(preXY[1], preXY[2])
  trace <- c(trace, list(curXY)) ## collecting results
  points(curXY[1], curXY[2], col=k, pch=19)
  segments(curXY[1], curXY[2], preXY[1], preXY[2])
}print(k)## [1] 97f <- function(x,y){
  4 * x^2 + y^2 + 2 * x * y - x - y 
}
g <- function(x,y){
  c(8*x + 2*y - 1, 2*y +2*x - 1)
}
B <- 1000
plot(x,xlim=c(-B,B), ylim=c(-B,B))
#contour(x,y,z, xlim = c(-100,100), ylim = c(-200,200), nlevels = 10, levels = seq(1,10000,length.out = 10), main="Contour Plot")
x <- x0 <- 0.8; x_old <- 0; 
y <- y0 <- -0.10; y_old <- 0; 
curXY <- c(x,y)
preXY <- c(x_old, y_old)
t <- 0.8; k <- 0; error <- 1e-6
trace <- list(curXY)
points(x0,y0, col=1, pch=1)
l2n <- function(avec){
  sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){
  deltaVec <- avec - bvec
  sumVec <- avec + bvec
  l2n(deltaVec)/l2n(sumVec)
}
## repeat the following chunk of code to see divergence of gradient decent
k <- k + 1
preXY <- curXY
curXY <- preXY - t*g(preXY[1], preXY[2])
points(curXY[1], curXY[2], col=k, pch=19)
segments(curXY[1], curXY[2], preXY[1], preXY[2])f0 <- function(x,y){
  4 * x^2 + y^2 + 2 * x * y - x - y 
}
f <- function(avec){
  x <- avec[1]
  y <- avec[2]
  4 * x^2 + y^2 + 2 * x * y - x - y 
}
g <- function(avec){
  x <- avec[1]
  y <- avec[2]  
  c(8*x + 2*y - 1, 2*y +2*x - 1)
}
x <- y <- seq(-20, 20, len=1000)
z <- outer(x, y, f0)
# Contour plots
contour(x,y,z, xlim = c(-10,10), ylim = c(-20,20), nlevels = 10, levels = seq(1,100,length.out = 10), main="Contour Plot")
x <- x0 <- 8; x_old <- 0; 
y <- y0 <- -10; y_old <- 0; 
curXY <- c(x,y)
preXY <- c(x_old, y_old)
t0 <- 1; k <- 0; error <- 1e-6
alpha = 1/3
beta = 1/2
trace <- list(curXY)
points(x0,y0, col=1, pch=1)
l2n <- function(avec){
  sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){
  deltaVec <- avec - bvec
  sumVec <- avec + bvec
  l2n(deltaVec)/l2n(sumVec)
}
while(diffChange(curXY, preXY) > error){
  k <- k + 1
  preXY <- curXY
  
  ## backtracking
  t <- t0
  while(f(preXY - t*g(preXY)) > f(preXY) - alpha * t * l2n(g(preXY))^2){
    t <- t * beta
  }
  
  curXY <- preXY - t*g(preXY)
  trace <- c(trace, list(curXY)) ## collecting results
  points(curXY[1], curXY[2], col=k, pch=19)
  segments(curXY[1], curXY[2], preXY[1], preXY[2])
  
}print(k)## [1] 25library(ElemStatLearn)
glm_binomial_logit <- glm(svi ~ lcavol, data = prostate,  family = binomial())
summary(glm_binomial_logit)## 
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79924  -0.48354  -0.21025  -0.04274   2.32135  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.0296     1.0429  -4.823 1.42e-06 ***
## lcavol        1.9798     0.4543   4.358 1.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 101.35  on 96  degrees of freedom
## Residual deviance:  64.14  on 95  degrees of freedom
## AIC: 68.14
## 
## Number of Fisher Scoring iterations: 6For unconstrained, smooth univariate convex optimization \[\min f(\beta)\] where \(f\) is convex, twice differentiable, \(\beta \in \mathbb{R}^p\), \(f \in \mathbb{R}\). For gradient descent, start with initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until converge \[\beta^{(k)} = \beta^{(k - 1)} - t_k \nabla f(\beta^{(k - 1)})\]
For Newton’s method, start with initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until converge \[\beta^{(k)} = \beta^{(k - 1)} - (\Delta f(\beta^{(k - 1)}))^{-1} \nabla f(\beta^{(k - 1)})\] Where \(\Delta f(\beta^{(k - 1)}) \in \mathbb{R}^{p\times p}\) is the Hessian matrix of \(f\) at \(\beta^{(k - 1)}\).
f0 <- function(x,y){
  4 * x^2 + y^2 + 2 * x * y - x - y 
}
f <- function(avec){
  x <- avec[1]
  y <- avec[2]
  4 * x^2 + y^2 + 2 * x * y - x - y 
}
g <- function(avec){
  x <- avec[1]
  y <- avec[2]  
  c(8*x + 2*y - 1, 2*y +2*x - 1)
}
h <- function(avec){
  x <- avec[1]
  y <- avec[2]  
  res <- matrix(c(8,2,2,2),2,2)  ## Hessian function
  return(res)  
} 
x <- y <- seq(-20, 20, len=1000)
z <- outer(x, y, f0)
# Contour plots
contour(x,y,z, xlim = c(-10,10), ylim = c(-20,20), nlevels = 10, levels = seq(1,100,length.out = 10), main="Contour Plot")
x <- x0 <- 8; x_old <- 0; 
y <- y0 <- -10; y_old <- 0; 
curXY <- c(x,y)
preXY <- c(x_old, y_old)
t0 <- 1; k <- 0; error <- 1e-6
trace <- list(curXY)
points(x0,y0, col=1, pch=1)
l2n <- function(avec){
  sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){
  deltaVec <- avec - bvec
  sumVec <- avec + bvec
  l2n(deltaVec)/l2n(sumVec)
}
while(diffChange(curXY, preXY) > error){
  k <- k + 1
  preXY <- curXY
  
  curXY <- preXY - solve(h(curXY)) %*% g(preXY)
  trace <- c(trace, list(curXY)) ## collecting results
  points(curXY[1], curXY[2], col=k, pch=19)
  segments(curXY[1], curXY[2], preXY[1], preXY[2])
}k ## why k is only 2?## [1] 2\[f(x,y) = \exp(xy) + y^2 + x^4\]
What are the x and y such that \(f(x,y)\) is minimized?
Here’s an example of Newton’s method for root finding, generated from the polynomial \(z^4 - 1\), so that the four roots of the function are at 1, -1, i and -i.
 
Note: same color denotes the region where will lead to the same root. Reference: https://www.chiark.greenend.org.uk/~sgtatham/newton/
Note that the pure Newton’s method uses \(t = 1\)
library(ElemStatLearn)
glm_binomial_logit <- glm(svi ~ lcavol, data = prostate,  family = binomial())
summary(glm_binomial_logit)## 
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79924  -0.48354  -0.21025  -0.04274   2.32135  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.0296     1.0429  -4.823 1.42e-06 ***
## lcavol        1.9798     0.4543   4.358 1.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 101.35  on 96  degrees of freedom
## Residual deviance:  64.14  on 95  degrees of freedom
## AIC: 68.14
## 
## Number of Fisher Scoring iterations: 6\[ \min_{\beta \in \mathbb{R}^p} f(\beta) = \min_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y - X\beta\|_2^2\] where \(y \in \mathbb{R}^n\) and \(X \in \mathbb{R}^{n \times p}\) with columns \(X_1\) (intercept), \(X_2\), \(\ldots\), \(X_p\).
Minimizing over \(\beta_j\) while fixing all \(\beta_i\), \(i \ne j\): \[0 = \frac{\partial f(\beta)}{\partial \beta_j} = X_j^\top (X\beta - y) = X_j^\top (X_j\beta_j + X_{-j}\beta_{-j} - y)\]
We can solve for the coordinate descent updating rule:
\[\beta_j = \frac{X_j^\top (y - X_{-j}\beta_{-j})}{X_j^\top X_j}\]
library(ElemStatLearn)
y <- prostate$lcavol
x0 <- prostate[,-match(c("lcavol", "train"), colnames(prostate))]
x <- cbind(1, as.matrix(x0))
beta <- rep(0,ncol(x))
beta_old <- rnorm(ncol(x))
error <- 1e-6
k <- 0
while(diffChange(beta, beta_old) > error){
  k <- k + 1
  beta_old <- beta
  for(j in 1:length(beta)){
    xj <- x[,j]
    xj_else <- x[,-j]
    beta_j_else <- as.matrix(beta[-j])
    beta[j] <- t(xj) %*% (y - xj_else %*% beta_j_else) / sum(xj^2)
  }
}
print(beta)## [1] -2.419610178 -0.025954949  0.022114840 -0.093134994 -0.152983715
## [6]  0.367195579  0.197095411 -0.007114941  0.565822653## compare with lm
lm(lcavol ~ . - train, data=prostate)## 
## Call:
## lm(formula = lcavol ~ . - train, data = prostate)
## 
## Coefficients:
## (Intercept)      lweight          age         lbph          svi  
##   -2.420613    -0.025925     0.022113    -0.093138    -0.152966  
##         lcp      gleason        pgg45         lpsa  
##    0.367180     0.197249    -0.007117     0.565826\[\arg \min_\beta f(\beta) = \arg \min_\beta \frac{1}{2}(y - \beta)^2 + \lambda |\beta|\]
\[0 = \frac{\partial f(\beta)}{\partial \beta} = \beta - y + \lambda \frac{\partial |\beta|}{\partial \beta}\]
\[\beta = S_\lambda(y) = \begin{cases} y - \lambda \text{ if } y > \lambda\\ y + \lambda \text{ if } y < - \lambda\\ 0 \text{ if } -\lambda \le y \le \lambda\\ \end{cases}\]
\(S_\lambda(y)\) is called soft thresholding operator \[S_\lambda(y) = \begin{cases} y - \lambda \text{ if } y > \lambda\\ y + \lambda \text{ if } y < - \lambda\\ 0 \text{ if } -\lambda \le y \le \lambda\\ \end{cases}\]
Hard thresholding operator is defined as \[H_\lambda(y) = \begin{cases} y \text{ if } y > \lambda\\ y \text{ if } y < - \lambda\\ 0 \text{ if } -\lambda \le y \le \lambda\\ \end{cases}\]
Given general problem
\[L(x) = f(x) + \sum_{i=1}^m u_i h_i(x) + \sum_{j=1}^r v_j l_j(x) \]
The Karush-Kuhn-Tucher conditions or KKT conditions are:
\(x = \max \{0, 1/v - \alpha_i \}\)
We need \(x\) to be feasible, \(1^\top x = 1\), so \[\sum_{i=1}^n \max\{0, 1/v - \alpha_i\} = 1\] Which is univariate equation. Can solve \(v\) using binary search.
n <- 10
alpha <- runif(n)
x0 <- runif(n)
x1 <- x0/sum(x0)
f <- function(x){
  -sum(log(alpha + x))
}
grad <- function(x){
  -1/(alpha + x)
}
ui1 <- diag(n)
ui2 <- rep(1,n)
ui3 <- rep(-1,n)
ui <- rbind(ui1, ui2, ui3)
ci1 <- rep(0,n)
ci2 <- 1 - 1e-12
ci3 <- -1 - 1e-12
ci <- c(ci1,ci2,ci3)
constrOptim(x1, f, grad, ui, ci)## $par
##  [1] 0.03854230 0.05224637 0.19736554 0.08935701 0.07468487 0.03603157
##  [7] 0.22017346 0.02620023 0.17276648 0.09263218
## 
## $value
## [1] 3.937191
## 
## $counts
## [1] 0
## 
## $convergence
## [1] 0
## 
## $message
## NULL
## 
## $outer.iterations
## [1] 1
## 
## $barrier.value
## [1] 0.0003081647knitr::purl("multiOptimization.rmd", output = "multiOptimization.R ", documentation = 2)## 
## 
## processing file: multiOptimization.rmd## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |...                                                              |   5%
  |                                                                       
  |......                                                           |  10%
  |                                                                       
  |..........                                                       |  15%
  |                                                                       
  |.............                                                    |  20%
  |                                                                       
  |................                                                 |  25%
  |                                                                       
  |....................                                             |  30%
  |                                                                       
  |.......................                                          |  35%
  |                                                                       
  |..........................                                       |  40%
  |                                                                       
  |.............................                                    |  45%
  |                                                                       
  |................................                                 |  50%
  |                                                                       
  |....................................                             |  55%
  |                                                                       
  |.......................................                          |  60%
  |                                                                       
  |..........................................                       |  65%
  |                                                                       
  |..............................................                   |  70%
  |                                                                       
  |.................................................                |  75%
  |                                                                       
  |....................................................             |  80%
  |                                                                       
  |.......................................................          |  85%
  |                                                                       
  |..........................................................       |  90%
  |                                                                       
  |..............................................................   |  95%
  |                                                                       
  |.................................................................| 100%## output file: multiOptimization.R## [1] "multiOptimization.R "