Optimization

In mathematics, “optimization” or “mathematical programming” refers to the selection of a best element (with regard to some criterion) from some set of available alternatives.
A typical optimization problem usually consists of maximizing or minimizing a real function (objective function) by systematically choosing input values under certain constraint.
“Convex programming” studies the case when the objective function is convex (minimization) or concave (maximization) and the constraint set is convex.

Convex function

Convex function properties:

Second derivative always greater than 0
Tangent lines always underestimate the function
\(\forall t \in (0,1)\), \(tf(a) + (1 - t)f(b) > f(ta + (1 - t)b)\)

Convex optimization always leads to a global minimizer
Non-convex optimization usually leads to a local minimizer

Optimization functions in R

optimize(): One dimensional optimization, no gradient or Hessian
optim(): General purpose optimization, five possible methods, gradient optional
constrOptim(): Minimization of a function subject to linear inequality constraints, gradient optional
nlm(): Non-linear minimization, can optionally include the gradient and hessian of the function as attributes of the objective function
nlminb(): Minimization using PORT routines, can optionally include the gradient and Hessian of the objective function as additional arguments

Univariate optimization example

Suppose our objective function is \[f(x) = e^x + x^4\] What is the minimum value of \(f(x)\), what is the correponding \(x\)?

f <- function(x) exp(x) + x^4
curve(f, from = -1, to = 1)

Use R optimize() function to perform the optimization

f <- function(x) exp(x) + x^4
xmin <- optimize(f, interval = c(-10, 10))
xmin

## $minimum
## [1] -0.5282468
## 
## $objective
## [1] 0.6675038

curve(f, from = -1, to = 1)
points(xmin$minimum, xmin$objective, col="red", pch=19)

Gradient decent

Consider an unconstrained minimization of \(f\), we want to find \(x^*\) such that \[f(x^*) = \min_{x\in \mathbb{R}} f(x)\]
Another way to formulate the problem is to find \(x^*\) such that \[x^* = \arg_{x\in \mathbb{R}} \min f(x)\]

Gradient decent: choose initial \(x^{(0)} \in \mathbb{R}\), repeat: \[x^{(k)} = x^{(k - 1)} - t\times f'(x^{(k - 1)}), k = 1,2,3,\ldots\]

Until \(|x^{(k)} - x^{(k - 1)}| < \varepsilon\), (e.g. \(\varepsilon = 10^{-6}\)).
\(f'(x^{(k - 1)})\) is the gradient (derivative of \(f\)) evaluated at \(x^{(k - 1)}\).
\(t\) is a step size for gradient decent procedure.

Interpretation

At each iteration \(k\), consider Taylor expansion at \(x^{(k - 1)}\): \[f(y) = f(x^{(k - 1)}) + f'(x^{(k - 1)}) \times (y - x^{(k - 1)}) + \frac{1}{2} f''(x^{(k - 1)}) (y - x^{(k - 1)})^2 + \ldots\]
Quadratic approximation:
- Replacing \(f''(x^{(k - 1)})\) with \(\frac{1}{t}\)
- Ingore higher order terms

\[f(y) \approx f(x^{(k - 1)}) + f'(x^{(k - 1)}) \times (y - x^{(k - 1)}) + \frac{1}{2t} (y - x^{(k - 1)})^2\]

Minimizing the quadratic approximation. \[x^{(k)} = \arg_y \min f(y)\]
- Set \(f'(x^{(k)}) = 0 \Leftrightarrow f'(x^{(k - 1)}) + \frac{1}{t} (x^{(k)} - x^{(k - 1)}) = 0 \Leftrightarrow x^{(k)} = x^{(k - 1)} - t\times f'(x^{(k - 1)})\)
- This is exactly the same as the gradient decent procedure.

Visual interpretation

iterate 1: \(x^{(1)} = x^{(0)} - t\times f'(x^{(0)})\)
- quadratic approximation at \(x_0\),
- minimizer of the quadratic approximation is \(x_1\)
iterate 2: \(x^{(2)} = x^{(1)} - t\times f'(x^{(1)})\)
- quadratic approximation at \(x_1\),
- minimizer of the quadratic approximation is \(x_2\)
…

Gradient decent on our motivating example

f <- function(x) exp(x) + x^4 ## original function
g <- function(x) exp(x) + 4 * x^3 ## gradient function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function

x <- x0 <- 0.8; x_old <- 0; t <- 0.1; k <- 0; error <- 1e-6
trace <- x
points(x0, f(x0), col=1, pch=1)

while(abs(x - x_old) > error){ ## there is a scale issue. 
  ## E.g., if x = 1e-7, x_old = 1e-6, relative change is large, absolute change is small
  ## We can monitor abs(x - x_old) / (abs(x) + abs(x_old))
  k <- k + 1
  x_old <- x
  x <- x_old - t*g(x_old)
  trace <- c(trace, x) ## collecting results
  points(x, f(x), col=k, pch=19)
}

trace_grad <- trace
print(trace)

##  [1]  0.80000000  0.37264591  0.20678990  0.08028037 -0.02828567 -0.12548768
##  [7] -0.21290391 -0.28986708 -0.35496119 -0.40719158 -0.44673749 -0.47504573
## [13] -0.49435026 -0.50702281 -0.51511484 -0.52018513 -0.52332289 -0.52524940
## [19] -0.52642641 -0.52714331 -0.52757914 -0.52784381 -0.52800441 -0.52810183
## [25] -0.52816091 -0.52819673 -0.52821844 -0.52823161 -0.52823959 -0.52824443
## [31] -0.52824736 -0.52824914 -0.52825021 -0.52825087

A non-convex example

If the objective function has multiple local minimums.
Gradient will still work, but may fall into a local minimum (might be global minimum).
kmeans algorithm is a non-convex minimization problem.

Discussion, will gradient decent algorithm always converge?

No, see the example

Minimize \(f(x) = x^2\)
Initial point \(x_0 = 1\)
stepsize \(t = 1.1\)

f <- function(x) x^2 ## original function
g <- function(x) 2*x ## gradient function
curve(f, from = -10, to = 10, lwd=2) ## visualize the objective function

x <- x0 <- 1; x_old <- 0; t <- 1.1; k <- 0; error <- 1e-6
trace <- x
points(x0, f(x0), col=1, pch=1)

while(abs(x - x_old) > error & k < 30){ ## there is a scale issue. 
  ## E.g., if x = 1e-7, x_old = 1e-6, relative change is large, absolute change is small
  ## We can monitor abs(x - x_old) / (abs(x) + abs(x_old))
  k <- k + 1
  x_old <- x
  x <- x_old - t*g(x_old)
  trace <- c(trace, x) ## collecting results
  points(x, f(x), col=k, pch=19)
  segments(x0=x, y0 = f(x), x1 = x_old, y1 = f(x_old))
}

print(trace)

##  [1]    1.000000   -1.200000    1.440000   -1.728000    2.073600   -2.488320
##  [7]    2.985984   -3.583181    4.299817   -5.159780    6.191736   -7.430084
## [13]    8.916100  -10.699321   12.839185  -15.407022   18.488426  -22.186111
## [19]   26.623333  -31.948000   38.337600  -46.005120   55.206144  -66.247373
## [25]   79.496847  -95.396217  114.475460 -137.370552  164.844662 -197.813595
## [31]  237.376314

Change to another stepsize

Minimize \(f(x) = x^2\)
Initial point \(x_0 = 1\)
stepsize \(t = 0.2\)

f <- function(x) x^2 ## original function
g <- function(x) 2*x ## gradient function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function

x <- x0 <- 1; x_old <- 0; t <- 0.2; k <- 0; error <- 1e-6
trace <- x
points(x0, f(x0), col=1, pch=1)

while(abs(x - x_old) > error & k < 30){
  k <- k + 1
  x_old <- x
  x <- x_old - t*g(x_old)
  trace <- c(trace, x) ## collecting results
  points(x, f(x), col=k, pch=19)
  segments(x0=x, y0 = f(x), x1 = x_old, y1 = f(x_old))
}

print(trace)

##  [1] 1.000000e+00 6.000000e-01 3.600000e-01 2.160000e-01 1.296000e-01
##  [6] 7.776000e-02 4.665600e-02 2.799360e-02 1.679616e-02 1.007770e-02
## [11] 6.046618e-03 3.627971e-03 2.176782e-03 1.306069e-03 7.836416e-04
## [16] 4.701850e-04 2.821110e-04 1.692666e-04 1.015600e-04 6.093597e-05
## [21] 3.656158e-05 2.193695e-05 1.316217e-05 7.897302e-06 4.738381e-06
## [26] 2.843029e-06 1.705817e-06 1.023490e-06

Change to another stepsize

Minimize \(f(x) = x^2\)
Initial point \(x_0 = 1\)
stepsize \(t = 0.01\)

f <- function(x) x^2 ## original function
g <- function(x) 2*x ## gradient function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function

x <- x0 <- 1; x_old <- 0; t <- 0.01; k <- 0; error <- 1e-6
trace <- x
points(x0, f(x0), col=1, pch=1)

while(abs(x - x_old) > error & k < 30){
  k <- k + 1
  x_old <- x
  x <- x_old - t*g(x_old)
  trace <- c(trace, x) ## collecting results
  points(x, f(x), col=k, pch=19)
  segments(x0=x, y0 = f(x), x1 = x_old, y1 = f(x_old))
}

print(trace)

##  [1] 1.0000000 0.9800000 0.9604000 0.9411920 0.9223682 0.9039208 0.8858424
##  [8] 0.8681255 0.8507630 0.8337478 0.8170728 0.8007314 0.7847167 0.7690224
## [15] 0.7536419 0.7385691 0.7237977 0.7093218 0.6951353 0.6812326 0.6676080
## [22] 0.6542558 0.6411707 0.6283473 0.6157803 0.6034647 0.5913954 0.5795675
## [29] 0.5679762 0.5566167 0.5454843

How to select stepsize \(t\), backtracking line search

Linear approximation (first order Taylor expansion) at the current point \(x\). \[f(x + t\Delta x) = f(x) + tf'(x)\Delta x\]
Decrease the slope of the approximation by \(\alpha\)

\[f(x) + t\alpha f'(x)\Delta x\]

A sufficient condition for a proposed \(t\) to guarantee decent (decrease objective function): \[f(x + t\Delta x) < f(x) + t\alpha f'(x)\Delta x\]
Plugin \(\Delta x = -f'(x)\) \[f(x - tf'(x)) < f(x) - t\alpha (f'(x))^2\]

Backtracking line search Algorithm

Fix parameter \(0 < \beta < 1\) and \(0 < \alpha \le 1/2\)
At each gradient decent iteration, start with \(t=1\), while \[f(x - t\times f'(x)) > f(x) - \alpha t \times (f'(x))^2\] Update \(t = \beta t\)
When the above interation stops, use \(t\) as the step size

Note: In each iteration, you may always want to initialize \(t = 1\) before backtracking line search.

Implementation of back-tracking for gradient decent on the motivating example

Minimize \(f(x) = \exp(x) + x^4\)
gradient \(g(x) = \exp(x) + 4 x^3\)
Initial point \(x_0 = 1\)
initial stepsize \(t_0 = 1\), \(\alpha = 1/3\), \(\beta = 1/2\)
- At each step, start with \(t = t_0\),
- if \(f(x - tg(x)) > f(x) - \alpha t (g(x))^2\), set \(t = \beta t\)
- otherwise, use \(t\) as step size

f <- function(x) exp(x) + x^4 ## original function
g <- function(x) exp(x) + 4 * x^3 ## gradient function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function

x <- x0 <- 1; x_old <- 0; t0 <- 1; k <- 0; error <- 1e-6; beta = 0.8; alpha <- 0.4
trace <- x
points(x0, f(x0), col=1, pch=1)

while(abs(x - x_old) > error & k < 30){
  k <- k + 1
  x_old <- x
  
  ## backtracking
  t <- t0
  while(f(x_old - t*g(x_old)) > f(x_old) - alpha * t * g(x_old)^2){
    t <- t * beta
  }
  
  x <- x_old - t*g(x_old)
  trace <- c(trace, x) ## collecting results
  points(x, f(x), col=k, pch=19)
  segments(x0=x, y0 = f(x), x1 = x_old, y1 = f(x_old))
}

print(trace)

## [1]  1.00000000  0.09828748 -0.61024238 -0.53353118 -0.52803658 -0.52825877
## [7] -0.52825165 -0.52825188

Exercise (will be on HW), solve for logistic regression

library(ElemStatLearn)
data2 <- prostate[,c("svi", "lcavol")]
str(data2)

## 'data.frame':    97 obs. of  2 variables:
##  $ svi   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcavol: num  -0.58 -0.994 -0.511 -1.204 0.751 ...

Y: svi, binary 0/1.
X: lcavol, continuous.
logistic regression

\[\log \frac{E(Y|x)}{1 - E(Y|x)} = \beta_0 + x\beta_1\]

Exercise (will be on HW), solve for logistic regression

glm_binomial_logit <- glm(svi ~ lcavol, data = prostate,  family = binomial())
summary(glm_binomial_logit)

## 
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79924  -0.48354  -0.21025  -0.04274   2.32135  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.0296     1.0429  -4.823 1.42e-06 ***
## lcavol        1.9798     0.4543   4.358 1.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 101.35  on 96  degrees of freedom
## Residual deviance:  64.14  on 95  degrees of freedom
## AIC: 68.14
## 
## Number of Fisher Scoring iterations: 6

We set \(\beta_0 = -5.0296\).
Try to optimize \(\beta_1\), the coefficient for lcavol using gradient decent method
compare with glm result.
compare with the result using optimize() function.

Exercise (will be on HW), some hints

density function: \[f(y) = p(x)^y(1 - p(x))^{1 - y}\]
likelihood function: \[L(\beta_0, \beta_1) = \prod_{i = 1}^n p(x_i)^{y_i}(1 - p(x_i))^{1 - y_i}\]
logistic regression, logit link \[\log \frac{p(x)}{1 - p(x)} = \beta_0 + x\beta_1\]
log likelihood function:

\(\begin{aligned} l(\beta_0, \beta_1) &= \log\prod_{i = 1}^n p(x_i)^{y_i}(1 - p(x_i))^{(1 - y_i)} \\ & = \sum_{i=1}^n y_i \log p(x_i) + (1 - y_i) \log (1 - p(x_i)) \\ & = \sum_{i=1}^n \log (1 - p(x_i)) + \sum_{i=1}^n y_i \log \frac{p(x_i)}{1 - p(x_i)} \\ & = \sum_{i=1}^n -\log(1 + \exp(\beta_0 + x_i \beta_1)) + \sum_{i=1}^n y_i (\beta_0 + x_i\beta_1)\\ \end{aligned}\)

Newton’s method

For unconstrained, smooth univariate convex optimization \[\min f(x)\] where \(f\) is convex, twice differentable, \(x \in \mathbb{R}\), \(f \in \mathbb{R}\). For gradient descent, start with intial value \(x^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until converge \[x^{(k)} = x^{(k - 1)} - t_k f'(x^{(k - 1)})\]
For Newton’s method, start with intial value \(x^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until converge \[x^{(k)} = x^{(k - 1)} - (f''(x^{(k - 1)}))^{-1} f'(x^{(k - 1)})\] Where \(f''(x^{(k - 1)})\) is the second derivative of \(f\) at \(x^{(k - 1)}\). It is also refered as Hessian matrix for higher dimension (e.g. \(x \in \mathbb{R}^p\))

Newton’s method interpretation

For gradient decent step at \(x\), we minimize the quadratic approximation \[f(y) \approx f(x) + f'(x)(y - x) + \frac{1}{2t}(y - x)^2\] over \(y\), which yield the update \(x^{(k)} = x^{(k - 1)} - tf'(x^{(k-1)})\)
Newton’s method uses a better quadratic approximation: \[f(y) \approx f(x) + f'(x)(y - x) + \frac{1}{2}f''(x)(y - x)^2\] minimizing over \(y\) yield \(x^{(k)} = x^{(k - 1)} - (f''(x^{(k-1)}))^{-1}f'(x^{(k-1)})\)

Newton’s method on our motivating example

f <- function(x) exp(x) + x^4 ## original function
g <- function(x) exp(x) + 4 * x^3 ## gradient function
h <- function(x) exp(x) + 12 * x^2 ## Hessian function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function

x <- x0 <- 0.8; x_old <- 0; t <- 0.1; k <- 0; error <- 1e-6
trace <- x
points(x0, f(x0), col=1, pch=1)

while(abs(x - x_old) > error){
  k <- k + 1
  x_old <- x
  x <- x_old - 1/h(x_old)*g(x_old)
  trace <- c(trace, x) ## collecting results
  points(x, f(x), col=k, pch=19)
}
lines(trace, f(trace), lty = 2)

trace_newton <- trace
print(trace)

## [1]  0.8000000  0.3685707 -0.1665553 -0.8686493 -0.6362012 -0.5432407 -0.5285880
## [8] -0.5282520 -0.5282519

Compare Gradient decent with Newton’s method

f <- function(x) exp(x) + x^4 ## original function
par(mfrow=c(1,2))
title_grad <- paste("gradient decent, nstep =", length(trace_grad))
curve(f, from = -1, to = 1, lwd=2, main = title_grad) 
points(trace_grad, f(trace_grad), col=seq_along(trace_grad), pch=19)
lines(trace_grad, f(trace_grad), lty = 2)

title_newton <- paste("Newton's method, nstep =", length(trace_newton))
curve(f, from = -1, to = 1, lwd=2, main = title_newton) 
points(trace_newton, f(trace_newton), col=seq_along(trace_newton), pch=19)
lines(trace_newton, f(trace_newton), lty = 2)

Alternative interpretation about Newton’s method

Aternative interpretation of Newton’s step: we seek a direction \(v\) so that \(f'(x + v) = 0\).
By Taylor expansion: \[0 = f'(x + v) \approx f'(x) + f''(x)v\]
Solve for \(v\), we have \(v = -(f''(x))^{-1} f'(x)\)

Backtracking line search for Newton’s method

Usually pure Newton’s method works very well (do not necessary need to use backtracking)
We have seen pure Newton’s method, which need not to decent at all iterations.
In practice, we can use Newton’s method with backtracking search, which repeats \[x^{(k)} = x^{(k - 1)} - t(f''(x^{(k-1)}))^{-1}f'(x^{(k-1)})\]
Note that the pure Newton’s method uses \(t = 1\)
Algorithm:
- Fix parameter \(0 < \beta < 1\) and \(0 < \alpha \le 1/2\)
- At each Newton’s iteration \(x\), start with \(t=1\), while \[f(x + t \Delta x) > f(x) + \alpha t \times f'(x) \Delta x\] Update \(t = \beta t\)
- In other word (set \(\Delta x = - (f''(x))^{-1} f'(x)\)), while \[f(x - t (f''(x))^{-1} f'(x)) > f(x) - \alpha t (f''(x))^{-1} (f'(x))^2\] Update \(t = \beta t\)

Implementation of back-tracking for Newton’s method

Minimize \(f(x) = \exp(x) + x^4\)
gradient \(g(x) = \exp(x) + 4 x^3\)
Hessian \(h(x) = \exp(x) + 12 x^2\)
Initial point \(x_0 = 1\)
initial stepsize \(t_0 = 1\), \(\alpha = 1/3\), \(\beta = 1/2\)
- At each step, start with \(t = t_0\),
- if \(f(x - t g(x) / h(x) ) > f(x) - \alpha t (g(x))^2/h(x)\), set \(t = \beta t\)
- otherwise, use \(t\) as step size

f <- function(x) exp(x) + x^4 ## original function
g <- function(x) exp(x) + 4 * x^3 ## gradient function
h <- function(x) exp(x) + 12 * x^2 ## gradient function
curve(f, from = -1, to = 1, lwd=2) ## visualize the objective function

x <- x0 <- 1; x_old <- 0; t0 <- 1; k <- 0; error <- 1e-6; beta = 0.8; alpha <- 0.4
trace <- x
points(x0, f(x0), col=1, pch=1)

while(abs(x - x_old) > error & k < 30){
  k <- k + 1
  x_old <- x
  
  ## backtracking
  t <- t0
  while(f(x_old - t*g(x_old)/h(x_old)) > f(x_old) - alpha * t * g(x_old)^2/h(x_old)){
    t <- t * beta
  }
  
  x <- x_old - t*g(x_old)/h(x_old)
  trace <- c(trace, x) ## collecting results
  points(x, f(x), col=k, pch=19)
  segments(x0=x, y0 = f(x), x1 = x_old, y1 = f(x_old))
}

print(trace)

## [1]  1.00000000  0.54354171  0.09465801 -0.63631402 -0.54326938 -0.52858926
## [7] -0.52825205 -0.52825187

Exercise (will be on HW), solve for logistic regression

library(ElemStatLearn)
glm_binomial_logit <- glm(svi ~ lcavol, data = prostate,  family = binomial())
summary(glm_binomial_logit)

## 
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79924  -0.48354  -0.21025  -0.04274   2.32135  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.0296     1.0429  -4.823 1.42e-06 ***
## lcavol        1.9798     0.4543   4.358 1.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 101.35  on 96  degrees of freedom
## Residual deviance:  64.14  on 95  degrees of freedom
## AIC: 68.14
## 
## Number of Fisher Scoring iterations: 6

We set \(\beta_0 = -5.0296\).
Try to optimize \(\beta_1\), the coefficient for lcavol using Newton’s method
compare with gradient method

Comparison between gradient decent and Newton’s method

Method	Gradient decent	Newton’s method
Order	First order method	Second Order method
Criterion	smooth \(f\)	double smooth \(f\)
Convergence (# iterations)	Slow	Fast
Iteration cost	cheap (compute gradient)	moderate to expensive (Compute Hessian)

Multivariate Case

What if \(\beta\) is not a scalar but instead a vector (i.e. \(\beta \in \mathbb{R}^p\))?
Will gradient decent still work?
- Yes
How to calculate the gradient (derivative) for multivariate case?
Will Newton’s method still work?
- Yes
How to calculate the Hessian matrix for multivariate case?

Gradient in multivariate case

\(\beta = (\beta_1, \beta_2, \ldots, \beta_p)^\top \in \mathbb{R}^p\) is a \(p\)-dimensional column vector
\(f(\beta) \in \mathbb{R}\) is a function which map \(\mathbb{R}^p \rightarrow \mathbb{R}\)
Suppose \(f(\beta)\) is differentiable with respect to \(\beta\).

Then \[\nabla_\beta f(\beta) = \frac{\partial f(\beta)}{\partial \beta} = (\frac{\partial f(\beta)}{\partial \beta_1}, \frac{\partial f(\beta)}{\partial \beta_2}, \ldots, \frac{\partial f(\beta)}{\partial \beta_p})^\top \in \mathbb{R}^p\]

Example \[f(x,y) = 4x^2 + y^2 + 2xy - x - y\]
- \(\frac{\partial f}{\partial x} = 8x + 2y - 1\)
- \(\frac{\partial f}{\partial y} = 2y +2x - 1\)
- \(\nabla f = (\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y})^\top = \binom{8x + 2y - 1}{2y +2x - 1} \in \mathbb{R}^2\)

Visualization of the example function

f <- function(x,y){
  4 * x^2 + y^2 + 2 * x * y - x - y 
}


xx <- yy <- seq(-20, 20, len=1000)
zz <- outer(xx, yy, f)
# Contour plots
contour(xx,yy,zz, xlim = c(-10,10), ylim = c(-20,20), nlevels = 10, levels = seq(1,100,length.out = 10), main="Contour Plot")

Hessian in multivariate case

\(\beta = (\beta_1, \beta_2, \ldots, \beta_p)^\top \in \mathbb{R}^p\) is a \(p\)-dimensional column vector
\(f(\beta) \in \mathbb{R}\) is a function which map \(\mathbb{R}^p \rightarrow \mathbb{R}\)
Suppose \(f(\beta)\) is twice differentiable with respect to \(\beta\).

\[\Delta_\beta f(\beta) = \nabla^2_\beta f(\beta) = \nabla_\beta\frac{\partial f(\beta)}{\partial \beta} = (\nabla_\beta \frac{\partial f(\beta)}{\partial \beta_1}, \nabla_\beta \frac{\partial f(\beta)}{\partial \beta_2}, \ldots, \nabla_\beta \frac{\partial f(\beta)}{\partial \beta_p})^\top \]

\[\Delta_\beta f(\beta) = \begin{pmatrix} \frac{\partial^2 f(\beta)}{\partial \beta_1^2} & \frac{\partial^2 f(\beta)}{\partial \beta_1 \partial \beta_2} & \ldots & \frac{\partial^2 f(\beta)}{\partial \beta_1 \partial\beta_p}\\ \frac{\partial^2 f(\beta)}{\partial \beta_2 \partial \beta_1 } & \frac{\partial^2 f(\beta)}{\partial \beta_2^2} & \ldots & \frac{\partial^2 f(\beta)}{\partial \beta_2 \partial \beta_p} \\ \ldots &\ldots &\ldots &\ldots\\ \frac{\partial^2 f(\beta)}{\partial \beta_p \partial \beta_1} & \frac{\partial^2 f(\beta)}{\partial \beta_p \partial \beta_2} & \ldots & \frac{\partial^2 f(\beta)}{\partial\beta_p^2} \end{pmatrix} \]

Hessian in multivariate case example

\[f(x,y) = 4x^2 + y^2 + 2xy - x - y\]

Gradient
- \(\frac{\partial f}{\partial x} = 8x + 2y - 1\)
- \(\frac{\partial f}{\partial y} = 2y +2x - 1\)

\[\nabla f = (\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y})^\top = \binom{8x + 2y - 1}{2y +2x - 1} \in \mathbb{R}^2\]

Hessian
- \(\frac{\partial^2 f}{\partial x^2} = 8\)
- \(\frac{\partial^2 f}{\partial y^2} = 2\)
- \(\frac{\partial^2 f}{\partial x \partial y} = 2\)

\[\Delta f = \begin{pmatrix} 8 & 2 \\ 2 & 2 \end{pmatrix} \in \mathbb{R}^{2\times 2}\]

Gradient decent for multivariate case

Gradient decent: choose initial \(\beta^{(0)} \in \mathbb{R}^p\), repeat: \[\beta^{(k)} = \beta^{(k - 1)} - t\times \nabla_\beta f(\beta^{(k - 1)}), k = 1,2,3,\ldots\]

Until \(\frac{\|\beta^{(k)} - \beta^{(k - 1)}\|_2}{\|\beta^{(k)} + \beta^{(k - 1)}\|_2} < \varepsilon\), (e.g. \(\varepsilon = 10^{-6}\)).
\(\nabla_\beta f(\beta^{(k - 1)})\) is the gradient (derivative of \(f\)) evaluated at \(\beta^{(k - 1)}\).
\(t\) is a step size for gradient decent procedure.

Gradient decent on our motivating example

f <- function(x,y){
  4 * x^2 + y^2 + 2 * x * y - x - y 
}

g <- function(x,y){
  c(8*x + 2*y - 1, 2*y +2*x - 1)
}

xx <- yy <- seq(-20, 20, len=1000)
zz <- outer(xx, yy, f)
# Contour plots
contour(xx,yy,zz, xlim = c(-10,10), ylim = c(-20,20), nlevels = 10, levels = seq(1,100,length.out = 10), main="Contour Plot")


x <- x0 <- 8; x_old <- 0; 
y <- y0 <- -10; y_old <- 0; 
curXY <- c(x,y)
preXY <- c(x_old, y_old)
t <- 0.1; k <- 0; error <- 1e-6
trace <- list(curXY)
points(x0,y0, col=1, pch=1)

l2n <- function(avec){
  sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){
  deltaVec <- avec - bvec
  sumVec <- avec + bvec
  l2n(deltaVec)/l2n(sumVec)
}

while(diffChange(curXY, preXY) > error){
  k <- k + 1
  preXY <- curXY
  curXY <- preXY - t*g(preXY[1], preXY[2])
  trace <- c(trace, list(curXY)) ## collecting results
  points(curXY[1], curXY[2], col=k, pch=19)
  segments(curXY[1], curXY[2], preXY[1], preXY[2])
}

print(k)

## [1] 97

How to select step size \(t\), backtracking line search

Linear approximation (first order Taylor expansion) at the current point \(x\). \[f(x + t\Delta x) = f(x) + t\nabla f(x)^\top\Delta x\]
Decrease the slope of the approximation by \(\alpha\)

\[f(x) + t\alpha \nabla f(x)^\top\Delta x\]

A sufficient condition for a proposed \(t\) to guarantee decent (decrease objective function): \[f(x + t\Delta x) < f(x) + t\alpha \nabla f(x)^\top\Delta x\]
Plugin \(\Delta x = -f'(x)\) \[f(x - t\nabla f(x)) < f(x) - \alpha t \times \|(\nabla f(x)\|_2^2\]
Algorithm:
- Fix parameter \(0 < \gamma < 1\) and \(0 < \alpha \le 1/2\)
- At each gradient decent iteration, start with \(t=1\), while \[f(\beta - t\times \nabla f(\beta)) > f(\beta) - \alpha t \times \|(\nabla f(\beta)\|_2^2\] Update \(t = \gamma t\)

Implementation of back-tracking for gradient decent

Minimize \(f(x,y) = 4x^2 + y^2 + 2xy - x - y\)
gradient \(\nabla f = \binom{8x + 2y - 1}{2y +2x - 1}\)
Initial point \(x_0 = 8\); \(y_0 = -10\)
initial step size \(t_0 = 1\), \(\alpha = 1/3\), \(\gamma = 1/2\)
- At each step, start with \(t = t_0\),
- if \(f(x - tg(x)) > f(x) - \alpha t \|g(x)\|_2^2\), set \(t = \gamma t\)
- otherwise, use \(t\) as step size

f0 <- function(x,y){
  4 * x^2 + y^2 + 2 * x * y - x - y 
}

f <- function(avec){
  x <- avec[1]
  y <- avec[2]
  4 * x^2 + y^2 + 2 * x * y - x - y 
}

g <- function(avec){
  x <- avec[1]
  y <- avec[2]  
  c(8*x + 2*y - 1, 2*y +2*x - 1)
}

x <- y <- seq(-20, 20, len=1000)
z <- outer(x, y, f0)
# Contour plots
contour(x,y,z, xlim = c(-10,10), ylim = c(-20,20), nlevels = 10, levels = seq(1,100,length.out = 10), main="Contour Plot")


x <- x0 <- 8; x_old <- 0; 
y <- y0 <- -10; y_old <- 0; 
curXY <- c(x,y)
preXY <- c(x_old, y_old)
t0 <- 1; k <- 0; error <- 1e-6
alpha = 1/3
beta = 1/2
trace <- list(curXY)
points(x0,y0, col=1, pch=1)

l2n <- function(avec){
  sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){
  deltaVec <- avec - bvec
  sumVec <- avec + bvec
  l2n(deltaVec)/l2n(sumVec)
}

while(diffChange(curXY, preXY) > error){
  k <- k + 1
  preXY <- curXY
  
  ## backtracking
  t <- t0
  while(f(preXY - t*g(preXY)) > f(preXY) - alpha * t * l2n(g(preXY))^2){
    t <- t * beta
  }
  
  curXY <- preXY - t*g(preXY)
  trace <- c(trace, list(curXY)) ## collecting results
  points(curXY[1], curXY[2], col=k, pch=19)
  segments(curXY[1], curXY[2], preXY[1], preXY[2])
  
}

print(k)

## [1] 25

Exercise, solve for logistic regression

library(ElemStatLearn)
glm_binomial_logit <- glm(svi ~ lcavol, data = prostate,  family = binomial())
summary(glm_binomial_logit)

## 
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79924  -0.48354  -0.21025  -0.04274   2.32135  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.0296     1.0429  -4.823 1.42e-06 ***
## lcavol        1.9798     0.4543   4.358 1.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 101.35  on 96  degrees of freedom
## Residual deviance:  64.14  on 95  degrees of freedom
## AIC: 68.14
## 
## Number of Fisher Scoring iterations: 6

Optimize \(\beta_1\), the coefficient for lcavol and the intercept \(\beta_0\) simultaneously using gradient decent method.

Newton’s method for multivariate case

For unconstrained, smooth univariate convex optimization \[\min f(\beta)\] where \(f\) is convex, twice differentiable, \(\beta \in \mathbb{R}^p\), \(f \in \mathbb{R}\). For gradient descent, start with initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until converge \[\beta^{(k)} = \beta^{(k - 1)} - t_k \nabla f(\beta^{(k - 1)})\]
For Newton’s method, start with initial value \(\beta^{(0)}\) and repeat the following \((k = 1,2,3,\ldots)\) until converge \[\beta^{(k)} = \beta^{(k - 1)} - (\Delta f(\beta^{(k - 1)}))^{-1} \nabla f(\beta^{(k - 1)})\] Where \(\Delta f(\beta^{(k - 1)}) \in \mathbb{R}^{p\times p}\) is the Hessian matrix of \(f\) at \(\beta^{(k - 1)}\).

Implement multivariate Newton’s method

f0 <- function(x,y){
  4 * x^2 + y^2 + 2 * x * y - x - y 
}

f <- function(avec){
  x <- avec[1]
  y <- avec[2]
  4 * x^2 + y^2 + 2 * x * y - x - y 
}

g <- function(avec){
  x <- avec[1]
  y <- avec[2]  
  c(8*x + 2*y - 1, 2*y +2*x - 1)
}

h <- function(avec){
  x <- avec[1]
  y <- avec[2]  
  res <- matrix(c(8,2,2,2),2,2)  ## Hessian function
  return(res)  
} 

x <- y <- seq(-20, 20, len=1000)
z <- outer(x, y, f0)
# Contour plots
contour(x,y,z, xlim = c(-10,10), ylim = c(-20,20), nlevels = 10, levels = seq(1,100,length.out = 10), main="Contour Plot")


x <- x0 <- 8; x_old <- 0; 
y <- y0 <- -10; y_old <- 0; 
curXY <- c(x,y)
preXY <- c(x_old, y_old)
t0 <- 1; k <- 0; error <- 1e-6
trace <- list(curXY)
points(x0,y0, col=1, pch=1)

l2n <- function(avec){
  sqrt(sum(avec^2))
}
diffChange <- function(avec, bvec){
  deltaVec <- avec - bvec
  sumVec <- avec + bvec
  l2n(deltaVec)/l2n(sumVec)
}

while(diffChange(curXY, preXY) > error){
  k <- k + 1
  preXY <- curXY
  
  curXY <- preXY - solve(h(curXY)) %*% g(preXY)
  trace <- c(trace, list(curXY)) ## collecting results
  points(curXY[1], curXY[2], col=k, pch=19)
  segments(curXY[1], curXY[2], preXY[1], preXY[2])
}

k ## why k is only 2?

## [1] 2

Exercise

\[f(x,y) = \exp(xy) + y^2 + x^4\]

What are the x and y such that \(f(x,y)\) is minimized?

Use Newton’s method to solve this problem

Backtracking line search for Newton’s method

Usually pure Newton’s method works very well (do not necessary need to use backtracking)
We have seen pure Newton’s method, which need not to decent at all iterations.
In practice, we can use Newton’s method with backtracking search, which repeats \[\beta^{(k)} = \beta^{(k - 1)} - t (\Delta f(\beta^{(k - 1)}))^{-1} \nabla f(\beta^{(k - 1)})\]
Note that the pure Newton’s method uses \(t = 1\)
Algorithm:
- Fix parameter \(0 < \gamma < 1\) and \(0 < \alpha \le 1/2\)
- At each Newton’s iteration \(x\), start with \(t=1\), while \[f(\beta + t \Delta \beta) > f(\beta) + \alpha t \times \nabla f(\beta)^\top \Delta \beta\] Update \(t = \gamma t\)
- In other word (set \(\Delta \beta = - (\Delta f(\beta))^{-1} \nabla f(\beta)\)), while \[f(\beta - t \Delta f(\beta)^{-1} \nabla f(\beta)) > f(\beta) - \alpha t (\nabla f(\beta))^\top (\Delta f(\beta))^{-1} \nabla f(\beta)\] Update \(t = \gamma t\)

Exercise, solve for logistic regression

library(ElemStatLearn)
glm_binomial_logit <- glm(svi ~ lcavol, data = prostate,  family = binomial())
summary(glm_binomial_logit)

## 
## Call:
## glm(formula = svi ~ lcavol, family = binomial(), data = prostate)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.79924  -0.48354  -0.21025  -0.04274   2.32135  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -5.0296     1.0429  -4.823 1.42e-06 ***
## lcavol        1.9798     0.4543   4.358 1.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 101.35  on 96  degrees of freedom
## Residual deviance:  64.14  on 95  degrees of freedom
## AIC: 68.14
## 
## Number of Fisher Scoring iterations: 6

Optimize \(\beta_1\), the coefficient for lcavol and the intercept \(\beta_0\) simultaneously using Newton’s method.

Coordinate decent

Suppose \(\beta \in \mathbb{R}^P\) and \(f(\beta)\) is a mapping from \(\mathbb{R}^P\) to \(R\).
We can use coordinate descent to find a minimizer.
Start with an initial guess \(\beta^{(0)} = (\beta^{(0)}_1, \beta^{(0)}_2, \ldots, \beta^{(0)}_p)^\top\) and repeat the following:
- \(\beta_1^{(k)} = \arg \min_{\beta_1} f(\beta_1, \beta_2^{(k-1)}, \beta_3^{(k-1)}, \ldots, \beta_p^{(k-1)})\)
- \(\beta_2^{(k)} = \arg \min_{\beta_2} f(\beta_1^{(k)}, \beta_2, \beta_3^{(k-1)}, \ldots, \beta_p^{(k-1)})\)
- \(\beta_3^{(k)} = \arg \min_{\beta_3} f(\beta_1^{(k)}, \beta_2^{(k)}, \beta_3, \ldots, \beta_p^{(k-1)})\)
- \(\ldots\)
- \(\beta_p^{(k)} = \arg \min_{\beta_p} f(\beta_1^{(k)}, \beta_2^{(k)}, \beta_3^{(k)}, \ldots, \beta_p)\)
Continue with \(k=1,2,3,\ldots\) until converge.
- Note that after \(\beta_j^{(k)}\) has been updated, we use the new value for future update.

Example on linear regression

Consider linear regression problem:

\[ \min_{\beta \in \mathbb{R}^p} f(\beta) = \min_{\beta \in \mathbb{R}^p} \frac{1}{2} \|y - X\beta\|_2^2\] where \(y \in \mathbb{R}^n\) and \(X \in \mathbb{R}^{n \times p}\) with columns \(X_1\) (intercept), \(X_2\), \(\ldots\), \(X_p\).

Minimizing over \(\beta_j\) while fixing all \(\beta_i\), \(i \ne j\): \[0 = \frac{\partial f(\beta)}{\partial \beta_j} = X_j^\top (X\beta - y) = X_j^\top (X_j\beta_j + X_{-j}\beta_{-j} - y)\]
We can solve for the coordinate descent updating rule:

\[\beta_j = \frac{X_j^\top (y - X_{-j}\beta_{-j})}{X_j^\top X_j}\]

Biostatistical Computing, PHC 6068

Optimization

Outline

Optimization

Convex function

Optimization functions in R

Univariate optimization example

Use R optimize() function to perform the optimization

Gradient decent

Interpretation

Visual interpretation

Gradient decent on our motivating example

A non-convex example

Discussion, will gradient decent algorithm always converge?

No, see the example

Change to another stepsize

Change to another stepsize

How to select stepsize \(t\), backtracking line search

How to select stepsize \(t\), backtracking line search

Backtracking line search Algorithm

Implementation of back-tracking for gradient decent on the motivating example

Exercise (will be on HW), solve for logistic regression

Exercise (will be on HW), solve for logistic regression

Exercise (will be on HW), some hints

Newton’s method

Newton’s method interpretation

Newton’s method on our motivating example

Compare Gradient decent with Newton’s method

Alternative interpretation about Newton’s method

Backtracking line search for Newton’s method

Implementation of back-tracking for Newton’s method

Exercise (will be on HW), solve for logistic regression

Comparison between gradient decent and Newton’s method

Multivariate Case

Gradient in multivariate case

Visualization of the example function

Hessian in multivariate case

Hessian in multivariate case example

Gradient decent for multivariate case

Gradient decent on our motivating example

How to select step size \(t\), backtracking line search

How to select step size \(t\), backtracking line search

Implementation of back-tracking for gradient decent

Exercise, solve for logistic regression

Newton’s method for multivariate case

Implement multivariate Newton’s method

Exercise

Backtracking line search for Newton’s method

Exercise, solve for logistic regression

Coordinate decent

Example on linear regression

implement coordinate decent using prostate cancer data

Reference