Biostatistical Computing, PHC 6068

Monte Carlo methods

Zhiguang Huo (Caleb)

Monday Nov 30, 2020

Sampling method

Monte Carlo methods.
- Direct sampling.
- Rejection sampling.
- Importance sampling.
- Sampling-importance resampling.
Markov chain Monte Carlo (MCMC).
- Metropolis-Hasting.
- Gibbs sampling.

Notation

Denote \(p\), \(q\) as unnormalized density function.
- E.g. \(p(x) = x(1 - x)\), \(x \in (0, 1)\).
Denote \(p^*\), \(q^*\) as normalized density function.
- E.g. \(p^*(x) = \frac{p(x)}{\int_x p(x) dx} = 6x(1 - x)\), \(x \in (0, 1)\).
- E.g. \(q^*(x) = dnorm(x,0,1) = \frac{1}{\sqrt{2\pi}} \exp(-\frac{x^2}{2})\).

Our targets

Target 1, To generate Monte Carlo samples \(x_m\) from a given probability distribution \(p^*(x)\) or \(p(x)\).
Target 2, To estimate expectations of functions under this distribution, for example \[\mathbb{E}(g (x) | p^*(x)) = \int g(x) p^*(x) dx,\]
- Examples:
- \(\mathbb{E} (x | p^*(x)) \doteq A\)
- \(\mathbb{V}\mbox{ar} (x | p^*(x)) = \mathbb{E} [(x - A)^2| p^*(x)]\)

A simulation approach

Problem: We want to estimate \[\mathbb{E}(g (x) | p^*(x)) = \int g(x) p^*(x) dx,\] Given distribution \(p^*(x)\).

Examples: \(\mathbb{E} (x | p^*(x))\) or \(\mathbb{V}\mbox{ar} (x | p^*(x))\)

To generate samples \(x_m\) from a given probability distribution \(p^*(x)\).
- sample using R: (e.g. rnorm).
- CDF transformation: sample from UNIF(0,1). Then use inverse CDF transformation.
To estimate expectations of functions under this emperical distribution \[\mathbb{E}(g (x) | p^*(x)) \approx \frac{1}{M} \sum_{m=1}^M g(x^{(m)})\]
- \(M\) is number of Monte Carlo samples.
- Via law of large number, this is valid.

Un-normalized density function problems

Unnormalized density function: \(p(x) = \exp [ 0.4(x-0.4)^2 - 0.08x^4 ]\)
- \(x \in (-4, 4)\)

p <- function(x, a=.4, b=.08){
  res <- exp(a*(x-a)^2 - b*x^4)
  res[res < -4 | res > 4] <- 0
  res
}
x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l",  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

Hard to compute the normalization factor \(Z\).

integrate(f = p, lower = -4, upper = 4)

## 7.852178 with absolute error < 9.1e-06

Even if we know \(Z\), it is still challenge to draw samples.
A direct solution: partition the distribution into bins and direct sampling.

Direct Sampling

Partition the distribution into bins and perform direct sampling.

x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l",  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

x2 <- seq(-4, 4, 0.1)
plot(x,p(x),type="n",  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))
segments(x2,0,x2,p(x2))

Diffculty
- Higher-dimensional: say \(p = 5\)
- For each dimension, we divide the domain equally into \(50\) bins
- There will be \(50^{5}\) sampling space, huge!

Rejection sampling

\(p^*(x)\) is difficult to directly draw samples, but \(p(x)\) is easy to evaluate..

(e.g., \(p(x) = \exp [ 0.4(x-0.4)^2 - 0.08x^4 ]\))

Sample from a simpler distribution \(q^*(x)\).
Rejection sampling algorithm:

\(x \sim q^*(x)\)
accept \(x\) with prob \(p(x)/c q^*(x)\):
- Sample \(u \sim \mbox{UNIF}(0,1)\), accept if \(p(x)/c q^*(x) > u\).
Repeat Step 1 and 2 many times.

Rejection sampling

x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l",  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

x <- seq(-4, 4, 0.01)
qstar <- function(x, C = 30){
  C*dnorm(x,sd = 3) 
}
plot(x,p(x),type="l", ylim = c(0,5))
curve(qstar,add = T)
text(0, 5, expression({q^"*"} (x) == N (x , 0, 3^2) ))
text(0, 4.5, expression({cq^"*"} (x) == 30* N (x , 0, 3^2) ))
text(1, 2, expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))
x0 <- -2.5 ## if we sampled -2.5 from the proposal distribution
segments(x0,0,x0,qstar(x0),col=2)
N <- 10
for(i in 1:N){
  set.seed(i)
  ay <- runif(1,0,qstar(x0))
  acol = ifelse(ay < p(x0),2,4)
  points(x0,ay,col=acol,pch=19)
}

Rejection sampling

Proof: \[\begin{align*} p^*(x) &= \frac{p(x)}{Z} \\ &= \frac{p(x)}{\int_x p(x) dx} \\ &= \frac{[p(x)/c q^*(x)]q^*(x)}{\int_x [p(x)/c q^*(x)]q^*(x)dx} \\ \end{align*}\]

Interpretation of the numerator:

\(q^*(x):\) Sampling from the proposed distribution.
\(p(x)/c q^*(x):\) Rejection probability.

Rejection sampling

## rejection sampling
#p <- function(x, a=.4, b=.08){exp(a*(x-a)^2 - b*x^4)}
x <- seq(-4, 4, 0.1)
qstar <- function(x){
  dnorm(x,sd = 3) 
}
# we can find M in this case:
C <- round(max(p(x)/qstar(x))) + 1; C

## [1] 28

# number of samples
N <- 1000
# generate proposals and u
x.h <- rnorm( N, sd = 3)
u <- runif( N )
acc <- u < p(x.h) / (C * qstar(x.h))
x.acc <- x.h[ acc ]
# how many proposals are accepted
sum( acc ) /N

## [1] 0.285

# calculate some statistics
c(m=mean(x.acc), s=sd(x.acc))

##          m          s 
## -0.6207873  1.4258200

par(mfrow=c(1,2), mar=c(2,2,1,1))
plot(x,p(x),type="l")
barplot(table(round(x.acc,1))/length(x.acc))

Discussion: What does the acceptance rate depend on?

Importance sampling

Importance sampling is not a method for generating samples from \(p(x)\) (target 1), it is just a method for estimating the expectation of a function \(g(x)\) (target 2).

Goal: want to calculate expectation of \(\phi(x)\) under \(p^*(x)\).
Sampling from \(p^*(x)\) is hard.
Suppose we can sample from a simpler proposal distribution \(q^*\) instead.
If \(q^*\) dominates \(p^*\) (i.e., \(q^*(x)>0\) whenever \(p^*(x)>0\)), we can sample from \(q^*\) and reweight: \(w(x) = \frac{p^*(x)}{q^*(x)}\)

Direct simulation approach

Underlying distribution: \(p(x)\) or \(p^*(x) = \frac{p(x)}{Z}\).
Function of interest \(\phi (x)\).

\[\mathbb{E} (\phi (x) | p^* ) = \int \phi (x) p^*(x) dx\]

If we can sample \(x_m\) from \(p^*(x)\), then we can use \(\frac{1}{M} \sum_{m=1}^M \phi(x_m)\) to estimate \(\mathbb{E} (\phi (x) | p^* )\)

\[\hat{\mathbb{E}} (\phi (x) | p^* ) = \frac{1}{M}\sum_{m=1}^M\phi (x_m) \]

Importance sampling, algorithm

Underlying distribution: \(p(x)\) or \(p^*(x) = \frac{p(x)}{Z}\).
Function of interest \(\phi (x)\).

\[\mathbb{E} (\phi (x) | p^* ) = \int \phi (x) p^*(x) dx\]

If we cannot sample \(x_m\) from \(p^*(x)\),
- Rely on a proposed distribution (Sampler): \(q^*(x)\).

\[\begin{align*} \mathbb{E} (\phi (x) | p^* ) &= \int \phi (x) p^*(x) dx\\ &= \frac{\int \phi (x) p^*(x) dx}{\int p^*(x) dx}\\ &= \frac{\int [\phi (x) p(x)/Z] dx}{\int [p(x)/Z] dx}\\ &= \frac{\int [\phi (x) p(x)/q^*(x)] q^*(x) dx}{\int [p(x)/q^*(x)] q^*(x) dx}, \end{align*}\]

\(\mathbb{E} (\phi (x) | p^* )\) can be estimated using M draws \(x_1, \ldots, x_M\) from \(q^*(x)\) by the following expression.

\[\hat{\mathbb{E}} (\phi (x) | p^* ) = \frac{\frac{1}{M} \sum_{m=1}^M[\phi (x_m) p(x_m)/q^*(x_m)] }{ \frac{1}{M} \sum_{m=1}^M[p(x_m)/q^*(x_m)]}\]

\(w(x_m) =\frac{p(x_m)}{q^*(x_m)}\)

\[\hat{\mathbb{E}} (\phi (x) | p^* ) = \frac{ \sum_{m=1}^M \phi(x_m) w(x_m)}{ \sum_{m=1}^M w(x_m)} \]

Importance ratio: \(\frac{w(x_m)}{ \sum_{m=1}^M w(x_m)}\)
when \(q^* = p\), the regular mean estimator is a special case of the importance sampling.

Importance sampling, examples

Problem setting:

par(mfrow=c(1,2), mar=c(2,2,2,1))

x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l",  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

phi <- function(x){ (- 1/3*x^3 + 1/2*x^2 + 12*x - 12) / 30 + 1.3}
x <- seq(-4, 4, 0.01)
plot(x,phi(x),type="l",main= expression(phi(x)))

\(p(x) = \exp [0.4 (x - 0.4) ^ 2 - 0.08 x^4]\)
\(\phi (x) = (- 1/3x^3 + 1/2x^2 + 12x - 12) / 30 + 1.3\) = right panel.

Underlying solution

ep <- function(x) p(x)*phi(x)
truthE <- integrate(f = ep, lower = -4, upper = 4)$value/integrate(f = p, lower = -4, upper = 4)$value
truthE

## [1] 0.6971733

Importance sampling, examples (2)

q.r <- rnorm
q.d <- dnorm

par(mfrow=c(1,2))
plot(x,q.d(x),type="l",main='sampler distribution Gaussian')
curve(p, from = -4,to = 4 ,col=2 ,  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

M <- 1000
x.m <- q.r(M)
ww <- p(x.m) / q.d(x.m)
qq <- ww / sum(ww) ## importance ratio
x.g <- phi(x.m)
sum(x.g * qq)

## [1] 0.7022795

Number of samples for importance sampling

M <- 10^seq(1,7,length.out = 30)

result.g <- numeric(length(M))
for(i in 1:length(M)){
  aM <- M[i]
  x.m <- q.r(aM)
  ww <- p(x.m) / q.d(x.m)
  
  qq.g <- ww / sum(ww)
  x.g <- phi(x.m)
  
  result.g[i] <- sum(x.g * qq.g)/sum(qq.g)
}

plot(M,result.g,log = "x", main='importance sampling result Gaussian')
abline(h = truthE, col = 2)

Sampling from a narrow Gaussian distribution

q.r_narrow <- function(x){rnorm(x,0,1/2)}
q.d_narrow <- function(x){dnorm(x,0,1/2)}

par(mfrow=c(1,2))
plot(x,q.d_narrow(x),type="l",main='sampler narrow distribution Gaussian')
curve(p, from = -4,to = 4 ,col=2 ,  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

Question

What will affect the estimation accuracy? - sample size \(n\) - others?

Number of samples for importance sampling (narrow Gaussian distribution)

M <- 10^seq(1,7,length.out = 30)

result.narrow <- numeric(length(M))
for(i in 1:length(M)){
  aM <- M[i]
  x.m <- q.r_narrow(aM)
  ww <- p(x.m) / q.d_narrow(x.m)
  
  qq.c <- ww / sum(ww)
  x.c <- phi(x.m)
  
  result.narrow[i] <- sum(x.c * qq.c)/sum(qq.c)
}
plot(M,result.narrow, log="x")
abline(h = truthE, col = 2)

Remarks for importance sampling

Want to estimate the \(\mathbb{E} (\phi (x) | p^* )\).
If the proposal density \(q^*(x)\) is small in a region where \(|\phi(x) p^*(x)|\) is large, it is quite possible that after many points \(x_m\) have been generated, none of them fell in that region. This leads to a wrong estimate of \(\mathbb{E} (\phi (x) | p^* )\).
Importance sampler should have heavy tails.
If \(q^* (x)\) can be chosen such that \(\frac{\phi p}{q^*}\) is roughly constant, then fairly precise estimates of the integral can be obtained.
Importance sampling is not a useful method if the importance ratios vary substantially. The worst possible scenario occurs when the importance ratios are small with high probability but with a low probability are huge.

Importance resampling (SIR)

SIR: sampling-importance resampling. (target 1, generate samples)
This is an alternative when rejection sampling constant \(c\) is not immediately available.
Algorithm (BDA3, reference)
- Draw samples \(x_1, \ldots, x_M \sim q^*(x)\).
- Calculated importance weights \(w_m = p(x_i)/q^*(x_i)\).
- Normalize the weights as \(W_m = \frac{w_m}{\sum_m w_m}\) (importance ratio).
- Resample (\(K\) out of \(M\)) from \(\{ x_1, \ldots, x_M \}\) where \(x_k, 1\le k \le K\) is drawn with probability \(W_m\). (without replacement)

Remark:

also see other people Resample (\(M\) out of \(M\)) with replacement.

Implement importance resampling (SIR)

#p <- function(x, a=.4, b=.08){exp(a*(x-a)^2 - b*x^4)}
x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l")

qstar <- function(x){rep.int(0.125,length(x))} ## proposal distribution: uniform distribution.
N <- 10000
S <- 1000
x.qstar <- runif( N, -4, 4 )
ww <- p(x.qstar) / qstar(x.qstar)
qq <- ww / sum(ww) ## importance ratio
x.acc <-sample(x.qstar, size = S, prob=qq, replace=F)
par(mfrow=c(1,2), mar=c(2,2,1,1))
plot(x,p(x),type="l")
barplot(table(round(x.acc,1))/length(x.acc))

Summarize

Direct sampling. (target 1)
Rejection sampling. (target 1)
Importance sampling. (target 2)
Sampling-importance resampling. (target 1)

limitation of Monte Carlo method

Direct sampling
- Often hard to compute the normalization factor \(Z\).
- Hard to get rare events, especially in higher dimensional spaces.
Rejection sampling, importance sampling.
- Do not work well if the proposed distribution \(q^*(x)\) is very different from \(p(x)\).
- Constructing a \(q^*(x)\) similar to \(p(x)\) can be difficult.
  - Making a good proposal usually requires knowledge of the analytic form of \(p(x)\) - but if we had that, we wouldn’t even need to sample!
Solution: instead of a fixed proposed distribution \(q^*(x)\), we can use an adaptive proposal.

Motivation of Metropolis-Hastings

Drawbacks of rejection sampling and SIR are that it is difficult to propose an efficient distribution \(q^*(x)\)
For rejection sampling, it is also difficult to find \(M\).
A smart idea is let the proposed distribution depends on the last accepted value.
Instead of fixed \(q(x')\), we use \(q(x'|x)\) where \(x'\) is the new state being sampled and \(x\) is the previous sample.
As \(x\) changes, \(q(x'|x)\) can also change (as a function of \(x'\).)

Comparison between importance sampling and Metropolis-Hastings

Monte Carlo vs Markov Chain Monte Carlo

Monte Carlo methods: simulation/sampling methods.
- Simulation
- Rejection sampling
- SIR
Markov Chain Monte Carlo: A type of Monte Carlo method – next step sample depend on previous sample.
- Metropolis-Hastings
- Gibbs Sampling

Introduce MH algorithm

Comparison between importance sampling and Metropolis-Hastings
Draw a sample \(x'\) from \(q(x'|x)\), where \(x\) is the previous sample.
The new sample \(x'\) is accepted or rejected with some probability \(A(x'|x)\). \[A(x'|x) = \min \bigg(1, \frac{p(x')q(x|x')}{p(x)q(x'|x)} \bigg)\]
\(\frac{p(x')}{q(x'|x)}/\frac{p(x)}{q(x|x')}\) is a ratio of importance sampling weights.

MH algorithm

Initialize our parameters \(x^0\).
Given an accepted value \(x^{t-1}\)
1. Draw \(x \sim q(\cdot|x^{t-1})\).
2. Accept \(x\) with probability \(A(x|x^{t-1}) = \min \bigg(1, \frac{p(x)q(x^{t-1}|x)}{p(x^{t-1})q(x|x^{t-1})} \bigg)\), and if accepted, set \(x^t = x\).
  - Draw \(u \sim UNIF(0,1)\).
  - If \(u < A(x|x^{t-1})\), accept.
3. If we didn’t accept \(x\), set \(x^t = x^{t - 1}\)
We repeat \(T\) times (\(t = 1, \ldots, T\)).

Implementation of MH algorithm

target distribution: \(p(x) = \exp [ 0.4(x-0.4)^2 - 0.08x^4 ]\)
sampling distribution: \(q(x^{(t)}) = dnorm(x^{(t)},x^{(t-1)},1)\)

p <- function(x, a=.4, b=.08){exp(a*(x-a)^2 - b*x^4)}
x <- seq(-4, 4, 0.1)
plot(x,p(x),type="l")

N <- 10000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 1 ## Spread of proposal distribution
xp <- 0; ## Starting value
for (ii in 1:N){
  set.seed(ii)
  xc <- rnorm(1, mean=xp, sd=std) ## xp previous sample; xc: current sample
  alpha <- min(1, (
      p(xc) /dnorm(xc, mean=xp,sd=std) /
      (p(xp) / dnorm(xp, mean=xc,sd=std))
                  )
               )
  x.acc5[ii] <- xp <- ifelse(u[ii] < alpha, xc, xp)
  ## find number of acccepted proposals:
  acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N

## [1] 0.737

par(mfrow=c(1,2), mar=c(2,2,1,1))
plot(x,p(x),type="l")
barplot(table(round(x.acc5,1))/length(x.acc5))

Check samples from MH

plot(x.acc5,type="l")

Good convergence.
Samples are focused at the underlying distribution (stationary distribution)

Vary some parameters in MH

Burnin period: disgard intial samples since they may not be in the stationary distribution
- initial x = 8?
Play with variance? (doesn’t converge to the stationary distribution with limited sample size)
- sd = 0.1
- sd = 0.01

Trade off between acceptance rate and convergence.

How many samples (empirically):
- total 10,000 samples.
- 500 burnin samples.

Remove initial values for burnin period

N <- 1000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 1 ## Spread of proposal distribution
xp <- 8; ## Starting value
for (ii in 1:N){
  xc <- rnorm(1, mean=xp, sd=std) ## xp previous sample; xc: current sample
  alpha <- min(1, (
      p(xc) /dnorm(xc, mean=xp,sd=std) /
      (p(xp) / dnorm(xp, mean=xc,sd=std))
                  )
               )
  x.acc5[ii] <- xp <- ifelse(u[ii] < alpha, xc, xp)
  ## find number of acccepted proposals:
  acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N

## [1] 0.73

plot(x.acc5,type="l")

Doesn’t converge

N <- 1000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 0.1 ## Spread of proposal distribution
xc <- 0; ## Starting value
for (ii in 1:N){
  xp <- rnorm(1, mean=xc, sd=std) ## proposal
  alpha <- min(1, (p(xp)/p(xc)) *
                 (dnorm(xc, mean=xp,sd=std)/dnorm(xp, mean=xc,sd=std)))
  x.acc5[ii] <- xc <- ifelse(u[ii] < alpha, xp, xc)
  ## find number of acccepted proposals:
  acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N

## [1] 0.971

plot(x.acc5,type="l")

Why MH converge?

Concepts:

Assume \(x\) follows \(p(x)\)
For any MCMC algorithm, the transition kernel is \(T(x'|x)\):
- probablity of sampling \(x'\) when previous step is \(x\).
Detailed balance condition:
\[ p(x') = \int_{x} p(x)T(x'|x) dx\]
If the transition kernel is \(T(x'|x)\) satisfies the detailed balance condition, the sampling procedure will recover the underlying distribution.

Why MH converge?

In MH we draw sample \(x'\) according to \(q(x'|x)\), then we accept/reject according to \(A(x'|x)\).
The transition kernel is \(T(x'|x) = q(x'|x) A(x'|x)\).
\(A(x'|x) = \min \bigg(1, \frac{p(x')q(x|x')}{p(x)q(x'|x)} \bigg)\)
Proof:
- If \(A(x'|x) \le 1\), then \(\frac{p(x')q(x|x')}{p(x)q(x'|x)} \le 1\), \(\frac{p(x)q(x'|x)}{p(x')q(x|x')} \ge 1\), \(A(x|x') = 1\).

\[ A(x'|x) = \frac{p(x')q(x|x')}{p(x)q(x'|x)}\] \[ p(x)q(x'|x) A(x'|x) = p(x')q(x|x') \] \[ p(x)q(x'|x) A(x'|x) = p(x')q(x|x') A(x|x')\] \[ p(x)T(x'|x) = p(x')T(x|x') \] The last line is called detailed balance condition.

\[ \int_{x} p(x)T(x'|x) dx = \int_{x} p(x')T(x|x') dx\] \[ p(x') = \int_{x} p(x)T(x'|x) dx\]

Since p(x) is the true distribution. MH algorithm will eventually converges to the true distribution.

Special cases for MH

Metropolis algorithm:
- The proposed distribution is symmetrical, e.g. \(q(x'|x) = q(x|x')\) for all pairs (x,x’). In this case the acceptance probability is \(A(x'|x) = \min (1, \frac{p(x')}{p(x)})\)
Random-walk Metropolis:
- A popular choice for proposal in a Metropolis algorithm is \(q(x'|x) = g(x-x')\) where g is symmetric.
Independence sampler:
- The proposed distribution \(q(x'|x) = q(x')\) doesn’t depend on \(x\). The acceptance probability becomes \(A(x'|x) = \min(1, \frac{p(x')}{p(x)} \frac{q(x)}{q(x')})\). This works well when \(q\) is a good approximation to \(p\).
Gibbs sampling:
- \(q(x'|x)\) is the conditional probability given all other variables. \(A(x'|x) = 1\).

Markov Chain Monte Carlo (MCMC)

We always need to pay attention to:
- Burn-in period
- Monitor convergence
Question: Since we know samples are correlated from Metropolis-Hastings algorithm, can we still estimate an estimator \(\theta\) by sample average?
- Yes, if the underlying density function is also about \(\theta\).
- Weak law of large number (WLLN) is still valid if samples are correlated, the only requirement is the samples are identical distributed.
Another popular algorithm: Gibbs sampling.