Zhiguang Huo (Caleb)
Monday Nov 26, 2017
Problem: We want to estimate \[\mathbb{E}(g (x) | p^*(x)) = \int g(x) p^*(x) dx,\] given the distribution \(p^*(x)\).
Examples: \(\mathbb{E} (x | p^*(x))\) or \(\mathbb{V}\mbox{ar} (x | p^*(x))\)
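If we can sample from \(p^*(x)\) directly, a plain Monte Carlo average already solves this problem. A minimal sketch, using \(N(0,1)\) as a stand-in target (an assumption for illustration):

set.seed(32611)
xs <- rnorm(1e5)   # draws from the stand-in target p*(x) = N(0, 1)
mean(xs)           # estimates E(x | p*(x)); truth is 0
var(xs)            # estimates Var(x | p*(x)); truth is 1

The harder (and typical) case is when \(p^*(x)\) can only be evaluated up to a normalizing constant, which motivates the methods below.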
p <- function(x, a = 0.4, b = 0.08){ exp(a*(x - a)^2 - b*x^4) }  # unnormalized density
# normalizing constant Z = integral of p(x)
integrate(f = p, lower = -4, upper = 4)
## 7.852178 with absolute error < 9.1e-06
x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l", main = expression(p(x) == exp (0.4(x-0.4)^{2} - 0.08 * x^{4})))
x2 <- seq(-4, 4, 0.1)
plot(x,p(x),type="n", main = expression(p(x) == exp (0.4(x-0.4)^{2} - 0.08 * x^{4})))
segments(x2,0,x2,p(x2))
Suppose we can evaluate \(p(x)\), an unnormalized version of the target (e.g. \(p(x) = \exp [ 0.4(x-0.4)^2 - 0.08x^4 ]\)), and we can sample from a simpler distribution \(q^*(x)\), scaled so that \(c q^*(x) \ge p(x)\) for all \(x\).
Rejection sampling algorithm:

1. Sample \(x\) from \(q^*(x)\).
2. Sample \(u\) uniformly from \([0, c q^*(x)]\).
3. Accept \(x\) if \(u < p(x)\); otherwise reject it.
x <- seq(-4, 4, 0.01)
qstar <- function(x, C = 30){
C*dnorm(x,sd = 3)
}
plot(x,p(x),type="l", ylim = c(0,5))
curve(qstar,add = T)
text(0, 5, expression({q^"*"} (x) == N (x , 0, 3^2) ))
text(0, 4.5, expression({cq^"*"} (x) == 30* N (x , 0, 3^2) ))
text(1, 2, expression(p(x) == exp (0.4(x-0.4)^{2} - 0.08 * x^{4})))
x0 <- -2.5                         # one fixed proposal value x0
segments(x0,0,x0,qstar(x0),col=2)  # the slice [0, c q*(x0)] above x0
N <- 10
for(i in 1:N){
  set.seed(i)
  ay <- runif(1,0,qstar(x0))      # uniform draw on the slice
  acol = ifelse(ay < p(x0),2,4)   # red: accepted (below p(x0)); blue: rejected
  points(x0,ay,col=acol,pch=19)
}
Proof: \[\begin{align*} p^*(x) &= \frac{p(x)}{Z} \\ &= \frac{p(x)}{\int_x p(x) dx} \\ &= \frac{[p(x)/c q^*(x)]q^*(x)}{\int_x [p(x)/c q^*(x)]q^*(x)dx} \\ \end{align*}\]
Interpretation of the numerator: sample \(x\) from \(q^*(x)\), then accept it with probability \(p(x)/cq^*(x)\); the denominator is the overall acceptance probability, so the accepted draws are distributed as \(p^*(x)\).
## rejection sampling
#p <- function(x, a=.4, b=.08){exp(a*(x-a)^2 - b*x^4)}
x <- seq(-4, 4, 0.1)
qstar <- function(x){
dnorm(x,sd = 3)
}
# we can find the constant C in this case:
C <- round(max(p(x)/qstar(x))) + 1; C
## [1] 28
# number of samples
N <- 1000
# generate proposals and u
x.h <- rnorm( N, sd = 3)
u <- runif( N )
acc <- u < p(x.h) / (C * qstar(x.h))
x.acc <- x.h[ acc ]
# how many proposals are accepted
sum( acc ) /N
## [1] 0.285
c(m = mean(x.acc), s = sd(x.acc))  # summary of the accepted samples (call reconstructed)
##          m          s
## -0.6207873  1.4258200
par(mfrow=c(1,2), mar=c(2,2,1,1))
plot(x,p(x),type="l")
barplot(table(round(x.acc,1))/length(x.acc))
Discussion: What does the acceptance rate depend on?
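One answer: the acceptance probability is \(\int \frac{p(x)}{c q^*(x)} q^*(x) dx = Z/c\), so it depends on how tightly \(c q^*(x)\) hugs \(p(x)\). A quick numerical check against the run above:

# theoretical acceptance rate Z / C, with Z from the integrate() call above
integrate(f = p, lower = -4, upper = 4)$value / C  # about 0.28, vs empirical 0.285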
Importance sampling is not a method for generating samples from \(p(x)\) (target 1), it is just a method for estimating the expectation of a function \(g(x)\) (target 2).
\[\mathbb{E} (\phi (x) | p^* ) = \int \phi (x) p^*(x) dx\]
If we can sample \(x_m\) from \(p^*(x)\) directly, then we can use \(\frac{1}{M} \sum_{m=1}^M \phi(x_m)\) to estimate \(\mathbb{E} (\phi (x) | p^* )\). If we can only sample from \(q^*(x)\), we instead reweight the draws:
\[\hat{\mathbb{E}} (\phi (x) | p^* ) = \frac{\frac{1}{M} \sum_{m=1}^M[\phi (x_m) p(x_m)/q^*(x_m)] }{ \frac{1}{M} \sum_{m=1}^M[p(x_m)/q^*(x_m)]}\]
Writing \(w(x_m) =\frac{p(x_m)}{q^*(x_m)}\) for the importance weights, this becomes
\[\hat{\mathbb{E}} (\phi (x) | p^* ) = \frac{ \sum_{m=1}^M \phi(x_m) w(x_m)}{ \sum_{m=1}^M w(x_m)} \]
Importance ratio: \(\frac{w(x_m)}{ \sum_{m=1}^M w(x_m)}\)
When \(q^* = p^*\), all weights are equal, and the regular mean estimator is recovered as a special case of importance sampling.
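A quick sanity check of this special case (a sketch; \(N(0,1)\) plays the role of both \(p^*\) and \(q^*\), and \(x^2\) stands in for \(\phi\)):

set.seed(32611)
xs <- rnorm(1000)            # q* = p* = N(0, 1)
w <- dnorm(xs) / dnorm(xs)   # w(x_m) = p*(x_m)/q*(x_m) = 1 for every draw
all.equal(sum(xs^2 * w) / sum(w), mean(xs^2))  # TRUE: the plain sample mean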
par(mfrow=c(1,2), mar=c(2,2,2,1))
x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l", main = expression(p(x) == exp (0.4(x-0.4)^{2} - 0.08 * x^{4})))
phi <- function(x){ (- 1/3*x^3 + 1/2*x^2 + 12*x - 12) / 30 + 1.3}
x <- seq(-4, 4, 0.01)
plot(x,phi(x),type="l",main= expression(phi(x)))
ep <- function(x) p(x)*phi(x)
truthE <- integrate(f = ep, lower = -4, upper = 4)$value/integrate(f = p, lower = -4, upper = 4)$value
truthE
## [1] 0.6971733
q.r <- rnorm
q.d <- dnorm
par(mfrow=c(1,2))
plot(x,q.d(x),type="l",main='sampler distribution Gaussian')
curve(p, from = -4,to = 4 ,col=2 , main = expression(p(x) == exp (0.4(x-0.4)^{2} - 0.08 * x^{4})))
# a single importance sampling estimate with the Gaussian sampler
# (reconstructed chunk: the sample size and seed behind this value are not shown)
x.m <- q.r(10000)
ww <- p(x.m) / q.d(x.m)
sum(phi(x.m) * ww) / sum(ww)
## [1] 0.7022795
M <- 10^seq(1,7,length.out = 30)
result.g <- numeric(length(M))
for(i in 1:length(M)){
  aM <- M[i]
  x.m <- q.r(aM)                  # proposals from the Gaussian sampler
  ww <- p(x.m) / q.d(x.m)         # importance weights w(x_m)
  qq.g <- ww / sum(ww)            # normalized importance ratios (sum to 1)
  x.g <- phi(x.m)
  result.g[i] <- sum(x.g * qq.g)  # self-normalized estimate
}
plot(log10(M),result.g,main='importance sampling result Gaussian')
abline(h = truthE, col = 2)
Remark: the importance weights can also be used to generate approximate samples from \(p^*(x)\): draw many proposals from \(q^*\), then resample them with probability proportional to their weights (sampling importance resampling):
#p <- function(x, a=.4, b=.08){exp(a*(x-a)^2 - b*x^4)}
x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l")
qstar <- function(x){rep.int(0.125, length(x))}  # uniform density on [-4, 4]
N <- 10000   # number of proposals
S <- 1000    # number of resampled draws
x.qstar <- runif( N, -4, 4 )       # proposals from q*
ww <- p(x.qstar) / qstar(x.qstar)  # importance weights
qq <- ww / sum(ww)                 # normalized weights
x.acc <- sample(x.qstar, size = S, prob = qq, replace = FALSE)  # resample
par(mfrow=c(1,2), mar=c(2,2,1,1))
plot(x,p(x),type="l")
barplot(table(round(x.acc,1))/length(x.acc))
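A common diagnostic for such weighted samples (not from the original notes) is the effective sample size of the normalized weights, \(1/\sum_m \tilde{w}_m^2\):

# close to N means q* covers p well; close to 1 means severe weight degeneracy
1 / sum(qq^2)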
\(\frac{p(x')}{q(x'|x)}/\frac{p(x)}{q(x|x')}\) is a ratio of importance sampling weights. The Metropolis-Hastings algorithm proposes \(x' \sim q(x'|x)\) and accepts it with probability \(A(x'|x) = \min\left(1, \frac{p(x')q(x|x')}{p(x)q(x'|x)}\right)\), otherwise keeping the current \(x\):
N <- 10000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 1 ## Spread of proposal distribution
xc <- 0; ## Starting value
for (ii in 1:N){
xp <- rnorm(1, mean=xc, sd=std) ## proposal
alpha <- min(1, (p(xp)/p(xc)) *
(dnorm(xc, mean=xp,sd=std)/dnorm(xp, mean=xc,sd=std)))
x.acc5[ii] <- xc <- ifelse(u[ii] < alpha, xp, xc)
## find number of accepted proposals:
acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N
## [1] 0.7341
N <- 1000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 1 ## Spread of proposal distribution
xc <- 8; ## Starting value
for (ii in 1:N){
xp <- rnorm(1, mean=xc, sd=std) ## proposal
alpha <- min(1, (p(xp)/p(xc)) *
(dnorm(xc, mean=xp,sd=std)/dnorm(xp, mean=xc,sd=std)))
x.acc5[ii] <- xc <- ifelse(u[ii] < alpha, xp, xc)
## find number of accepted proposals:
acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N
## [1] 0.732
N <- 1000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 0.1 ## Spread of proposal distribution
xc <- 0; ## Starting value
for (ii in 1:N){
xp <- rnorm(1, mean=xc, sd=std) ## proposal
alpha <- min(1, (p(xp)/p(xc)) *
(dnorm(xc, mean=xp,sd=std)/dnorm(xp, mean=xc,sd=std)))
x.acc5[ii] <- xc <- ifelse(u[ii] < alpha, xp, xc)
## find number of accepted proposals:
acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N
## [1] 0.974
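The three runs illustrate the usual trade-off: a small proposal sd (std = 0.1) is almost always accepted but explores the target slowly, while a larger sd moves in bigger steps at the cost of more rejections. A trace plot of the last chain makes this visible:

# trace plot of the std = 0.1 chain: high acceptance, slow random-walk exploration
plot(x.acc5, type = "l", xlab = "iteration", ylab = "x")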
\[ A(x'|x) = \min\left(1, \frac{p(x')q(x|x')}{p(x)q(x'|x)}\right)\] Suppose (without loss of generality) that \(p(x')q(x|x') \le p(x)q(x'|x)\), so that \(A(x'|x) = \frac{p(x')q(x|x')}{p(x)q(x'|x)}\) while \(A(x|x') = 1\). Then \[ p(x)q(x'|x) A(x'|x) = p(x')q(x|x') \] \[ p(x)q(x'|x) A(x'|x) = p(x')q(x|x') A(x|x')\] Writing \(T(x'|x) = q(x'|x)A(x'|x)\) for the transition kernel (for \(x' \ne x\)), \[ p(x)T(x'|x) = p(x')T(x|x') \] The last line is called the detailed balance condition.
Integrating both sides over \(x\) and using \(\int_{x} T(x|x') dx = 1\): \[ \int_{x} p(x)T(x'|x) dx = \int_{x} p(x')T(x|x') dx = p(x')\] that is, \[ p(x') = \int_{x} p(x)T(x'|x) dx\] so \(p\) is a stationary distribution of the chain.
Since \(p(x)\) is stationary for the chain, the MH algorithm eventually converges to the true distribution.
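These identities are easy to verify numerically. A minimal sketch on a toy 3-state chain (the target and the uniform symmetric proposal are assumptions for illustration):

p.target <- c(0.2, 0.3, 0.5)   # toy target distribution
Q <- matrix(1/3, 3, 3)         # symmetric proposal q(x'|x)
A <- pmin(1, outer(p.target, p.target, function(pi., pj.) pj. / pi.))  # A[i,j] = min(1, p_j/p_i)
T.mh <- Q * A                                    # T(x'|x) = q(x'|x) A(x'|x) for x' != x
diag(T.mh) <- diag(T.mh) + (1 - rowSums(Q * A))  # rejected proposals stay put
all.equal(as.vector(p.target %*% T.mh), p.target)  # stationarity: pT = p
db <- p.target * T.mh          # db[i,j] = p(i) T(j|i)
all.equal(db, t(db))           # detailed balance: db is symmetric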
Similarly, a Gibbs sampler updates one variable at a time from its full conditional distribution; the table below tracks the three binary variables \(I\), \(G\), and \(S\) across iterations.
| Iteration | I | G | S |
|---|---|---|---|
| init | \(1\) | \(1\) | \(1\) |
| 1 | | | |
| 2 | | | |
| 3 | | | |
| … | … | … | … |
| K | | | |
I <- 1; G <- 1; S <- 1           # initial values
pG <- c(0.5, 0.8)                # P(G = 1 | I = 0), P(G = 1 | I = 1)
pS <- c(0.5, 0.7)                # P(S = 1 | I = 0), P(S = 1 | I = 1)
pI <- c(0.19, 0.36, 0.49, 0.69)  # P(I = 1 | G, S), indexed by 2G + S + 1
i <- 1
plot(1:3,ylim=c(0,10),type="n", xaxt="n", ylab="iteration")
axis(1, at=1:3,labels=c("I", "G", "S"), col.axis="red")
text(x = 1:3, y= i, label = c(I, G, S))
set.seed(32611)
while(i<10){
I <- rbinom(1,1,pI[2*G+S+1])
G <- rbinom(1,1,pG[I+1])
S <- rbinom(1,1,pS[I+1])
i <- i + 1
text(x = 1:3, y= i, label = c(I, G, S))
}
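Running the chain longer gives a Monte Carlo estimate of the marginal \(P(I=1)\) (a sketch; the chain length and burn-in are arbitrary choices):

set.seed(32611)
B <- 10000
Is <- numeric(B)
I <- G <- S <- 1             # restart from the same initial values
for (b in 1:B){
  I <- rbinom(1,1,pI[2*G+S+1])
  G <- rbinom(1,1,pG[I+1])
  S <- rbinom(1,1,pS[I+1])
  Is[b] <- I
}
mean(Is[-(1:1000)])          # estimate of P(I = 1) after burn-in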
The full conditional posteriors can be derived using Bayes' rule:
\(var_\alpha \doteq 1/(1/\tau_a^2 + n/\sigma^2)\), \(var_\beta \doteq 1/(1/\tau_b^2 + \sum_{i=1}^n x_i^2/\sigma^2)\)
\(\alpha | x_1^n, y_1^n, \beta, \sigma^2 \sim N(var_\alpha(\sum_{i=1}^n (y_i - \beta x_i)/\sigma^2 + \alpha_0 / \tau_a^2), var_\alpha)\)
\(\beta | x_1^n, y_1^n, \alpha, \sigma^2 \sim N(var_\beta(\sum_{i=1}^n \big( (y_i - \alpha) x_i \big) /\sigma^2 + \beta_0 / \tau_b^2), var_\beta)\)
\(\sigma^2 | x_1^n, y_1^n, \alpha, \beta \sim IG(\nu + n/2, \mu + \sum_{i=1}^n (y_i - \alpha - \beta x_i)^2/2)\)
set.seed(32611)
n = 100; alpha = 0; beta=2; sig2=0.5;true=c(alpha,beta,sig2)
x=rnorm(n)
y=rnorm(n,alpha+beta*x,sqrt(sig2))
# Prior hyperparameters
alpha0=0;tau2a=10;beta0=0;tau2b=10;nu0=3;mu0=3
# Setting up starting values
alpha=0;beta=0;sig2=1
# Gibbs sampler
M = 1000
draws = matrix(0,M,3)
draws[1,] <- c(alpha,beta,sig2)
for(i in 2:M){
  # sample alpha | beta, sig2, data
  var_alpha = 1/(1/tau2a + n/sig2)
  mean = var_alpha*(sum(y-beta*x)/sig2 + alpha0/tau2a)
  alpha = rnorm(1,mean,sqrt(var_alpha))
  # sample beta | alpha, sig2, data
  var_beta = 1/(1/tau2b + sum(x^2)/sig2)
  mean = var_beta*(sum((y-alpha)*x)/sig2+beta0/tau2b)
  beta = rnorm(1,mean,sqrt(var_beta))
  # sample sig2 | alpha, beta, data (inverse-gamma via 1/gamma)
  sig2 = 1/rgamma(1,(nu0+n/2),(mu0+sum((y-alpha-beta*x)^2)/2))
  draws[i,] = c(alpha,beta,sig2)
}
# Markov chain + marginal posterior
names = c('alpha','beta','sig2')
colnames(draws) <- names
ind = 101:M
par(mfrow=c(3,3))
for(i in 1:3){
ts.plot(draws[,i],xlab='iterations',ylab="",main=names[i])
abline(v=ind[1],col=4)
abline(h=true[i],col=2,lwd=2)
acf(draws[ind,i],main="")
hist(draws[ind,i],prob=T,main="",xlab="")
abline(v=true[i],col=2,lwd=2)
}
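Point estimates and interval summaries can then be read off the post-burn-in draws, for example:

# posterior means and 95% credible intervals (rows: alpha, beta, sig2)
t(apply(draws[ind,], 2, function(d) c(mean = mean(d), quantile(d, c(0.025, 0.975)))))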
\[R_x(k) = \frac {\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})} {\sum_{t=1}^{n} (x_t - \bar{x})^2} \]
High autocorrelation leads to a smaller effective sample size: for \(M\) post-burn-in draws, \(\text{ESS} \approx M / (1 + 2\sum_{k=1}^{\infty} R_x(k))\).
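For example, the effective sample size of the \(\beta\) draws can be estimated from the empirical autocorrelations (a sketch using the truncated-sum formula above):

# keep only the positive autocorrelations, a simple truncation rule
rho <- acf(draws[ind, "beta"], plot = FALSE)$acf[-1]
length(ind) / (1 + 2 * sum(rho[rho > 0]))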