Biostatistical Computing, PHC 6068

Markov Chain Monte Carlo

Zhiguang Huo (Caleb)

Monday Nov 25, 2019

Bayesian sampling method

  1. Monte Carlo methods.
    • Direct sampling.
    • Rejection sampling.
    • Importance sampling.
    • Sampling-importance resampling.
  2. Markov chain Monte Carlo (MCMC).
    • Metropolis-Hastings.
    • Gibbs sampling.

Limitation of Monte Carlo method

Motivation of Metropolis-Hastings

Comparison between importance sampling and Metropolis-Hastings


Monte Carlo vs Markov Chain Monte Carlo

Introduce MH algorithm

MH algorithm

Implementation of MH algorithm

p <- function(x, a=.4, b=.08){exp(a*(x-a)^2 - b*x^4)} ## unnormalized target density
x <- seq(-4, 4, 0.1)
plot(x,p(x),type="l")

N <- 10000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 1 ## Spread of proposal distribution
xp <- 0; ## Starting value
for (ii in 1:N){
  set.seed(ii)
  xc <- rnorm(1, mean=xp, sd=std) ## xp previous sample; xc: current sample
  alpha <- min(1, (
      p(xc) /dnorm(xc, mean=xp,sd=std) /
      (p(xp) / dnorm(xp, mean=xc,sd=std))
                  )
               )
  x.acc5[ii] <- xp <- ifelse(u[ii] < alpha, xc, xp)
  ## count the number of accepted proposals:
  acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N
## [1] 0.7468
par(mfrow=c(1,2), mar=c(2,2,1,1))
plot(x,p(x),type="l")
barplot(table(round(x.acc5,1))/length(x.acc5))

Check samples from MH

plot(x.acc5,type="l")

Vary some parameters in MH

Trade-off between acceptance rate and convergence.
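
As a rough illustration of this trade-off, the sketch below reruns the Metropolis update from the previous chunk with several proposal standard deviations and reports the acceptance rate for each (mh_acc_rate is a hypothetical helper; the sd values 0.1, 1, 5 are arbitrary choices).

mh_acc_rate <- function(std, N = 5000, x0 = 0){
  u <- runif(N)
  xp <- x0; acc <- 0
  for (ii in 1:N){
    xc <- rnorm(1, mean = xp, sd = std) ## propose from a normal centered at xp
    alpha <- min(1, p(xc)/p(xp))        ## symmetric proposal, so the q terms cancel
    if (u[ii] < alpha){ xp <- xc; acc <- acc + 1 }
  }
  acc/N
}
sapply(c(0.1, 1, 5), mh_acc_rate) ## small sd: high acceptance but slow exploration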

Remove initial values from the burn-in period

N <- 1000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 1 ## Spread of proposal distribution
xp <- 8; ## Starting value
for (ii in 1:N){
  xc <- rnorm(1, mean=xp, sd=std) ## xp previous sample; xc: current sample
  alpha <- min(1, (
      p(xc) /dnorm(xc, mean=xp,sd=std) /
      (p(xp) / dnorm(xp, mean=xc,sd=std))
                  )
               )
  x.acc5[ii] <- xp <- ifelse(u[ii] < alpha, xc, xp)
  ## count the number of accepted proposals:
  acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N
## [1] 0.73
plot(x.acc5,type="l")
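
In the trace plot above the chain starts at 8, far from where p(x) has appreciable mass, so the first iterations are a transient rather than draws from the target. A minimal sketch of discarding this burn-in segment before summarizing the samples (the cutoff of 200 is an arbitrary choice for illustration):

burnin <- 200                         ## arbitrary burn-in cutoff
x.keep <- x.acc5[-(1:burnin)]         ## drop the burn-in samples
par(mfrow = c(1, 2), mar = c(2, 2, 1, 1))
plot(x.keep, type = "l")              ## trace after burn-in removal
hist(x.keep, breaks = 40, main = "")  ## should resemble the target p(x)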

Doesn’t converge

N <- 1000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 0.1 ## Spread of proposal distribution
xp <- 0 ## Starting value
for (ii in 1:N){
  xc <- rnorm(1, mean=xp, sd=std) ## xp: previous sample; xc: proposed sample
  alpha <- min(1, (p(xc)/p(xp)) *
                 (dnorm(xp, mean=xc,sd=std)/dnorm(xc, mean=xp,sd=std)))
  x.acc5[ii] <- xp <- ifelse(u[ii] < alpha, xc, xp)
  ## count the number of accepted proposals:
  acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N
## [1] 0.971
plot(x.acc5,type="l")

Why does MH converge?

Concepts:

Why does MH converge?

Without loss of generality, suppose \(p(x')q(x|x') \le p(x)q(x'|x)\), so that the MH acceptance probabilities are \[ A(x'|x) = \frac{p(x')q(x|x')}{p(x)q(x'|x)}, \quad A(x|x') = 1.\] Multiplying through, \[ p(x)q(x'|x) A(x'|x) = p(x')q(x|x') = p(x')q(x|x') A(x|x').\] Writing \(T(x'|x) = q(x'|x)A(x'|x)\) for the transition kernel (for \(x' \neq x\)), \[ p(x)T(x'|x) = p(x')T(x|x'). \] The last line is called the detailed balance condition.

Integrating both sides of the detailed balance condition over \(x\), and using \(\int_{x} T(x|x') dx = 1\), \[ \int_{x} p(x)T(x'|x) dx = \int_{x} p(x')T(x|x') dx = p(x').\] So the target distribution satisfies \[ p(x') = \int_{x} p(x)T(x'|x) dx,\] i.e. it is a stationary distribution of the chain.

Since the stationary distribution is \(p(x)\), the true (target) distribution, the MH algorithm eventually converges to the true distribution.

Special cases for MH

Markov Chain Monte Carlo (MCMC)

Gibbs Sampling algorithm (Wikipedia)

Gibbs Sampling motivating example

Posterior distribution (via Bayes rule)

\[\begin{align*} P(I | G, S) &= \frac{P(I ,G, S)}{P(G, S)} \\ &= \frac{P(G, S|I)P(I)}{P(G, S|I)P(I) + P(G, S|-I)P(-I)} \\ &= \frac{P(G|I)P(S|I)P(I)}{P(G|I)P(S|I)P(I) + P(G|-I)P(S|-I)P(-I)} \\ &= \frac{0.8\times 0.7 \times 0.5}{0.8\times 0.7 \times 0.5 + 0.5\times 0.5 \times 0.5} \\ &= 0.69 \end{align*}\]
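
The same arithmetic can be checked in a couple of lines of R (a small illustration of the calculation above; num and den are just hypothetical names):

num <- 0.8 * 0.7 * 0.5       ## P(G|I) P(S|I) P(I)
den <- num + 0.5 * 0.5 * 0.5 ## + P(G|-I) P(S|-I) P(-I)
num / den                    ## approximately 0.69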

Similarly

Gibbs Sampling algorithm

Gibbs Sampling motivating example

  1. Initialize I,G,S.
  2. Pick an updating order (e.g., I, G, S).
  3. Update each individual variable given all other variables.
Iteration   I     G     S
init        1     1     1
1
2
3
…           …     …     …
K

Implementation in R

I <- 1; G <- 1; S <- 1          ## initial values
pG <- c(0.5, 0.8)               ## P(G = 1 | I = 0), P(G = 1 | I = 1)
pS <- c(0.5, 0.7)               ## P(S = 1 | I = 0), P(S = 1 | I = 1)
pI <- c(0.19, 0.36, 0.49, 0.69) ## P(I = 1 | G, S); see below for justification
for(g in c(0,1)){
  for(s in c(0,1)){
    cat("G:" ,g ,", S:", s, ", p(I):", pI[2*g + s + 1], "\n")
  }
}
## G: 0 , S: 0 , p(I): 0.19 
## G: 0 , S: 1 , p(I): 0.36 
## G: 1 , S: 0 , p(I): 0.49 
## G: 1 , S: 1 , p(I): 0.69
i <- 1
plot(1:3,ylim=c(0,10),type="n", xaxt="n", ylab="iteration")
axis(1, at=1:3,labels=c("I", "G", "S"), col.axis="red")
text(x = 1:3, y= i, label = c(I, G, S))

set.seed(32611)
while(i<10){
  I <- rbinom(1,1,pI[2*G+S+1])
  G <- rbinom(1,1,pG[I+1])
  S <- rbinom(1,1,pS[I+1])
  i <- i + 1  
  text(x = 1:3, y= i, label = c(I, G, S))
}
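
The 10 iterations above are only for display. A sketch of a longer run (the run length and burn-in below are arbitrary choices, and draws_IGS is a hypothetical name) that stores the draws and averages them to approximate marginal probabilities such as P(I = 1):

set.seed(32611)
M_gibbs <- 5000; burn <- 500                ## arbitrary run length and burn-in
draws_IGS <- matrix(NA, M_gibbs, 3, dimnames = list(NULL, c("I", "G", "S")))
I <- 1; G <- 1; S <- 1                      ## re-initialize
for (m in 1:M_gibbs){
  I <- rbinom(1, 1, pI[2*G + S + 1])        ## update I | G, S
  G <- rbinom(1, 1, pG[I + 1])              ## update G | I
  S <- rbinom(1, 1, pS[I + 1])              ## update S | I
  draws_IGS[m, ] <- c(I, G, S)
}
colMeans(draws_IGS[-(1:burn), ])            ## Monte Carlo estimates of the marginals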

Frequentist and Bayesian philosophy

An example of linear regression

Graphical model representation of the data generative process. Shaded nodes represent observed variables, dashed nodes represent priors.


Demonstrate the R code


set.seed(32611)
# Simulate data from the true model
n = 100; alpha = 0; beta = 2; sig2 = 0.5; true = c(alpha, beta, sig2)
x = rnorm(n)
y = rnorm(n, alpha + beta*x, sqrt(sig2))
# Prior hyperparameters
alpha0 = 0; tau2a = 10; beta0 = 0; tau2b = 10; nu0 = 3; mu0 = 3
# Setting up starting values
alpha = 0; beta = 0; sig2 = 1
# Gibbs sampler
M = 1000
draws = matrix(0, M, 3)
draws[1,] <- c(alpha, beta, sig2)
for(i in 2:M){
  # Full conditional of alpha: normal
  var_alpha = 1/(1/tau2a + n/sig2)
  mean_alpha = var_alpha*(sum(y - beta*x)/sig2 + alpha0/tau2a)
  alpha = rnorm(1, mean_alpha, sqrt(var_alpha))
  # Full conditional of beta: normal
  var_beta = 1/(1/tau2b + sum(x^2)/sig2)
  mean_beta = var_beta*(sum((y - alpha)*x)/sig2 + beta0/tau2b)
  beta = rnorm(1, mean_beta, sqrt(var_beta))
  # Full conditional of sig2: inverse-gamma
  sig2 = 1/rgamma(1, (nu0 + n/2), (mu0 + sum((y - alpha - beta*x)^2)/2))
  draws[i,] = c(alpha, beta, sig2)
}

Check Gibbs sampling result

# Markov chain + marginal posterior
names = c('alpha','beta','sig2')
colnames(draws) <- names
ind = 101:M
par(mfrow=c(3,3))
for(i in 1:3){
  ts.plot(draws[,i],xlab='iterations',ylab="",main=names[i])
  abline(v=ind[1],col=4)
  abline(h=true[i],col=2,lwd=2)
  acf(draws[ind,i],main="")
  hist(draws[ind,i],prob=T,main="",xlab="")
  abline(v=true[i],col=2,lwd=2)
}

Posterior mean and median

colMeans(draws[101:M,])
##       alpha        beta        sig2 
## -0.06394806  1.93993716  0.54674891
apply(draws[101:M,],2,median)
##       alpha        beta        sig2 
## -0.06487311  1.93895984  0.53787251
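
The same post-burn-in draws also give interval summaries; for example, equal-tail 95% credible intervals (the 95% level is an illustrative choice, not from the original slides):

apply(draws[101:M,], 2, quantile, probs = c(0.025, 0.975))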

Conjugate prior in Gibbs
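
Because the priors are conjugate, each full conditional used in the Gibbs sampler above has a closed form. Written out to match the update formulas in the R code (here \(\tau^2_a, \tau^2_b, \nu_0, \mu_0\) denote the hyperparameters tau2a, tau2b, nu0, mu0):

\[\begin{align*}
\alpha \mid \beta, \sigma^2, y &\sim N\big(v_\alpha (\textstyle\sum_{i}(y_i - \beta x_i)/\sigma^2 + \alpha_0/\tau^2_a), \; v_\alpha\big), \quad v_\alpha = (1/\tau^2_a + n/\sigma^2)^{-1} \\
\beta \mid \alpha, \sigma^2, y &\sim N\big(v_\beta (\textstyle\sum_{i}(y_i - \alpha) x_i/\sigma^2 + \beta_0/\tau^2_b), \; v_\beta\big), \quad v_\beta = (1/\tau^2_b + \textstyle\sum_i x_i^2/\sigma^2)^{-1} \\
\sigma^2 \mid \alpha, \beta, y &\sim \text{Inverse-Gamma}\big(\nu_0 + n/2, \; \mu_0 + \textstyle\sum_i (y_i - \alpha - \beta x_i)^2/2\big)
\end{align*}\]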

MCMC tradeoff for number of iterations

Burn-in period 1

A burn-in example from my own research

Autocorrelation

\[R_x(k) = \frac {\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})} {\sum_{t=1}^{n-k} (x_t - \bar{x})^2} \]
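
As a sanity check, this formula can be computed directly and compared with R's built-in acf(); a small sketch below (acf_manual is a hypothetical helper, and acf() normalizes by the full-sample variance, so the two can differ slightly):

acf_manual <- function(x, k){
  n <- length(x); xbar <- mean(x)
  sum((x[1:(n-k)] - xbar) * (x[(k+1):n] - xbar)) / sum((x[1:(n-k)] - xbar)^2)
}
acf_manual(draws[101:M, "beta"], 1)                  ## lag-1 autocorrelation of the beta draws
acf(draws[101:M, "beta"], lag.max = 1, plot = FALSE) ## built-in estimate, for comparison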

How to deal with autocorrelation?

High autocorrelation leads to smaller effective sample size.
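
Two common remedies, sketched below with arbitrary settings: thin the chain (here, keep every 10th draw) or run it longer and report the effective sample size (the coda package is assumed to be available for effectiveSize()).

keep <- seq(101, M, by = 10)              ## thinning: keep every 10th post-burn-in draw
draws_thin <- draws[keep, ]
acf(draws_thin[, "beta"], plot = FALSE)   ## autocorrelation after thinning
if (requireNamespace("coda", quietly = TRUE)) {
  coda::effectiveSize(draws[101:M, ])     ## effective sample size of the un-thinned chain
}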

Assess convergence by chain mixing
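
One informal check: run several chains from dispersed starting values and overlay their trace plots; chains that mix into the same region suggest convergence. A minimal sketch reusing the target p() and the Metropolis update from earlier (run_chain, the starting values, and the chain length are illustrative choices):

run_chain <- function(x0, N = 2000, std = 1){
  x <- numeric(N); xp <- x0
  for (ii in 1:N){
    xc <- rnorm(1, mean = xp, sd = std)          ## propose
    if (runif(1) < min(1, p(xc)/p(xp))) xp <- xc ## symmetric proposal: accept w.p. min(1, p(xc)/p(xp))
    x[ii] <- xp
  }
  x
}
chains <- lapply(c(-6, 0, 6), run_chain)         ## three dispersed starting values
plot(chains[[1]], type = "l", ylim = range(unlist(chains)), ylab = "x")
lines(chains[[2]], col = 2); lines(chains[[3]], col = 3)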

Assess convergence by likelihood

Why is Gibbs sampling a special case of MH with \(A(x'|x) = 1\)?

\[\begin{align*} A(x'_i, \textbf{x}_{-i} | x_i, \textbf{x}_{-i} ) &= \min \bigg( 1, \frac{p(x'_i,\textbf{x}_{-i}) q(x_i, \textbf{x}_{-i} | x'_i, \textbf{x}_{-i} ) } {p(x_i,\textbf{x}_{-i}) q(x'_i, \textbf{x}_{-i} | x_i, \textbf{x}_{-i} )} \bigg) \\ &= \min \bigg( 1, \frac{p(x'_i,\textbf{x}_{-i}) p(x_i| \textbf{x}_{-i} ) } {p(x_i,\textbf{x}_{-i}) p(x'_i |\textbf{x}_{-i} )} \bigg) \\ &= \min \bigg( 1, \frac{p(x'_i| \textbf{x}_{-i} ) p(\textbf{x}_{-i}) p(x_i| \textbf{x}_{-i} ) } {p(x_i| \textbf{x}_{-i} ) p( \textbf{x}_{-i}) p(x'_i |\textbf{x}_{-i} )} \bigg) \\ &= \min (1,1)\\ &= 1 \end{align*}\]

References