Biostatistical Computing, PHC 6068

Markov Chain Monte Carlo

Zhiguang Huo (Caleb)

Wednesday Dec 2, 2020

Bayesian sampling methods

  1. Monte Carlo methods.
    • Direct sampling.
    • Rejection sampling.
    • Importance sampling.
    • Sampling-importance resampling.
  2. Markov chain Monte Carlo (MCMC).
    • Metropolis-Hastings.
    • Gibbs sampling.

Gibbs Sampling algorithm (Wikipedia)

Gibbs Sampling motivating example

Posterior distribution (via Bayes rule)

\[\begin{align*} P(I | G, S) &= \frac{P(I ,G, S)}{P(G, S)} \\ &= \frac{P(G, S|I)P(I)}{P(G, S|I)P(I) + P(G, S|-I)P(-I)} \\ &= \frac{P(G|I)P(S|I)P(I)}{P(G|I)P(S|I)P(I) + P(G|-I)P(S|-I)P(-I)} \\ &= \frac{0.8\times 0.7 \times 0.5}{0.8\times 0.7 \times 0.5 + 0.5\times 0.5 \times 0.5} \\ &= 0.69 \end{align*}\]

Similarly, \(P(I | G, S)\) can be computed for the other combinations of \(G\) and \(S\).
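As a quick check, a short R sketch (using the probabilities from the calculation above: \(P(G|I) = 0.8\), \(P(S|I) = 0.7\), \(P(G|-I) = P(S|-I) = 0.5\), and \(P(I) = 0.5\); the helper function post_I is illustrative, not part of the lecture code) reproduces the four values of \(P(I | G, S)\) that the Gibbs sampler below uses.

## Posterior P(I = 1 | G, S) via Bayes rule; post_I is an illustrative helper
pG_given_I <- c(0.5, 0.8)  ## P(G = 1 | I = 0), P(G = 1 | I = 1)
pS_given_I <- c(0.5, 0.7)  ## P(S = 1 | I = 0), P(S = 1 | I = 1)
pI_prior   <- 0.5          ## P(I = 1)
post_I <- function(g, s){
  num <- dbinom(g, 1, pG_given_I[2]) * dbinom(s, 1, pS_given_I[2]) * pI_prior
  den <- num + dbinom(g, 1, pG_given_I[1]) * dbinom(s, 1, pS_given_I[1]) * (1 - pI_prior)
  num / den
}
round(c(post_I(0,0), post_I(0,1), post_I(1,0), post_I(1,1)), 2)
## [1] 0.19 0.36 0.49 0.69   (matches the pI vector used below)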

Gibbs Sampling algorithm

Gibbs Sampling motivating example

  1. Initialize I, G, S.
  2. Pick an updating order (e.g., I, G, S).
  3. Update each variable in turn by sampling from its full conditional distribution, given the current values of all other variables.
| Iteration | I | G | S |
|-----------|---|---|---|
| init      | 1 | 1 | 1 |
| 1         |   |   |   |
| 2         |   |   |   |
| 3         |   |   |   |
| …         | … | … | … |
| K         |   |   |   |

Implementation in R

## Initial states and conditional probability tables
I <- 1; G <- 1; S <- 1
pG <- c(0.5, 0.8)                ## P(G = 1 | I = 0), P(G = 1 | I = 1)
pS <- c(0.5, 0.7)                ## P(S = 1 | I = 0), P(S = 1 | I = 1)
pI <- c(0.19, 0.36, 0.49, 0.69)  ## P(I = 1 | G, S); see below for justification
for(g in c(0,1)){
  for(s in c(0,1)){
    cat("G:", g, ", S:", s, ", p(I):", pI[2*g + s + 1], "\n")
  }
}
## G: 0 , S: 0 , p(I): 0.19 
## G: 0 , S: 1 , p(I): 0.36 
## G: 1 , S: 0 , p(I): 0.49 
## G: 1 , S: 1 , p(I): 0.69
## Empty plot to record the chain: one row of text per iteration
i <- 1
plot(1:3, ylim = c(0,10), type = "n", xaxt = "n", ylab = "iteration")
axis(1, at = 1:3, labels = c("I", "G", "S"), col.axis = "red")
text(x = 1:3, y = i, label = c(I, G, S))

set.seed(32611)
while(i < 10){
  I <- rbinom(1, 1, pI[2*G + S + 1])  ## update I given current G, S
  G <- rbinom(1, 1, pG[I + 1])        ## update G given current I
  S <- rbinom(1, 1, pS[I + 1])        ## update S given current I
  i <- i + 1
  text(x = 1:3, y = i, label = c(I, G, S))
}
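The loop above only writes each state onto the plot. A minimal sketch of a longer run (the variable names nIter and samples are illustrative) stores the chain in a matrix, so marginal probabilities such as \(P(I = 1)\) can be estimated by averaging over iterations after discarding a burn-in period.

## Store the Gibbs samples and estimate marginal probabilities
set.seed(32611)
nIter <- 10000
samples <- matrix(NA, nIter, 3, dimnames = list(NULL, c("I", "G", "S")))
I <- 1; G <- 1; S <- 1
for(b in 1:nIter){
  I <- rbinom(1, 1, pI[2*G + S + 1])
  G <- rbinom(1, 1, pG[I + 1])
  S <- rbinom(1, 1, pS[I + 1])
  samples[b, ] <- c(I, G, S)
}
colMeans(samples[-(1:100), ])  ## Monte Carlo estimates of P(I = 1), P(G = 1), P(S = 1)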

Frequentist and Bayesian philosophy

An example of linear regression

Graphical model representation of the data generative process. Shaded nodes represent observed variables, dashed nodes represent priors.


Demonstrate the R code


set.seed(32611)
# Simulate data from the linear model y = alpha + beta*x + eps, eps ~ N(0, sig2)
n = 100; alpha = 0; beta = 2; sig2 = 0.5; true = c(alpha, beta, sig2)
x = rnorm(n)
y = rnorm(n, alpha + beta*x, sqrt(sig2))
# Prior hyperparameters
alpha0=0;tau2a=10;beta0=0;tau2b=10;nu0=3;mu0=3
# Setting up starting values
alpha=0;beta=0;sig2=1
# Gibbs sampler
M = 1000
draws = matrix(0,M,3)
draws[1,] <- c(alpha,beta,sig2)
for(i in 2:M){
  # full conditional of alpha: normal (normal prior x normal likelihood)
  var_alpha = 1/(1/tau2a + n/sig2)
  mean_alpha = var_alpha*(sum(y - beta*x)/sig2 + alpha0/tau2a)
  alpha = rnorm(1, mean_alpha, sqrt(var_alpha))
  # full conditional of beta: normal
  var_beta = 1/(1/tau2b + sum(x^2)/sig2)
  mean_beta = var_beta*(sum((y - alpha)*x)/sig2 + beta0/tau2b)
  beta = rnorm(1, mean_beta, sqrt(var_beta))
  # full conditional of sig2: inverse-gamma
  sig2 = 1/rgamma(1, (nu0 + n/2), (mu0 + sum((y - alpha - beta*x)^2)/2))
  draws[i,] = c(alpha, beta, sig2)
}

Check Gibbs sampling result

# Markov chain + marginal posterior
names = c('alpha','beta','sig2')
colnames(draws) <- names
ind = 101:M   # discard the first 100 iterations as burn-in
par(mfrow=c(3,3))
for(i in 1:3){
  ts.plot(draws[,i],xlab='iterations',ylab="",main=names[i])
  abline(v=ind[1],col=4)
  abline(h=true[i],col=2,lwd=2)
  acf(draws[ind,i],main="")
  hist(draws[ind,i],prob=T,main="",xlab="")
  abline(v=true[i],col=2,lwd=2)
}

Posterior mean and median

colMeans(draws[101:M,])
##       alpha        beta        sig2 
## -0.06394806  1.93993716  0.54674891
apply(draws[101:M,],2,median)
##       alpha        beta        sig2 
## -0.06487311  1.93895984  0.53787251
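Equal-tail 95% credible intervals can be read off the same post-burn-in draws:

apply(draws[101:M,], 2, quantile, probs = c(0.025, 0.975))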

Conjugate prior in Gibbs
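The closed-form updates in the sampler above come from conjugacy: each full conditional stays in the same family as the corresponding prior. Assuming the priors \(\alpha \sim N(\alpha_0, \tau_a^2)\), \(\beta \sim N(\beta_0, \tau_b^2)\), and \(\sigma^2 \sim \text{Inv-Gamma}(\nu_0, \mu_0)\), which is the parameterization consistent with the code, the full conditionals are

\[\begin{align*} \alpha | \beta, \sigma^2, \textbf{y} &\sim N\bigg( \frac{\sum_{i=1}^n (y_i - \beta x_i)/\sigma^2 + \alpha_0/\tau_a^2}{n/\sigma^2 + 1/\tau_a^2}, \ \frac{1}{n/\sigma^2 + 1/\tau_a^2} \bigg) \\ \beta | \alpha, \sigma^2, \textbf{y} &\sim N\bigg( \frac{\sum_{i=1}^n (y_i - \alpha) x_i/\sigma^2 + \beta_0/\tau_b^2}{\sum_{i=1}^n x_i^2/\sigma^2 + 1/\tau_b^2}, \ \frac{1}{\sum_{i=1}^n x_i^2/\sigma^2 + 1/\tau_b^2} \bigg) \\ \sigma^2 | \alpha, \beta, \textbf{y} &\sim \text{Inv-Gamma}\bigg( \nu_0 + \frac{n}{2}, \ \mu_0 + \frac{1}{2}\sum_{i=1}^n (y_i - \alpha - \beta x_i)^2 \bigg) \end{align*}\]

These are exactly the rnorm and rgamma updates used inside the Gibbs loop.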

MCMC tradeoff for number of iterations

Burn-in period 1

A burn-in example from my own research

Autocorrelation

\[R_x(k) = \frac {\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})} {\sum_{t=1}^{n-k} (x_t - \bar{x})^2} \]
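This can be translated directly into R (the helper autocorr_k below is illustrative, not built-in) and cross-checked against the acf() function used earlier, which applies a slightly different normalization.

## Lag-k autocorrelation, following the formula above
autocorr_k <- function(x, k){
  n <- length(x); xbar <- mean(x)
  sum((x[1:(n-k)] - xbar) * (x[(k+1):n] - xbar)) / sum((x[1:(n-k)] - xbar)^2)
}
autocorr_k(draws[101:M, "beta"], 1)  ## compare with acf(draws[101:M, "beta"], plot = FALSE)$acf[2]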

How to deal with autocorrelation?

High autocorrelation leads to smaller effective sample size.
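A common remedy is thinning: keep only every \(k\)-th draw so the retained samples are closer to independent, at the price of fewer samples. Below is a minimal sketch on the regression draws from above; the effective-sample-size check assumes the coda package is installed.

## Thinning: keep every 5th post-burn-in draw
thinned <- draws[seq(101, M, by = 5), ]
acf(draws[101:M, "beta"], plot = FALSE)$acf[2]  ## lag-1 autocorrelation, original chain
acf(thinned[, "beta"], plot = FALSE)$acf[2]     ## lag-1 autocorrelation, thinned chain

## Effective sample size (requires the coda package)
# library(coda)
# effectiveSize(draws[101:M, ])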

Convergence by chain mixing
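One way to assess mixing is to run several chains from dispersed starting values and check that their trace plots overlap. Below is a minimal sketch that wraps the regression Gibbs sampler above into an illustrative function run_gibbs() and overlays the chains for \(\beta\); it reuses the data and prior hyperparameters already defined.

## Run the same Gibbs sampler from different starting values
run_gibbs <- function(alpha, beta, sig2, M = 1000){
  draws <- matrix(0, M, 3, dimnames = list(NULL, c("alpha", "beta", "sig2")))
  draws[1, ] <- c(alpha, beta, sig2)
  for(i in 2:M){
    var_alpha  <- 1/(1/tau2a + n/sig2)
    alpha      <- rnorm(1, var_alpha*(sum(y - beta*x)/sig2 + alpha0/tau2a), sqrt(var_alpha))
    var_beta   <- 1/(1/tau2b + sum(x^2)/sig2)
    beta       <- rnorm(1, var_beta*(sum((y - alpha)*x)/sig2 + beta0/tau2b), sqrt(var_beta))
    sig2       <- 1/rgamma(1, nu0 + n/2, mu0 + sum((y - alpha - beta*x)^2)/2)
    draws[i, ] <- c(alpha, beta, sig2)
  }
  draws
}

set.seed(32611)
chain1 <- run_gibbs(alpha = -5, beta = -5, sig2 = 10)
chain2 <- run_gibbs(alpha =  5, beta =  5, sig2 = 0.1)
ts.plot(cbind(chain1[, "beta"], chain2[, "beta"]), col = c(1, 2),
        xlab = "iterations", main = "beta: two chains from dispersed starting values")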

Convergence by likelihood
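Another diagnostic is to track the log-likelihood of the data at each iteration: after burn-in it should fluctuate around a stable level rather than drift. A minimal sketch using the regression draws from above:

## Log-likelihood of the data at each saved draw
loglik <- apply(draws, 1, function(d){
  sum(dnorm(y, mean = d["alpha"] + d["beta"]*x, sd = sqrt(d["sig2"]), log = TRUE))
})
ts.plot(loglik, xlab = "iterations", ylab = "log-likelihood")
abline(v = 101, col = 4)  ## end of burn-in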

Why is Gibbs a special case of MH with \(A(x'|x) = 1\)?

\[\begin{align*} A(x'_i, \textbf{x}_{-i} | x_i, \textbf{x}_{-i} ) &= \min \bigg( 1, \frac{p(x'_i,\textbf{x}_{-i}) q(x_i, \textbf{x}_{-i} | x'_i, \textbf{x}_{-i} ) } {p(x_i,\textbf{x}_{-i}) q(x'_i, \textbf{x}_{-i} | x_i, \textbf{x}_{-i} )} \bigg) \\ &= \min \bigg( 1, \frac{p(x'_i,\textbf{x}_{-i}) p(x_i| \textbf{x}_{-i} ) } {p(x_i,\textbf{x}_{-i}) p(x'_i |\textbf{x}_{-i} )} \bigg) \\ &= \min \bigg( 1, \frac{p(x'_i| \textbf{x}_{-i} ) p(\textbf{x}_{-i}) p(x_i| \textbf{x}_{-i} ) } {p(x_i| \textbf{x}_{-i} ) p( \textbf{x}_{-i}) p(x'_i |\textbf{x}_{-i} )} \bigg) \\ &= \min (1,1)\\ &= 1 \end{align*}\]

References