Biostatistical Computing, PHC 6068

Monte Carlo methods

Zhiguang Huo (Caleb)

Monday Nov 26, 2017

Bayesian sampling method

  1. Monte Carlo methods.
    • Direct sampling.
    • Rejection sampling.
    • Importance sampling.
    • Sampling-importance resampling.
  2. Markov chain Monte Carlo (MCMC).
    • Metropolis-Hastings.
    • Gibbs sampling.

Notation

Our targets

Examples:

A simulation approach

Problem: given a distribution \(p^*(x)\), we want to estimate \[\mathbb{E}(g (x) | p^*(x)) = \int g(x) p^*(x) dx.\]

Examples: \(\mathbb{E} (x | p^*(x))\) or \(\mathbb{V}\mbox{ar} (x | p^*(x))\)
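If we can sample from \(p^*(x)\) directly, the Monte Carlo estimate is just a sample average. A minimal sketch (illustration only: here \(p^*(x)\) is taken to be a standard normal and \(g(x) = x^2\), so the true value is 1):

g <- function(x){x^2}          ## example g(x)
set.seed(32611)
x.samples <- rnorm(10000)      ## M = 10000 draws from p*(x) = N(0,1)
mean(g(x.samples))             ## sample average approximates the integral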

Un-normalized density function problems

p <- function(x, a=.4, b=.08){
  ## un-normalized target density p(x) = exp(0.4 (x - 0.4)^2 - 0.08 x^4)
  res <- exp(a*(x-a)^2 - b*x^4)
  res[res < -4 | res > 4] <- 0
  res
}
x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l",  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

integrate(f = p, lower = -4, upper = 4)
## 7.852178 with absolute error < 9.1e-06

Direct Sampling

x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l",  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

x2 <- seq(-4, 4, 0.1)
plot(x,p(x),type="n",  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))
segments(x2,0,x2,p(x2))

Rejection sampling

Given an un-normalized density \(p(x)\) (e.g. \(p(x) = \exp [ 0.4(x-0.4)^2 - 0.08x^4 ]\)), choose a proposal \(q^*(x)\) and a constant \(c\) such that \(c q^*(x) \geq p(x)\) for all \(x\). Then:

  1. Sample \(x \sim q^*(x)\).
  2. Accept \(x\) with probability \(p(x)/[c\, q^*(x)]\):
    • Sample \(u \sim \mbox{UNIF}(0,1)\); accept \(x\) if \(p(x)/[c\, q^*(x)] > u\).
  3. Repeat Steps 1 and 2 many times.

Rejection sampling

x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l",  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

x <- seq(-4, 4, 0.01)
qstar <- function(x, C = 30){
  C*dnorm(x,sd = 3) 
}
plot(x,p(x),type="l", ylim = c(0,5))
curve(qstar,add = T)
text(0, 5, expression({q^"*"} (x) == N (x , 0, 3^2) ))
text(0, 4.5, expression({cq^"*"} (x) == 30* N (x , 0, 3^2) ))
text(1, 2, expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))
x0 <- -2.5
segments(x0,0,x0,qstar(x0),col=2)
N <- 10
for(i in 1:N){
  set.seed(i)
  ay <- runif(1,0,qstar(x0))
  acol = ifelse(ay < p(x0),2,4)
  points(x0,ay,col=acol,pch=19)
}

Rejection sampling

Proof: \[\begin{align*} p^*(x) &= \frac{p(x)}{Z} \\ &= \frac{p(x)}{\int_x p(x) dx} \\ &= \frac{[p(x)/c q^*(x)]q^*(x)}{\int_x [p(x)/c q^*(x)]q^*(x)dx} \\ \end{align*}\]

Interpretation of the numerator: \(q^*(x)\) is the density of proposing \(x\), and \(p(x)/c q^*(x)\) is the probability of accepting it; their product is proportional to \(p(x)\), so the accepted samples follow \(p^*(x)\).

Rejection sampling

## rejection sampling
#p <- function(x, a=.4, b=.08){exp(a*(x-a)^2 - b*x^4)}
x <- seq(-4, 4, 0.1)
qstar <- function(x){
  dnorm(x,sd = 3) 
}
# we can find the bounding constant C in this case:
C <- round(max(p(x)/qstar(x))) + 1; C
## [1] 28
# number of samples
N <- 1000
# generate proposals and u
x.h <- rnorm( N, sd = 3)
u <- runif( N )
acc <- u < p(x.h) / (C * qstar(x.h))
x.acc <- x.h[ acc ]
# how many proposals are accepted
sum( acc ) /N
## [1] 0.285
# calculate some statistics
c(m=mean(x.acc), s=sd(x.acc))
##          m          s 
## -0.6207873  1.4258200
par(mfrow=c(1,2), mar=c(2,2,1,1))
plot(x,p(x),type="l")
barplot(table(round(x.acc,1))/length(x.acc))

Discussion: What does the acceptance rate depend on?
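One way to see the answer: the acceptance rate is approximately \(Z/c\), where \(Z = \int p(x) dx\) is the normalizing constant and \(c\) is the bounding constant. A quick numerical check, reusing p and the bound C = 28 from the code above:

Z <- integrate(f = p, lower = -4, upper = 4)$value
Z / 28     ## roughly 0.28, close to the observed acceptance rate 0.285
Z / 100    ## a looser bound c would waste far more proposals

So the acceptance rate depends on how tightly \(c q^*(x)\) envelops \(p(x)\).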

Importance sampling

Importance sampling is not a method for generating samples from \(p(x)\) (target 1); it is only a method for estimating the expectation of a function \(\phi(x)\) (target 2).

Importance sampling, algorithm

\[\mathbb{E} (\phi (x) | p^* ) = \int \phi (x) p^*(x) dx\]

\[\begin{align*} \mathbb{E} (\phi (x) | p^* ) &= \int \phi (x) p^*(x) dx\\ &= \frac{\int \phi (x) p^*(x) dx}{\int p^*(x) dx}\\ &= \frac{\int [\phi (x) p(x)/Z] dx}{\int [p(x)/Z] dx}\\ &= \frac{\int [\phi (x) p(x)/q^*(x)] q^*(x) dx}{\int [p(x)/q^*(x)] q^*(x) dx}, \end{align*}\]

\[\hat{\mathbb{E}} (\phi (x) | p^* ) = \frac{\frac{1}{M} \sum_{m=1}^M[\phi (x_m) p(x_m)/q^*(x_m)] }{ \frac{1}{M} \sum_{m=1}^M[p(x_m)/q^*(x_m)]}\]

Defining the importance weights \(w(x_m) =\frac{p(x_m)}{q^*(x_m)}\), the estimator becomes

\[\hat{\mathbb{E}} (\phi (x) | p^* ) = \frac{ \sum_{m=1}^M \phi(x_m) w(x_m)}{ \sum_{m=1}^M w(x_m)} \]

Importance sampling, examples

par(mfrow=c(1,2), mar=c(2,2,2,1))

x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l",  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

phi <- function(x){ (- 1/3*x^3 + 1/2*x^2 + 12*x - 12) / 30 + 1.3}
x <- seq(-4, 4, 0.01)
plot(x,phi(x),type="l",main= expression(phi(x)))

Underlying solution

\[\begin{align*} \mathbb{E} (\phi (x) | p^* ) &= \int \phi (x) p^*(x) dx\\ &= \frac{\int \phi (x) p^*(x) dx}{\int p^*(x) dx}\\ &= \frac{\int [\phi (x) p(x)/Z] dx}{\int [p(x)/Z] dx}\\ &= \frac{\int [\phi (x) p(x)] dx}{\int [p(x)] dx}\\ \end{align*}\]
ep <- function(x) p(x)*phi(x)
truthE <- integrate(f = ep, lower = -4, upper = 4)$value/integrate(f = p, lower = -4, upper = 4)$value
truthE
## [1] 0.6971733

Importance sampling, examples (2)

q.r <- rnorm
q.d <- dnorm

par(mfrow=c(1,2))
plot(x,q.d(x),type="l",main='sampler distribution Gaussian')
curve(p, from = -4,to = 4 ,col=2 ,  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

M <- 1000
x.m <- q.r(M)
ww <- p(x.m) / q.d(x.m)
qq <- ww / sum(ww)
x.g <- phi(x.m)
sum(x.g * qq)
## [1] 0.7022795

Number of samples for importance sampling

M <- 10^seq(1,7,length.out = 30)

result.g <- numeric(length(M))
for(i in 1:length(M)){
  aM <- M[i]
  x.m <- q.r(aM)
  ww <- p(x.m) / q.d(x.m)
  
  qq.g <- ww / sum(ww)
  x.g <- phi(x.m)
  
  result.g[i] <- sum(x.g * qq.g)/sum(qq.g)
}

plot(log10(M),result.g,main='importance sampling result Gaussian')
abline(h = truthE, col = 2)

Sampling from a narrow Gaussian distribution

q.r_narrow <- function(n){rnorm(n, 0, 1/2)}  ## random draws from N(0, 0.5^2)
q.d_narrow <- function(x){dnorm(x, 0, 1/2)}  ## the corresponding density

par(mfrow=c(1,2))
plot(x,q.d_narrow(x),type="l",main='sampler narrow distribution Gaussian')
curve(p, from = -4,to = 4 ,col=2 ,  main = expression(p(x) == exp (0.4(x-0.4)^{2}  - 0.08 * x^{4})))

Number of samples for importance sampling (narrow Gaussian distribution)

M <- 10^seq(1,7,length.out = 30)

result.narrow <- numeric(length(M))
for(i in 1:length(M)){
  aM <- M[i]
  x.m <- q.r_narrow(aM)
  ww <- p(x.m) / q.d_narrow(x.m)
  
  qq.c <- ww / sum(ww)
  x.c <- phi(x.m)
  
  result.narrow[i] <- sum(x.c * qq.c)/sum(qq.c)
}
plot(log(M,10),result.narrow)
abline(h = truthE, col = 2)

Remarks for importance sampling

Sampling-importance resampling (SIR)

Remark:

Implement sampling-importance resampling (SIR)

#p <- function(x, a=.4, b=.08){exp(a*(x-a)^2 - b*x^4)}
x <- seq(-4, 4, 0.01)
plot(x,p(x),type="l")

qstar <- function(x){rep.int(0.125,length(x))}  ## uniform proposal density on [-4, 4]: 1/8 = 0.125
N <- 10000
S <- 1000
x.qstar <- runif( N, -4, 4 )
ww <- p(x.qstar) / qstar(x.qstar)
qq <- ww / sum(ww)
x.acc <-sample(x.qstar, size = S, prob=qq, replace=F)
par(mfrow=c(1,2), mar=c(2,2,1,1))
plot(x,p(x),type="l")
barplot(table(round(x.acc,1))/length(x.acc))
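The resampled draws behave (approximately) like samples from \(p^*(x)\), so they can also be used to estimate expectations. A quick check against truthE computed earlier:

mean(phi(x.acc))                   ## should be close to truthE (about 0.70)
c(m = mean(x.acc), s = sd(x.acc))  ## summary statistics of the SIR sample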

Summary

Limitations of Monte Carlo methods

Motivation of Metropolis-Hastings

Comparison between importance sampling and Metropolis-Hastings

Comparison between importance sampling and Metropolis-Hastings

Monte Carlo vs Markov Chain Monte Carlo

Introduce MH algorithm

MH algorithm

Implementation of MH algorithm

#p <- function(x, a=.4, b=.08){exp(a*(x-a)^2 - b*x^4)}
x <- seq(-4, 4, 0.1)
plot(x,p(x),type="l")

N <- 10000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 1 ## Spread of proposal distribution
xc <- 0; ## Starting value
for (ii in 1:N){
  xp <- rnorm(1, mean=xc, sd=std) ## proposal
  alpha <- min(1, (p(xp)/p(xc)) *
                 (dnorm(xc, mean=xp,sd=std)/dnorm(xp, mean=xc,sd=std)))
  x.acc5[ii] <- xc <- ifelse(u[ii] < alpha, xp, xc)
  ## count the number of accepted proposals:
  acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N
## [1] 0.7341
par(mfrow=c(1,2), mar=c(2,2,1,1))
plot(x,p(x),type="l")
barplot(table(round(x.acc5,1))/length(x.acc5))

Check samples from MH

plot(x.acc5,type="l")

Good convergence.

Check samples from MH

Remove initial values from the burn-in period

N <- 1000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 1 ## Spread of proposal distribution
xc <- 8; ## Starting value
for (ii in 1:N){
  xp <- rnorm(1, mean=xc, sd=std) ## proposal
  alpha <- min(1, (p(xp)/p(xc)) *
                 (dnorm(xc, mean=xp,sd=std)/dnorm(xp, mean=xc,sd=std)))
  x.acc5[ii] <- xc <- ifelse(u[ii] < alpha, xp, xc)
  ## count the number of accepted proposals:
  acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N
## [1] 0.732
plot(x.acc5,type="l")

Doesn't converge: with a proposal spread that is too small (std = 0.1), the chain mixes poorly and does not explore the target distribution within 1,000 iterations.

N <- 1000
x.acc5 <- rep.int(NA, N)
u <- runif(N)
acc.count <- 0
std <- 0.1 ## Spread of proposal distribution
xc <- 0; ## Starting value
for (ii in 1:N){
  xp <- rnorm(1, mean=xc, sd=std) ## proposal
  alpha <- min(1, (p(xp)/p(xc)) *
                 (dnorm(xc, mean=xp,sd=std)/dnorm(xp, mean=xc,sd=std)))
  x.acc5[ii] <- xc <- ifelse(u[ii] < alpha, xp, xc)
  ## count the number of accepted proposals:
  acc.count <- acc.count + (u[ii] < alpha)
}
## Fraction of accepted *new* proposals
acc.count/N
## [1] 0.974
plot(x.acc5,type="l")

Why does MH converge? (optional)

Suppose, without loss of generality, that \(p(x')q(x|x') \leq p(x)q(x'|x)\), so that \[ A(x'|x) = \frac{p(x')q(x|x')}{p(x)q(x'|x)} \quad \mbox{and} \quad A(x|x') = 1.\] Then \[ p(x)q(x'|x) A(x'|x) = p(x')q(x|x') = p(x')q(x|x') A(x|x').\] Writing \(T(x'|x) = q(x'|x)A(x'|x)\) for the transition kernel (for \(x' \neq x\)), \[ p(x)T(x'|x) = p(x')T(x|x'). \] The last line is called the detailed balance condition.

\[ \int_{x} p(x)T(x'|x) dx = \int_{x} p(x')T(x|x') dx\] \[ p(x') = \int_{x} p(x)T(x'|x) dx\]

Since \(p(x)\) satisfies the stationarity condition above, the MH algorithm will eventually converge to the true distribution \(p(x)\).

Special cases for MH

Markov Chain Monte Carlo (MCMC)

Gibbs Sampling algorithm (Wikipedia)

Gibbs Sampling motivating example

Posterior distribution (via Bayes rule)

\[\begin{align*} P(I | G, S) &= \frac{P(I ,G, S)}{P(G, S)} \\ &= \frac{P(G, S|I)P(I)}{P(G, S|I)P(I) + P(G, S|-I)P(-I)} \\ &= \frac{P(G|I)P(S|I)P(I)}{P(G|I)P(S|I)P(I) + P(G|-I)P(S|-I)P(-I)} \\ &= \frac{0.8\times 0.7 \times 0.5}{0.8\times 0.7 \times 0.5 + 0.5\times 0.5 \times 0.5} \\ &= 0.69 \end{align*}\]

Similarly, \(P(I | G, S)\) can be computed for the other combinations of \(G\) and \(S\); see the check below.
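A short numerical check of all four values of \(P(I = 1 | G, S)\) via Bayes rule (a sketch; the function post_I below is only for illustration, and the conditional probabilities are those from the example):

pG_given_I <- c(0.5, 0.8)   ## P(G = 1 | I = 0), P(G = 1 | I = 1)
pS_given_I <- c(0.5, 0.7)   ## P(S = 1 | I = 0), P(S = 1 | I = 1)
prior_I <- 0.5              ## P(I = 1)
post_I <- function(G, S){
  like1 <- ifelse(G == 1, pG_given_I[2], 1 - pG_given_I[2]) *
           ifelse(S == 1, pS_given_I[2], 1 - pS_given_I[2]) * prior_I
  like0 <- ifelse(G == 1, pG_given_I[1], 1 - pG_given_I[1]) *
           ifelse(S == 1, pS_given_I[1], 1 - pS_given_I[1]) * (1 - prior_I)
  like1 / (like1 + like0)
}
round(c(post_I(0,0), post_I(0,1), post_I(1,0), post_I(1,1)), 2)
## [1] 0.19 0.36 0.49 0.69

These are the values stored in pI in the implementation below.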

Gibbs Sampling algorithm

Gibbs Sampling motivating example

  1. Initialize I,G,S.
  2. Pick an updating order. (e.g. I,G,S)
  3. Update each individual variable given all other variables.
Iteration   I      G      S
init        \(1\)  \(1\)  \(1\)
1
2
3
…           …      …      …
K

Implementation in R

I <- 1; G <- 1; S <- 1            ## initial values
pG <- c(0.5, 0.8)                 ## P(G = 1 | I = 0), P(G = 1 | I = 1)
pS <- c(0.5, 0.7)                 ## P(S = 1 | I = 0), P(S = 1 | I = 1)
pI <- c(0.19, 0.36, 0.49, 0.69)   ## P(I = 1 | G, S) for (G,S) = (0,0), (0,1), (1,0), (1,1)

i <- 1
plot(1:3,ylim=c(0,10),type="n", xaxt="n", ylab="iteration")
axis(1, at=1:3,labels=c("I", "G", "S"), col.axis="red")
text(x = 1:3, y= i, label = c(I, G, S))

set.seed(32611)
while(i<10){
  I <- rbinom(1,1,pI[2*G+S+1])
  G <- rbinom(1,1,pG[I+1])
  S <- rbinom(1,1,pS[I+1])
  i <- i + 1  
  text(x = 1:3, y= i, label = c(I, G, S))
}

Frequentist and Bayesian philosophy

An example of linear regression

Graphical model representation of the data generative process. Shaded nodes represent observed variables, dashed nodes represent priors.

Graphical model representation of the data generative process. Shaded nodes represent observed variables, dashed nodes represent priors.
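For reference, here is the model implemented by the Gibbs sampler code below (a sketch reconstructed from the code; \(\alpha_0, \tau^2_a, \beta_0, \tau^2_b, \nu_0, \mu_0\) correspond to alpha0, tau2a, beta0, tau2b, nu0, mu0):

\[y_i \sim N(\alpha + \beta x_i, \sigma^2), \quad \alpha \sim N(\alpha_0, \tau^2_a), \quad \beta \sim N(\beta_0, \tau^2_b), \quad \sigma^2 \sim \mbox{Inv-Gamma}(\nu_0, \mu_0)\]

Each Gibbs update draws from a full conditional distribution:

\[\alpha | \beta, \sigma^2, y \sim N\Big(V_\alpha \big[\textstyle\sum_i(y_i - \beta x_i)/\sigma^2 + \alpha_0/\tau^2_a\big], V_\alpha\Big), \quad V_\alpha = \big(1/\tau^2_a + n/\sigma^2\big)^{-1}\]

\[\beta | \alpha, \sigma^2, y \sim N\Big(V_\beta \big[\textstyle\sum_i(y_i - \alpha) x_i/\sigma^2 + \beta_0/\tau^2_b\big], V_\beta\Big), \quad V_\beta = \big(1/\tau^2_b + \textstyle\sum_i x_i^2/\sigma^2\big)^{-1}\]

\[\sigma^2 | \alpha, \beta, y \sim \mbox{Inv-Gamma}\Big(\nu_0 + n/2, \; \mu_0 + \textstyle\sum_i (y_i - \alpha - \beta x_i)^2/2\Big)\]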

Demonstrate the R code

set.seed(32611)
n = 100; alpha = 0; beta=2; sig2=0.5;true=c(alpha,beta,sig2)
x=rnorm(n)
y=rnorm(n,alpha+beta*x,sqrt(sig2))
# Prior hyperparameters
alpha0=0;tau2a=10;beta0=0;tau2b=10;nu0=3;mu0=3
# Setting up starting values
alpha=0;beta=0;sig2=1
# Gibbs sampler
M = 1000
draws = matrix(0,M,3)
draws[1,] <- c(alpha,beta,sig2)
for(i in 2:M){
  ## full conditional of alpha (Normal)
  var_alpha = 1/(1/tau2a + n/sig2)
  mean = var_alpha*(sum(y-beta*x)/sig2 + alpha0/tau2a)
  alpha = rnorm(1,mean,sqrt(var_alpha))
  ## full conditional of beta (Normal)
  var_beta = 1/(1/tau2b + sum(x^2)/sig2)
  mean = var_beta*(sum((y-alpha)*x)/sig2+beta0/tau2b)
  beta = rnorm(1,mean,sqrt(var_beta))
  ## full conditional of sig2 (inverse-gamma)
  sig2 = 1/rgamma(1,(nu0+n/2),(mu0+sum((y-alpha-beta*x)^2)/2))
  draws[i,] = c(alpha,beta,sig2)
}

Check Gibbs sampling result

# Markov chain + marginal posterior
names = c('alpha','beta','sig2')
colnames(draws) <- names
ind = 101:M
par(mfrow=c(3,3))
for(i in 1:3){
  ts.plot(draws[,i],xlab='iterations',ylab="",main=names[i])
  abline(v=ind[1],col=4)
  abline(h=true[i],col=2,lwd=2)
  acf(draws[ind,i],main="")
  hist(draws[ind,i],prob=T,main="",xlab="")
  abline(v=true[i],col=2,lwd=2)
}

Posterior mean

colMeans(draws[101:M,])
##       alpha        beta        sig2 
## -0.06394806  1.93993716  0.54674891
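As an informal sanity check (not from the original slides), the frequentist least-squares fit of the same simulated data can be compared with the posterior means:

fit <- lm(y ~ x)
coef(fit)               ## compare with the posterior means of alpha and beta
summary(fit)$sigma^2    ## residual variance, compare with the posterior mean of sig2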

MCMC tradeoff for number of iterations

Burn-in period (1)

A burn-in example from my own research

Autocorrelation

\[R_x(k) = \frac {\sum_{t=1}^{n-k} (x_t - \bar{x})(x_{t+k} - \bar{x})} {\sum_{t=1}^{n-k} (x_t - \bar{x})^2} \]
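A minimal sketch of this formula in R, applied to the MH samples x.acc5 from above (note that the built-in acf() uses a slightly different normalization, dividing by the full-sample variance):

Rx <- function(x, k){
  ## lag-k sample autocorrelation, following the formula above
  n <- length(x); xbar <- mean(x)
  sum((x[1:(n - k)] - xbar) * (x[(1 + k):n] - xbar)) /
    sum((x[1:(n - k)] - xbar)^2)
}
Rx(x.acc5, 1)    ## lag-1 autocorrelation of the MH chain
Rx(x.acc5, 10)   ## decays as the lag increases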

How to deal with autocorrelation?

High autocorrelation leads to a smaller effective sample size.
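Two common remedies are thinning the chain and monitoring the effective sample size. A brief sketch (the thinning interval of 10 is an arbitrary choice; effectiveSize() comes from the coda package, which is assumed to be installed):

x.thin <- x.acc5[seq(1, length(x.acc5), by = 10)]  ## keep every 10th draw
acf(x.acc5)    ## strong autocorrelation in the raw chain
acf(x.thin)    ## weaker autocorrelation after thinning

library(coda)
effectiveSize(as.mcmc(x.acc5))   ## effective sample size of the raw chain
effectiveSize(as.mcmc(x.thin))   ## compare with the thinned chain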

Converge by chain mixing

Converge by likelihood

Why is Gibbs a special case of MH with \(A(x'|x) = 1\)?

\[\begin{align*} A(x'_i, \textbf{x}_{-i} | x_i, \textbf{x}_{-i} ) &= \min \bigg( 1, \frac{p(x'_i,\textbf{x}_{-i}) q(x_i, \textbf{x}_{-i} | x'_i, \textbf{x}_{-i} ) } {p(x_i,\textbf{x}_{-i}) q(x'_i, \textbf{x}_{-i} | x_i, \textbf{x}_{-i} )} \bigg) \\ &= \min \bigg( 1, \frac{p(x'_i,\textbf{x}_{-i}) p(x_i| \textbf{x}_{-i} ) } {p(x_i,\textbf{x}_{-i}) p(x'_i| \textbf{x}_{-i} )} \bigg) \\ &= \min \bigg( 1, \frac{p(x'_i| \textbf{x}_{-i} ) p(\textbf{x}_{-i}) p(x_i| \textbf{x}_{-i} ) } {p(x_i| \textbf{x}_{-i} ) p( \textbf{x}_{-i}) p(x'_i |\textbf{x}_{-i} )} \bigg) \\ &= \min (1,1)\\ &= 1 \end{align*}\]

References