Biostatistical Computing, PHC 6068

Bootstrapping

Zhiguang Huo (Caleb)

Wednesday November 6, 2019

Outline

International Prize in Statistics Awarded to Bradley Efron, for his contribution in Bootstrapping (annouced in 11/12/2018)

Bootstrapping (Motivating example 1, variance of “mean estimator”)

\[\hat{Var}(\hat{\mu}) =\frac{1}{n} \hat{\sigma}^2 = \frac{1}{n(n-1)} \sum_{i=1}^n (X_i - \bar{X})^2\]

Bootstrapping (Motivating example 2, variance of “median estimator”)

Aim of bootstrapping: how to estimate variance

Variance of “median estimator”

Monte Carlo method

Variance of “median estimator”, Monte Carlo method

mu <- 1
sigma <- 1
n <- 100
B <- 1000
Ts <- numeric(B)

for(b in 1:B){
  set.seed(b)
  ax <- rnorm(n, mu, sigma)
  Ts[b] <- median(ax)
}

varTest <- var(Ts)
print(varTest)
## [1] 0.014573

Emperical distribution

Empirical process (visualization)

library(ggplot2)
n <- 1000
df <- data.frame(x = c(rnorm(n, 0, 1)))
base <- ggplot(df, aes(x)) + stat_ecdf()
base + stat_function(fun = pnorm, colour = "red") + xlim(c(-3,3))
## Warning: Removed 2 rows containing non-finite values (stat_ecdf).

Empirical process

Empirical distribution is close to the underlying distribution

Variance of the “median estimator”, Bootstrapping method

Instead of drawing samples from the underlying distribution \(F \sim N(\mu, \sigma^2)\), we draw from the empirical distribution \(F_n\)

How to sample from the empirical distribution?

Variance of “median estimator”, Bootstrapping method

mu <- 1
sigma <- 1
n <- 100
set.seed(32611)
X <- rnorm(n, mu, sigma)
B <- 1000
Ts <- numeric(B)

for(b in 1:B){
  set.seed(b)
  ax <- sample(X, replace = T)
  Ts[b] <- median(ax)
}

varTest <- var(Ts)
print(varTest)
## [1] 0.02598333

Bootstrapping Variance Estimator

  1. Draw a bootstrap sample \(X_1^*, \ldots, X_n^* \sim F_n\), where \(F_n\) is the emperical CDF. Compute \({T^*}_n = g(X_1^*, \ldots, X_n^*)\).
  2. Repeat the previous step \(B\) times, yielding estimators \({T^*}_n^{(1)}, \ldots, {T^*}_n^{(B)}\).
  3. Compute \[\hat{Var}_{F_n}({T}_n) = \frac{1}{B-1}\sum_{b=1}^B ({T^*}_n^{(b)} - \bar{T}^*)^2,\] where \(\bar{T}^* = \frac{1}{B}\sum_{b=1}^B {T^*}_n^{(b)}\)
  4. Output \(\hat{Var}_{F_n}({T}_n)\) as the bootstrap variance of \({T}_n\).

Why Bootstrapping variance works?

Since in general, we don’t know distribution \(F\), we will calculate using the empirical CDF \(F_n\).

To summarize:

The parametric Bootstrapping (Variance of “median estimator”)

The parametric Bootstrapping (Variance of “median estimator”)

mu <- 1
sigma <- 1
n <- 100
set.seed(32611)
X <- rnorm(n, mu, sigma)
muhat <- mean(X)
B <- 1000
Ts <- numeric(B)

for(b in 1:B){
  set.seed(b)
  ax <- rnorm(n,muhat,sigma)
  Ts[b] <- median(ax)
}

varTest <- var(Ts)
print(varTest)
## [1] 0.014573

Comparison study

Delta method

lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
var_hat1 <- 4*mean(X)^3/n
print(var_hat1)
## [1] 4.97006

Bootstrapping (non-parametric)

lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
B <- 1000
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- sample(X,n,replace = T)
  TB[b] <- (mean(aX))^2
}
var_hat2 <- var(TB)
print(var_hat2)
## [1] 4.44549

Parametric Bootstrapping

lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
lambdaHat <- mean(X)
B <- 1000
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- rpois(n, lambdaHat)
  TB[b] <- (mean(aX))^2
}
var_hat3 <- var(TB)
print(var_hat3)
## [1] 5.261807

Bootstrapping using R package

library(boot)
myMean <- function(data, indices){
  d <- data[indices]
  mean(d)^2
}

## non parametric bootstrap
set.seed(32611)
boot_nonpara <- boot(data=X, statistic = myMean, R = B)
var(boot_nonpara$t)
##          [,1]
## [1,] 4.495394
## parametric bootstrap
genPois <- function(data, lambda){
  rpois(length(data), lambda)
}
boot_para <- boot(data=X, statistic = myMean, R = B, sim="parametric", ran.gen = genPois, mle = mean(X))
var(boot_para$t)
##          [,1]
## [1,] 5.199379

Simulation

lambda <- 5
n <- 100
B <- 1000
Ts <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- rpois(n, lambda)
  Ts[b] <- (mean(aX))^2
}
print(var(Ts))
## [1] 5.28793

Summary

Summary

Method Need simulation? Need underlying parameter
Delta method N N
Non-parametric bootstrapping Y N
Parametric bootstrapping Y N
Simulation Y Y

Comparison between Simulation and Non-parametric bootstrapping

The rest of the percedures are the same

Bootstrap confidence interval

Bootstrapping confidence interval via Percentiles

  1. Draw a bootstrap sample \(X_1^*, \ldots, X_n^* \sim F_n\). Compute \({T^*}_n = g(X_1^*, \ldots, X_n^*)\).
  2. Repeat the previous step \(B\) times, yielding estimators \({T^*}_n^{(1)}, \ldots, {T^*}_n^{(B)}\).
  3. Rank \({T^*}_n^{(1)}, \ldots, {T^*}_n^{(B)}\) such that \({T^r}_n^{(1)} \le {T^r}_n^{(2)} \le \ldots \le {T^r}_n^{(B)}\)

We can define 95% confidence interval using (if B = 10,000) \[[{T^r}_n^{(250)}, {T^r}_n^{(9750)}]\]

Calculate Bootstrapping confidence interval via Percentiles (1)

lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
B <- 1000
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- sample(X,n,replace = T)
  TB[b] <- (mean(aX))^2
}
quantile(TB, c(0.025, 0.975))
##     2.5%    97.5% 
## 20.97411 29.16000

Performance of Bootstrapping confidence interval via Percentiles (2)

\[E(\bar{X}^2) = E(\bar{X})^2 + var(\bar{X}) = \lambda^2 + \lambda/n\]

lambda <- 5
n <- 100
truth <- lambda^2 + lambda/n
B <- 1000
Repeats <- 100

counts <- 0

plot(c(0,100),c(0,Repeats), type="n", xlab="boot CI", ylab="repeats index")
abline(v = truth, col=2)

for(r in 1:Repeats){
  set.seed(r)
  X <- rpois(n, lambda)
  TB <- numeric(B)
  for(b in 1:B){
    set.seed(b)
    aX <- sample(X,n,replace = T)
    TB[b] <- (mean(aX))^2
  }
  segments(quantile(TB, c(0.025)), r, quantile(TB, c(0.975)), r)
  if(quantile(TB, c(0.025)) < truth & truth < quantile(TB, c(0.975))){
    counts <- counts + 1
  }
}

counts/Repeats
## [1] 0.93

Calculation of Bootstrapping confidence interval via Percentiles (3)

library(boot)
myMean <- function(data, indices){
  d <- data[indices] ## in this example, data is a vector
  mean(d)^2
}
boot_nonpara <- boot(data=X, statistic = myMean, R = B)
boot.ci(boot_nonpara, type="perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_nonpara, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   (22.56, 30.80 )  
## Calculations and Intervals on Original Scale

Normal approximation

\[[\hat{T}_n - Z_{1 - \alpha/2}\hat{\sigma}_B, \hat{T}_n - Z_{\alpha/2}\hat{\sigma}_B],\] Where \(Z_\alpha = \Phi^{-1}(1-\alpha)\), \(\Phi\) is the cdf of standard Normal distribution.

Where \(\hat{T}_n\) is the estimator from the original sample and \(\hat{\sigma}_B\) is bootstrap se.

Implementation for Normal approximation

lambda <- 5
n <- 100
B <- 1000

set.seed(32611)
X <- rpois(n, lambda)
lambdaHat <- mean(X)
That <- lambdaHat^2 
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- sample(X,n,replace = T)
  TB[b] <- (mean(aX))^2
}
ci_l <- That - 1.96*sd(TB)
ci_u <- That + 1.96*sd(TB)

c(ci_l, ci_u)
## [1] 20.76757 29.03263

Evaluation for Normal approximation

lambda <- 5
n <- 100
truth <- lambda^2 
B <- 1000
Repeats <- 100

counts <- 0

plot(c(0,100),c(0,Repeats), type="n", xlab="boot CI", ylab="repeats index")
abline(v = truth, col=2)

for(r in 1:Repeats){
  set.seed(r)
  X <- rpois(n, lambda)
  lambdaHat <- mean(X)
  That <- lambdaHat^2 
  TB <- numeric(B)
  for(b in 1:B){
    set.seed(b)
    aX <- sample(X,n,replace = T)
    TB[b] <- (mean(aX))^2
  }
  ci_l <- That - 1.96*sd(TB)
  ci_u <- That + 1.96*sd(TB)
  segments(ci_l, r, ci_u, r)
  if(ci_l < truth & truth < ci_u){
    counts <- counts + 1
  }
}

counts/Repeats
## [1] 0.93

Bootstrapping confidence interval via Pivotal Intervals

Summary Bootstrapping confidence interval

Precedure Theoritical guarantee Fast R package Boot?
Percentiles No Yes Yes
Pivotal Intervals Yes No Yes
Pivotal Intervals (simplified, no se) Yes Yes No
normal approximation Yes Yes Yes

HW, Large scale Bootstrapping exercise

Reference