Zhiguang Huo (Caleb)
Wednesday November 14, 2018
International Prize in Statistics awarded to Bradley Efron for his contributions to the bootstrap (announced 11/12/2018)
\widehat{\operatorname{Var}}(\hat{\mu}) = \frac{1}{n}\hat{\sigma}^2 = \frac{1}{n(n-1)}\sum_{i=1}^{n}(X_i - \bar{X})^2
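For the sample mean, this variance estimate has a closed form and can be computed directly from a single sample. A minimal sketch (settings chosen to match the simulations below):

mu <- 1
sigma <- 1
n <- 100
set.seed(32611)
X <- rnorm(n, mu, sigma)
# plug-in estimate of Var(mu_hat) = sigma_hat^2 / n
var_muhat <- sum((X - mean(X))^2) / (n * (n - 1))
print(var_muhat)  # should be close to sigma^2 / n = 0.01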
Monte Carlo method
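The sample median has no such simple closed-form variance, but if we can draw from the true distribution N(\mu, \sigma^2), we can estimate its variance by simulation: generate B independent datasets, compute the median of each, and take the sample variance of the B medians.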
mu <- 1
sigma <- 1
n <- 100
B <- 1000
Ts <- numeric(B)
for(b in 1:B){
  set.seed(b)
  ax <- rnorm(n, mu, sigma)
  Ts[b] <- median(ax)
}
varTest <- var(Ts)
print(varTest)
## [1] 0.014573
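As a check (not on the original slide): asymptotic theory gives \operatorname{Var}(\mathrm{median}) \approx \pi\sigma^2/(2n) for normal data, since the asymptotic variance of the sample median is 1/(4nf(m)^2) with f the density at the true median m.

pi * sigma^2 / (2 * n)
## [1] 0.01570796

The Monte Carlo estimate above (0.014573) is close to this value.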
library(ggplot2)
n <- 1000
df <- data.frame(x = c(rnorm(n, 0, 1)))
base <- ggplot(df, aes(x)) + stat_ecdf()
base + stat_function(fun = pnorm, colour = "red") + xlim(c(-3,3))
## Warning: Removed 2 rows containing non-finite values (stat_ecdf).
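The warning is expected: xlim(c(-3, 3)) drops the few simulated points that fall outside [-3, 3] before plotting.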
Glivenko-Cantelli Theorem: \sup_x |F_n(x) - F(x)| \to 0 almost surely as n \to \infty
Dvoretzky-Kiefer-Wolfowitz inequality: for any \varepsilon > 0, P(\sup_x | F_n(x) - F(x)| > \varepsilon) \le 2\exp(-2n\varepsilon^2)
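A minimal simulation sketch of the DKW bound (not from the original slides; n, eps, and B are illustrative choices):

# estimate P(sup_x |F_n(x) - F(x)| > eps) for N(0,1) samples
n <- 100
eps <- 0.1
B <- 1000
set.seed(32611)
exceed <- replicate(B, {
  x <- sort(rnorm(n))
  # sup_x |F_n(x) - F(x)| is attained at a jump of F_n:
  # check both sides of each jump
  Dn <- max(abs(seq_len(n)/n - pnorm(x)), abs((seq_len(n) - 1)/n - pnorm(x)))
  Dn > eps
})
mean(exceed)             # empirical exceedance probability
2 * exp(-2 * n * eps^2)  # DKW upper bound, about 0.271

The empirical probability should fall below the bound.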
Instead of drawing samples from the underlying distribution F = N(\mu, \sigma^2), we draw from the empirical distribution F_n:
mu <- 1
sigma <- 1
n <- 100
set.seed(32611)
X <- rnorm(n, mu, sigma)
B <- 1000
Ts <- numeric(B)
for(b in 1:B){
  set.seed(b)
  ax <- sample(X, replace = T)
  Ts[b] <- median(ax)
}
varTest <- var(Ts)
print(varTest)
## [1] 0.02916533
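For this particular sample the bootstrap estimate (0.029) is noticeably larger than the Monte Carlo value obtained from the true distribution (0.0146); the bootstrap sees only the single observed sample X, so its estimate inherits that sample's noise.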
Since in general we don't know the distribution F, we calculate using the empirical CDF F_n instead.
Parametric alternative: estimate \hat{\mu} from the observed sample and draw bootstrap samples from the fitted distribution N(\hat{\mu}, \sigma^2) rather than resampling from F_n:
mu <- 1
sigma <- 1
n <- 100
set.seed(32611)
X <- rnorm(n, mu, sigma)
muhat <- mean(X)
B <- 1000
Ts <- numeric(B)
for(b in 1:B){
  set.seed(b)
  ax <- rnorm(n, muhat, sigma)
  Ts[b] <- median(ax)
}
varTest <- var(Ts)
print(varTest)
## [1] 0.014573
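Next example: X_1, \ldots, X_n \sim \mathrm{Poisson}(\lambda) and the statistic is T = \bar{X}^2. A reasoning step that is implicit in the code below: by the delta method, \operatorname{Var}(\bar{X}^2) \approx (2\lambda)^2 \operatorname{Var}(\bar{X}) = 4\lambda^3/n, estimated by plugging in \bar{X} for \lambda.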
lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
var_hat1 <- 4*mean(X)^3/n  # delta-method plug-in estimate of Var(Xbar^2)
print(var_hat1)
## [1] 4.97006
lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
B <- 1000
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- sample(X, n, replace = T)
  TB[b] <- (mean(aX))^2
}
var_hat2 <- var(TB)
print(var_hat2)
## [1] 4.389935
lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
lambdaHat <- mean(X)
B <- 1000
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- rpois(n, lambdaHat)
  TB[b] <- (mean(aX))^2
}
var_hat3 <- var(TB)
print(var_hat3)
## [1] 5.261807
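All three approaches target the same quantity; with \lambda = 5 and n = 100 the delta-method value at the true parameter is 4\lambda^3/n = 5, and the three estimates (4.97, 4.39, 5.26) scatter around it. The boot package reproduces both bootstrap versions: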
library(boot)
myMean <- function(data, indices){
  d <- data[indices]  # indices: resample indices supplied by boot()
  mean(d)^2           # the statistic: squared sample mean
}
## non-parametric bootstrap
set.seed(32611)
boot_nonpara <- boot(data=X, statistic = myMean, R = B)
var(boot_nonpara$t)
## [,1]
## [1,] 4.311527
## parametric bootstrap
genPois <- function(data, lambda){
  rpois(length(data), lambda)  # generate one parametric bootstrap sample
}
boot_para <- boot(data=X, statistic = myMean, R = B, sim="parametric", ran.gen = genPois, mle = mean(X))
var(boot_para$t)
## [,1]
## [1,] 5.147477
We can define the 95% confidence interval using the quantiles of the bootstrap replicates: if B = 10,000, take [T_n^{(250)}, T_n^{(9750)}], where T_n^{(b)} denotes the b-th smallest of the B bootstrap statistics.
lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
B <- 1000
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- sample(X, n, replace = T)
  TB[b] <- (mean(aX))^2
}
quantile(TB, c(0.025, 0.975))
## 2.5% 97.5%
## 21.0681 29.2681
The truth can be computed analytically: E(\bar{X}^2) = (E\bar{X})^2 + \operatorname{Var}(\bar{X}) = \lambda^2 + \lambda/n, since \operatorname{Var}(\bar{X}) = \lambda/n for Poisson data.
lambda <- 5
n <- 100
truth <- lambda^2 + lambda/n
B <- 1000
Repeats <- 100
counts <- 0
plot(c(0,100),c(0,Repeats), type="n", xlab="boot CI", ylab="repeats index")
abline(v = truth, col=2)
for(r in 1:Repeats){
  set.seed(r)
  X <- rpois(n, lambda)
  TB <- numeric(B)
  for(b in 1:B){
    set.seed(b)
    aX <- sample(X, n, replace = T)
    TB[b] <- (mean(aX))^2
  }
  segments(quantile(TB, c(0.025)), r, quantile(TB, c(0.975)), r)
  if(quantile(TB, c(0.025)) < truth & truth < quantile(TB, c(0.975))){
    counts <- counts + 1
  }
}
counts/Repeats  # proportion of intervals covering the truth
## [1] 0.93
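The empirical coverage (0.93) is close to the nominal 95% level.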
library(boot)
myMean <- function(data, indices){
  d <- data[indices]
  mean(d)^2
}
boot_nonpara <- boot(data=X, statistic = myMean, R = B)
boot.ci(boot_nonpara, type="perc")
## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
##
## CALL :
## boot.ci(boot.out = boot_nonpara, type = "perc")
##
## Intervals :
## Level Percentile
## 95% (22.85, 31.02 )
## Calculations and Intervals on Original Scale
[\hat{T}_n - Z_{1-\alpha/2}\hat{\sigma}_B, \hat{T}_n + Z_{1-\alpha/2}\hat{\sigma}_B], where Z_{1-\alpha/2} = \Phi^{-1}(1 - \alpha/2) and \Phi is the CDF of the standard normal distribution; \hat{T}_n is the estimator from the original sample and \hat{\sigma}_B is the bootstrap standard error.
lambda <- 5
n <- 100
B <- 1000
set.seed(32611)
X <- rpois(n, lambda)
lambdaHat <- mean(X)
That <- lambdaHat^2
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- sample(X, n, replace = T)
  TB[b] <- (mean(aX))^2
}
ci_l <- That - 1.96*sd(TB)
ci_u <- That + 1.96*sd(TB)
c(ci_l, ci_u)
## [1] 20.79347 29.00673
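For this sample the normal-approximation interval (20.79, 29.01) is close to the percentile interval (21.07, 29.27) obtained earlier.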
lambda <- 5
n <- 100
truth <- lambda^2
B <- 1000
Repeats <- 100
counts <- 0
plot(c(0,100),c(0,Repeats), type="n", xlab="boot CI", ylab="repeats index")
abline(v = truth, col=2)
for(r in 1:Repeats){
  set.seed(r)
  X <- rpois(n, lambda)
  lambdaHat <- mean(X)
  That <- lambdaHat^2
  TB <- numeric(B)
  for(b in 1:B){
    set.seed(b)
    aX <- sample(X, n, replace = T)
    TB[b] <- (mean(aX))^2
  }
  ci_l <- That - 1.96*sd(TB)
  ci_u <- That + 1.96*sd(TB)
  segments(ci_l, r, ci_u, r)
  if(ci_l < truth & truth < ci_u){
    counts <- counts + 1
  }
}
counts/Repeats  # proportion of intervals covering the truth
## [1] 0.93
Procedure | Theoretical guarantee | Fast | In R package boot? |
---|---|---|---|
Percentiles | No | Yes | Yes |
Pivotal intervals | Yes | No | Yes |
Pivotal intervals (simplified, no se) | Yes | Yes | No |
Normal approximation | Yes | Yes | Yes |
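As a cross-check of the table, boot.ci can compute several of these intervals in one call. A sketch reusing X, myMean, and B from the chunks above ("basic" is boot's name for the pivotal interval):

library(boot)
set.seed(32611)
boot_nonpara <- boot(data = X, statistic = myMean, R = B)
boot.ci(boot_nonpara, type = c("norm", "basic", "perc"))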