Outline

Bootstrapping
- Bootstrapping variance, confidence interval
- non-parametric Bootstrapping, parametric Bootstrapping

International Prize in Statistics awarded to Bradley Efron for his contributions to bootstrapping (announced on 11/12/2018)

Bootstrapping (Motivating example 1, variance of “mean estimator”)

Given samples \(X_1, \ldots, X_n\) from \(N(\mu, \sigma^2)\), how do we estimate the mean \(\mu\) of this distribution?
- Use the sample mean \(\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i\). By the weak law of large numbers: \[\hat{\mu} = \frac{1}{n} \sum_{i=1}^n X_i \rightarrow E(X_i) = \mu\]
How do we estimate the variance of \(\hat{\mu}\)?
- Since the \(X_i\) are independent, the variance of the sum is the sum of the variances, and \(\sigma^2\) can then be replaced by the sample variance \(\hat{\sigma}^2\). \[Var(\hat{\mu}) = Var(\frac{1}{n} \sum_{i=1}^n X_i) = \frac{1}{n^2} \sum_{i=1}^n Var(X_i) = \frac{1}{n} \sigma^2 \]

\[\hat{Var}(\hat{\mu}) =\frac{1}{n} \hat{\sigma}^2 = \frac{1}{n(n-1)} \sum_{i=1}^n (X_i - \bar{X})^2\]
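
As a quick numerical check of the formula above, the following R sketch computes the plug-in estimate directly and compares it with R's built-in var(). The sample size, parameters and seed are arbitrary choices made for illustration only.

```r
# Illustrative values (assumptions, not taken from the slides)
set.seed(1)
n <- 100
X <- rnorm(n, mean = 1, sd = 1)

mu_hat <- mean(X)                                    # estimate of the mean
var_mu_hat <- sum((X - mean(X))^2) / (n * (n - 1))   # estimated variance of mu_hat

# var(X) uses the n - 1 denominator, so var(X) / n gives the same value
all.equal(var_mu_hat, var(X) / n)
```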

Bootstrapping (Motivating example 2, variance of “median estimator”)

Given samples \(X_1, \ldots, X_n\) from \(N(\mu, \sigma_0^2)\), how do we estimate the median of this distribution?
- Use the sample median \(T_n = median_{i=1}^n(X_i)\) as the statistic.
How do we estimate \(V_F(T_n)\), the variance of \(T_n\)?
- Seems difficult.

Aim of bootstrapping: how to estimate variance

Suppose \(X_1, \ldots, X_n \sim F\) where \(F\) is a distribution.
Let \(T_n = g(X_1, \ldots, X_n)\) be a statistic, where \(T_n\) is a function of the data.
What is the variance of \(T_n\), \(V_F(T_n)\)?

Variance of “median estimator”

Given samples \(X_1, \ldots, X_n\) from \(N(\mu_0, \sigma_0^2)\), where both \(\mu_0\) and \(\sigma_0^2\) are known, how do we estimate the median of this distribution?
- Let the statistic \(T_n = median_{i=1}^n(X_i)\) be the sample median of \(X_1, \ldots, X_n\).
How do we estimate the variance of \(T_n\)?
- Because the distribution is fully known, we can use Monte Carlo simulation to estimate the variance of \(T_n\).

Monte Carlo method

For \(b = 1, \ldots, B\)
- draw \(X^{b}_1, \ldots, X^{b}_n \sim N(\mu_0,\sigma_0^2)\)
- Compute \(T^{(b)}_n = median_{i=1}^n(X^{b}_i)\)
\(\bar{T}_n = \frac{1}{B} \sum_{b=1}^B T^{(b)}_n\)
\(\hat{V}_F (T_n) = \frac{1}{B - 1} \sum_{b=1}^B \{T^{(b)}_n - \bar{T}_n \}^2\)
By the law of large numbers, \(\hat{V}_F (T_n) \rightarrow V_F(T_n)\) as \(B \rightarrow \infty\)

Variance of “median estimator”, Monte Carlo method

mu <- 1
sigma <- 1
n <- 100
B <- 1000
Ts <- numeric(B)

for(b in 1:B){
  set.seed(b)
  ax <- rnorm(n, mu, sigma)
  Ts[b] <- median(ax)
}

varTest <- var(Ts)
print(varTest)

## [1] 0.014573

What if we don’t know \(\mu\) and \(\sigma^2\)?
- We only have observed data.

Empirical distribution

\(F(x) = P(X \le x)\) is a distribution function.
We can estimate \(F\) with the empirical distribution function \(F_n\), the cdf that puts mass \(1/n\) at each data point \(X_i\). \[F_n (x) = \frac{1}{n} \sum_{i = 1}^n I(X_i \le x)\] where \[I(X_i \le x) = \begin{cases} 1, & \text{if}\ X_i \le x \\ 0, & \text{if}\ X_i > x \end{cases}\]
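
The definition of \(F_n\) translates almost word for word into R. The sketch below evaluates it by hand and with the built-in ecdf() function; the helper name Fn_manual, the simulated data and the evaluation point are illustrative assumptions.

```r
set.seed(1)
X <- rnorm(50)   # illustrative data

# Direct translation of F_n(x) = (1/n) * sum I(X_i <= x)
# (Fn_manual is a hypothetical helper name used only in this sketch)
Fn_manual <- function(x, data) mean(data <= x)

# Built-in empirical CDF from the stats package; returns a function
Fn <- ecdf(X)

x0 <- 0.3        # arbitrary evaluation point
c(manual = Fn_manual(x0, X), builtin = Fn(x0))
```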

Empirical process (visualization)

library(ggplot2)
n <- 1000
df <- data.frame(x = c(rnorm(n, 0, 1)))
base <- ggplot(df, aes(x)) + stat_ecdf()
base + stat_function(fun = pnorm, colour = "red") + xlim(c(-3,3))

## Warning: Removed 2 rows containing non-finite values (stat_ecdf).

Empirical process

Empirical distribution is close to the underlying distribution

Glivenko-Cantelli Theorem: as \(n \rightarrow \infty\), almost surely \[\sup_x | F_n(x) - F(x)| \rightarrow 0\]
Dvoretzky-Kiefer-Wolfowitz inequality, for any \(\varepsilon > 0\) \[P(\sup_x | F_n(x) - F(x)| > \varepsilon) \le 2\exp(-2n\varepsilon^2)\]
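
Setting the right-hand side of the Dvoretzky-Kiefer-Wolfowitz bound equal to \(\alpha\) and solving for \(\varepsilon\) gives \(\varepsilon = \sqrt{\log(2/\alpha)/(2n)}\). The R sketch below computes this half-width and the observed sup-distance for one simulated standard normal sample; the sample size, \(\alpha\) and seed are assumptions chosen for illustration.

```r
set.seed(1)
n <- 1000
alpha <- 0.05
X <- rnorm(n)

# Half-width from solving 2 * exp(-2 * n * eps^2) = alpha for eps
eps <- sqrt(log(2 / alpha) / (2 * n))

# sup_x |F_n(x) - F(x)| is attained at a data point, either at the jump
# (F_n = i/n) or just before it (F_n = (i - 1)/n)
xs <- sort(X)
Fx <- pnorm(xs)
sup_dist <- max(pmax(abs((1:n) / n - Fx), abs((0:(n - 1)) / n - Fx)))

c(sup_dist = sup_dist, eps = eps)
```

The inequality says the observed distance exceeds eps with probability at most \(\alpha\), so for most samples sup_dist falls below eps.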

Variance of the “median estimator”, Bootstrapping method

Instead of drawing samples from the underlying distribution \(F = N(\mu, \sigma^2)\), we draw from the empirical distribution \(F_n\).

For \(b = 1, \ldots, B\)
- draw \(X^{b}_1, \ldots, X^{b}_n \sim F_n\) (rather than \(N(\mu, \sigma^2)\))
- Compute \(T^{(b)}_n = median_{i=1}^n(X^{b}_i)\)
\(\bar{T}_n = \frac{1}{B} \sum_{b=1}^B T^{(b)}_n\)
\(\hat{V}_{F_n} (T_n) = \frac{1}{B-1} \sum_{b=1}^B \{T^{(b)}_n - \bar{T}_n \}^2\)

How to sample from the empirical distribution?

Drawing \(X_1^*, \ldots, X_n^*\) from \(F_n\) is equivalent to drawing \(n\) observations with replacement from the original data \(\{X_1, \ldots, X_n\}\).
Therefore, bootstrap sampling is also described as resampling the data.
For \(b = 1, \ldots, B\)
- draw \(X^{b}_1, \ldots, X^{b}_n \sim \{ X_1, \ldots, X_n \}\) with replacement.
- Compute \(T^{(b)}_n = median_{i=1}^n(X^{b}_i)\)
\(\bar{T}_n = \frac{1}{B} \sum_{b=1}^B T^{(b)}_n\)
\(\hat{V}_{F_n} (T_n) = \frac{1}{B-1} \sum_{b=1}^B \{T^{(b)}_n - \bar{T}_n \}^2\)

Variance of “median estimator”, Bootstrapping method

mu <- 1
sigma <- 1
n <- 100
set.seed(32611)
X <- rnorm(n, mu, sigma)
B <- 1000
Ts <- numeric(B)

for(b in 1:B){
  set.seed(b)
  ax <- sample(X, replace = T)
  Ts[b] <- median(ax)
}

varTest <- var(Ts)
print(varTest)

## [1] 0.02598333

Bootstrapping Variance Estimator

Draw a bootstrap sample \(X_1^*, \ldots, X_n^* \sim F_n\), where \(F_n\) is the empirical CDF. Compute \({T^*}_n = g(X_1^*, \ldots, X_n^*)\).
Repeat the previous step \(B\) times, yielding estimators \({T^*}_n^{(1)}, \ldots, {T^*}_n^{(B)}\).
Compute \[\hat{Var}_{F_n}({T}_n) = \frac{1}{B-1}\sum_{b=1}^B ({T^*}_n^{(b)} - \bar{T}^*)^2,\] where \(\bar{T}^* = \frac{1}{B}\sum_{b=1}^B {T^*}_n^{(b)}\)
Output \(\hat{Var}_{F_n}({T}_n)\) as the bootstrap variance of \({T}_n\).
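
The steps above work for any statistic \(g\), so they can be wrapped in a small reusable function. This is a minimal sketch: boot_var is a hypothetical helper name, and the data, seed and default B are illustrative assumptions.

```r
# boot_var is a hypothetical helper name used only for this sketch
boot_var <- function(x, stat, B = 1000) {
  n <- length(x)
  # each replicate resamples n observations with replacement, i.e. draws from F_n
  t_star <- replicate(B, stat(sample(x, n, replace = TRUE)))
  var(t_star)   # uses the 1 / (B - 1) denominator, as in the formula above
}

set.seed(32611)
X <- rnorm(100, mean = 1, sd = 1)
boot_var(X, median)   # bootstrap variance of the sample median
```

Passing the statistic as a function argument means the same helper covers the median, the squared mean, or any other function of the data.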

Why does the bootstrap variance work?

\(T_n = g(X_1, \ldots, X_n)\)
\(mean_F(T_n) = \int g(x_1, \ldots, x_n) dF(x_1) \cdots dF(x_n)\)
\(Var_F(T_n) = \int (g(x_1, \ldots, x_n) - mean_F(T_n))^2 dF(x_1) \cdots dF(x_n)\)

Since we generally do not know the distribution \(F\), we compute the variance under the empirical CDF \(F_n\) instead.

\(Var_{F_n}(T_n) = \int (g(x_1, \ldots, x_n) - mean_{F_n}(T_n))^2 dF_n(x_1) \cdots dF_n(x_n)\)
Finally, we use the bootstrap variance \(\hat{Var}_{F_n}(T_n)\) to estimate \(Var_{F_n}(T_n)\).

To summarize:

Estimation error: \(Var_F(T_n) - Var_{F_n}(T_n) = O_p(1/\sqrt{n})\)
Simulation error: \(Var_{F_n}(T_n) - \hat{Var}_{F_n}(T_n) = O_p(1/\sqrt{B})\)
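
The simulation error can be seen in isolation by holding the observed sample fixed (so \(F_n\) is fixed) and repeating the bootstrap with different values of \(B\). The sketch below is illustrative only: the helper name boot_var_median, the values of \(B\) and the 30 repetitions are assumptions.

```r
set.seed(32611)
X <- rnorm(100, mean = 1, sd = 1)   # one fixed observed sample, so F_n is fixed

# one bootstrap variance estimate of the median, using B resamples
# (boot_var_median is a hypothetical helper name)
boot_var_median <- function(x, B) {
  var(replicate(B, median(sample(x, length(x), replace = TRUE))))
}

Bs <- c(50, 200, 800)
# repeat each estimate 30 times; the spread reflects simulation error only
spread <- sapply(Bs, function(B) sd(replicate(30, boot_var_median(X, B))))
names(spread) <- paste0("B=", Bs)
spread
```

The spread should shrink roughly like \(1/\sqrt{B}\). The estimation error is untouched by this, because it depends only on \(n\).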

The parametric Bootstrapping (Variance of “median estimator”)

Calculate the maximum likelihood estimate \(\hat{\mu} = \arg\max_{\mu} L(\mu;X_1^n)\), where \(\sigma\) is assumed known.
For \(b = 1, \ldots, B\)
- draw \(X^{b}_1, \ldots, X^{b}_n \sim N(\hat{\mu},\sigma^2)\) (\(\hat{F}\))
- Compute \(\hat{T}^{(b)}_n = median_{i=1}^n(X^{b}_i)\)
\(\bar{\hat{T}}_n = \frac{1}{B} \sum_{b=1}^B \hat{T}^{(b)}_n\)
\(\hat{V}_{\hat{F}} (T_n) = \frac{1}{B-1} \sum_{b=1}^B \{\hat{T}^{(b)}_n - \bar{\hat{T}}_n \}^2\)

The parametric Bootstrapping (Variance of “median estimator”)

mu <- 1
sigma <- 1
n <- 100
set.seed(32611)
X <- rnorm(n, mu, sigma)
muhat <- mean(X)
B <- 1000
Ts <- numeric(B)

for(b in 1:B){
  set.seed(b)
  ax <- rnorm(n,muhat,sigma)
  Ts[b] <- median(ax)
}

varTest <- var(Ts)
print(varTest)

## [1] 0.014573
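
The parametric bootstrap output above coincides with the Monte Carlo output obtained earlier. A plausible explanation, assuming each normal draw is generated as the mean plus the standard deviation times a standard normal draw \(Z_i^{b}\), and noting that both scripts call set.seed(b) before drawing: \[median_{i}(\hat{\mu} + \sigma Z_i^{b}) = \hat{\mu} + \sigma \, median_{i}(Z_i^{b})\] Replacing \(\mu\) by \(\hat{\mu}\) therefore shifts every replicate by the same constant and leaves the variance across \(b\) unchanged. With independent seeds, the two estimates would agree only up to simulation error.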

Comparison study

\(X_1, \ldots, X_n \sim POI(\lambda)\) with \(\lambda = 5\). We only observe the data \(X_1, \ldots, X_n\) and know that they come from a Poisson distribution.
\(n = 100\)
\(T_n = (\frac{1}{n}\sum_{i=1}^nX_i)^2\)
What is the variance of \(T_n\)?
Methods:
- delta method
- non-parametric Bootstrapping
- parametric Bootstrapping
- simulation (requires knowing the underlying parameter \(\lambda\))

Delta method

\(\hat{\lambda} = \bar{X}\)
\(var(\bar{X}) = \frac{\lambda}{n}\)
\(var(T_n) = var(\bar{X}^2) \approx (2 \lambda)^2 \times var(\bar{X}) = \frac{4\lambda^3}{n}\)
\(\hat{var}_F(T_n) = \frac{4\hat{\lambda}^3}{n} = \frac{4\bar{X}^3}{n}\)
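
The factor \((2\lambda)^2\) comes from a first-order Taylor expansion of \(g(x) = x^2\) around \(\lambda\). A short sketch of the step, under the usual assumption that \(\bar{X}\) is close to \(\lambda\) for large \(n\): \[g(\bar{X}) \approx g(\lambda) + g'(\lambda)(\bar{X} - \lambda) \quad \Rightarrow \quad var(g(\bar{X})) \approx [g'(\lambda)]^2 var(\bar{X}) = (2\lambda)^2 \frac{\lambda}{n} = \frac{4\lambda^3}{n}\] With the true values \(\lambda = 5\) and \(n = 100\), this gives \(4 \times 125 / 100 = 5\).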

lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
var_hat1 <- 4*mean(X)^3/n
print(var_hat1)

## [1] 4.97006

Bootstrapping (non-parametric)

lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
B <- 1000
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- sample(X,n,replace = T)
  TB[b] <- (mean(aX))^2
}
var_hat2 <- var(TB)
print(var_hat2)

## [1] 4.44549

Parametric Bootstrapping

lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
lambdaHat <- mean(X)
B <- 1000
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- rpois(n, lambdaHat)
  TB[b] <- (mean(aX))^2
}
var_hat3 <- var(TB)
print(var_hat3)

## [1] 5.261807

Bootstrapping using R package

library(boot)
myMean <- function(data, indices){
  d <- data[indices]
  mean(d)^2
}

## non parametric bootstrap
set.seed(32611)
boot_nonpara <- boot(data=X, statistic = myMean, R = B)
var(boot_nonpara$t)

##          [,1]
## [1,] 4.495394

## parametric bootstrap
genPois <- function(data, lambda){
  rpois(length(data), lambda)
}
boot_para <- boot(data=X, statistic = myMean, R = B, sim="parametric", ran.gen = genPois, mle = mean(X))
var(boot_para$t)

##          [,1]
## [1,] 5.199379

Simulation

lambda <- 5
n <- 100
B <- 1000
Ts <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- rpois(n, lambda)
  Ts[b] <- (mean(aX))^2
}
print(var(Ts))

## [1] 5.28793

Summary

Assume \(\sigma\) is known.
For \(b = 1, \ldots, B\)
- draw \(X^{b}_1, \ldots, X^{b}_n\) from
  - Simulation: \(N(\mu,\sigma^2)\)
  - Non-parametric bootstrapping: \(X^{b}_1, \ldots, X^{b}_n \sim \{ X_1, \ldots, X_n \}\) with replacement.
  - Parametric bootstrapping: \(N(\hat{\mu},\sigma^2)\)
- Compute \(T^{(b)}_n = median_{i=1}^n(X^{b}_i)\)
\(\bar{T}_n = \frac{1}{B} \sum_{b=1}^B T^{(b)}_n\)
\(\hat{V}_{F_n} (T_n) = \frac{1}{B-1} \sum_{b=1}^B \{T^{(b)}_n - \bar{T}_n \}^2\)

Summary

Goal: to estimate the variance of an estimator

Method	Need simulation?	Need underlying parameter?
Delta method	N	N
Non-parametric bootstrapping	Y	N
Parametric bootstrapping	Y	N
Simulation	Y	Y

Comparison between Simulation and Non-parametric bootstrapping

Simulation: Monte Carlo Simulation from underlying distribution.
Non-parametric bootstrapping: Monte Carlo Simulation from the empirical distribution.

The rest of the procedure is the same.

Bootstrap confidence interval

Percentiles
Normal approximation
Pivotal intervals

Bootstrapping confidence interval via Percentiles

Draw a bootstrap sample \(X_1^*, \ldots, X_n^* \sim F_n\). Compute \({T^*}_n = g(X_1^*, \ldots, X_n^*)\).
Repeat the previous step \(B\) times, yielding estimators \({T^*}_n^{(1)}, \ldots, {T^*}_n^{(B)}\).
Rank \({T^*}_n^{(1)}, \ldots, {T^*}_n^{(B)}\) such that \({T^r}_n^{(1)} \le {T^r}_n^{(2)} \le \ldots \le {T^r}_n^{(B)}\)

With \(B = 10,000\), for example, a 95% confidence interval is given by the 250th and 9750th ranked values: \[[{T^r}_n^{(250)}, {T^r}_n^{(9750)}]\]
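
For a general \(B\) and level \(1 - \alpha\), the two ranks are \(B\alpha/2\) and \(B(1 - \alpha/2)\). The sketch below picks them out of a sorted vector. The vector t_star is only a stand-in for real bootstrap replicates, and the rounding guards against floating-point error in the index arithmetic.

```r
set.seed(1)
B <- 10000
alpha <- 0.05
t_star <- rnorm(B)   # stand-in for B bootstrap replicates (purely illustrative)

t_sorted <- sort(t_star)
lower_idx <- round(B * alpha / 2)         # rank of the lower end point
upper_idx <- round(B * (1 - alpha / 2))   # rank of the upper end point
ci <- c(t_sorted[lower_idx], t_sorted[upper_idx])
ci
```

R's quantile() interpolates between order statistics by default, so its result can differ slightly from these ranked values.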

Calculate Bootstrapping confidence interval via Percentiles (1)

lambda <- 5
n <- 100
set.seed(32611)
X <- rpois(n, lambda)
B <- 1000
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- sample(X,n,replace = T)
  TB[b] <- (mean(aX))^2
}
quantile(TB, c(0.025, 0.975))

##     2.5%    97.5% 
## 20.97411 29.16000

Performance of Bootstrapping confidence interval via Percentiles (2)

Underlying truth:

\[E(\bar{X}^2) = [E(\bar{X})]^2 + var(\bar{X}) = \lambda^2 + \lambda/n\]

lambda <- 5
n <- 100
truth <- lambda^2 + lambda/n
B <- 1000
Repeats <- 100

counts <- 0

plot(c(0,100),c(0,Repeats), type="n", xlab="boot CI", ylab="repeats index")
abline(v = truth, col=2)

for(r in 1:Repeats){
  set.seed(r)
  X <- rpois(n, lambda)
  TB <- numeric(B)
  for(b in 1:B){
    set.seed(b)
    aX <- sample(X,n,replace = T)
    TB[b] <- (mean(aX))^2
  }
  segments(quantile(TB, c(0.025)), r, quantile(TB, c(0.975)), r)
  if(quantile(TB, c(0.025)) < truth && truth < quantile(TB, c(0.975))){
    counts <- counts + 1
  }
}

counts/Repeats

## [1] 0.93

Calculation of Bootstrapping confidence interval via Percentiles (3)

We can also obtain the percentile CI with the boot package.

library(boot)
myMean <- function(data, indices){
  d <- data[indices] ## in this example, data is a vector
  mean(d)^2
}
boot_nonpara <- boot(data=X, statistic = myMean, R = B)
boot.ci(boot_nonpara, type="perc")

## BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
## Based on 1000 bootstrap replicates
## 
## CALL : 
## boot.ci(boot.out = boot_nonpara, type = "perc")
## 
## Intervals : 
## Level     Percentile     
## 95%   (22.56, 30.80 )  
## Calculations and Intervals on Original Scale

Normal approximation

\[[\hat{T}_n - Z_{1 - \alpha/2}\hat{\sigma}_B, \hat{T}_n - Z_{\alpha/2}\hat{\sigma}_B],\] where \(Z_\alpha = \Phi^{-1}(\alpha)\) is the \(\alpha\) quantile of the standard normal distribution and \(\Phi\) is its cdf.

\(Z_{0.025} = -1.96\)
\(Z_{0.975} = 1.96\)

Here \(\hat{T}_n\) is the estimator computed from the original sample and \(\hat{\sigma}_B\) is the bootstrap standard error.
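
The two quantiles quoted above can be obtained in R with qnorm(), the standard normal quantile function. The variable names z_lo and z_hi below are illustrative.

```r
alpha <- 0.05
z_lo <- qnorm(alpha / 2)       # Z_{0.025}
z_hi <- qnorm(1 - alpha / 2)   # Z_{0.975}
round(c(z_lo, z_hi), 2)

## [1] -1.96  1.96
```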

Implementation for Normal approximation

lambda <- 5
n <- 100
B <- 1000

set.seed(32611)
X <- rpois(n, lambda)
lambdaHat <- mean(X)
That <- lambdaHat^2 
TB <- numeric(B)
for(b in 1:B){
  set.seed(b)
  aX <- sample(X,n,replace = T)
  TB[b] <- (mean(aX))^2
}
ci_l <- That - 1.96*sd(TB)
ci_u <- That + 1.96*sd(TB)

c(ci_l, ci_u)

## [1] 20.76757 29.03263

Evaluation for Normal approximation

lambda <- 5
n <- 100
truth <- lambda^2 
B <- 1000
Repeats <- 100

counts <- 0

plot(c(0,100),c(0,Repeats), type="n", xlab="boot CI", ylab="repeats index")
abline(v = truth, col=2)

for(r in 1:Repeats){
  set.seed(r)
  X <- rpois(n, lambda)
  lambdaHat <- mean(X)
  That <- lambdaHat^2 
  TB <- numeric(B)
  for(b in 1:B){
    set.seed(b)
    aX <- sample(X,n,replace = T)
    TB[b] <- (mean(aX))^2
  }
  ci_l <- That - 1.96*sd(TB)
  ci_u <- That + 1.96*sd(TB)
  segments(ci_l, r, ci_u, r)
  if(ci_l < truth && truth < ci_u){
    counts <- counts + 1
  }
}

counts/Repeats

## [1] 0.93

Bootstrapping confidence interval via Pivotal Intervals

This won’t be covered this year.
If you are interested, check out the previous lecture notes: https://caleb-huo.github.io/teaching/2017FALL/lectures/week11_bootstrapAndPermutation/bootstrap/bootstrap.html

Summary Bootstrapping confidence interval

Procedure	Theoretical guarantee	Fast	In R package boot?
Percentiles	No	Yes	Yes
Pivotal Intervals	Yes	No	Yes
Pivotal Intervals (simplified, no se)	Yes	Yes	No
normal approximation	Yes	Yes	Yes

HW, Large scale Bootstrapping exercise

For the HAPMAP data on HiPerGator:
Calculate the sample covariance matrix.
What is the bootstrap variance of the largest eigenvalue?
What is the bootstrap confidence interval of the largest eigenvalue?

Biostatistical Computing, PHC 6068

Bootstrapping
