Biostatistical Computing, PHC 6068

Linear Model

Zhiguang Huo (Caleb)

Monday October 19, 2020

Outline

- Linear regression model and lm()
- Formula syntax
- Inference for linear models
- ANOVA and ANCOVA
- Model diagnostics
- Model selection
- Non-linear predictors
- Splines (optional)

linear regression model (population version)

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p + \varepsilon, \text{with }\varepsilon \sim N(0, \sigma^2)\]

Assumptions:

- The relationship between \(Y\) and the predictors is linear.
- The errors \(\varepsilon\) are independent with mean 0 and constant variance \(\sigma^2\).
- The errors are normally distributed.

linear regression model (sample version)

\[Y = X\beta + \varepsilon\]

Assumptions:

- \(Y \in \mathbb{R}^n\) is the response vector and \(X \in \mathbb{R}^{n \times (p+1)}\) is the design matrix, whose first column is all 1's for the intercept.
- \(\varepsilon \sim N(0, \sigma^2 I_n)\); i.e., the errors are independent, homoscedastic, and normally distributed.

Solution to linear regression model
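Minimizing the residual sum of squares \(\|Y - X\beta\|_2^2\) gives the closed-form least squares solution

\[\hat{\beta} = (X^\top X)^{-1} X^\top Y\]

A minimal sketch (not in the original slides) verifying this formula against lm(), using the mtcars example from the next slide:

X <- cbind(1, mtcars$cyl)                  ## design matrix with intercept column
y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% y)  ## solves (X'X) beta = (X'y)
drop(beta_hat)                             ## 37.885, -2.876
coef(lm(mpg ~ cyl, data = mtcars))         ## same values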

Fit linear model in R, lm()

Basic syntax for lm()

lm(formula, data)
lmfit <- lm(mpg ~ cyl, data=mtcars)
lmfit
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876
summary(lmfit)
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9814 -2.1185  0.2217  1.0717  7.5186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
## cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared:  0.7262, Adjusted R-squared:  0.7171 
## F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10
par(mfrow=c(2,2))
plot(lmfit)

Prostate cancer data

The data are from the book The Elements of Statistical Learning.

library(ElemStatLearn)
str(prostate)
## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
prostate$train <- NULL

Explanation of these variables:

- lcavol: log cancer volume
- lweight: log prostate weight
- age: age of the patient
- lbph: log of the amount of benign prostatic hyperplasia
- svi: seminal vesicle invasion (0/1)
- lcp: log of capsular penetration
- gleason: Gleason score
- pgg45: percentage of Gleason scores 4 or 5
- lpsa: log prostate-specific antigen (the response)

head(prostate)
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678

Exploratory analysis

Distribution of each variable – Histogram

par(mfrow=c(3,3))
for(i in 1:9){
  aname <- names(prostate)[i]
  hist(prostate[[i]], main=aname, col="steelblue", breaks=20)
}

What can we learn from the histograms?

Correlation between each pairs of variables

cor_prostate <- cor(prostate)
max(cor_prostate[cor_prostate!=1])
## [1] 0.7519045
maxID <- arrayInd(which(max(cor_prostate[cor_prostate!=1]) == cor_prostate)[1], dim(cor_prostate))
names(prostate)[maxID]
## [1] "pgg45"   "gleason"
heatmap(cor_prostate, scale="none") ## instead of image(); use Rowv = NA, Colv = NA to suppress the hierarchical clustering dendrograms

Visualize the scatter plot of each pair of variables.

pairs(prostate,pch=19)

Formulas

Syntax   Function
+        Separate main effects
:        Denote interactions
*        Main effects and interactions
^n       Include all main effects and interactions up to order n
-        Remove the specified terms
I()      Bracket the parts of a formula where operators are used mathematically
.        Main effect for each column in the data frame, except the response

Fit linear regression model

lm_prostate <- lm(lpsa ~ age + lweight, data = prostate)
lm_prostate
## 
## Call:
## lm(formula = lpsa ~ age + lweight, data = prostate)
## 
## Coefficients:
## (Intercept)          age      lweight  
##   -1.897709     0.003318     1.147487
class(lm_prostate)
## [1] "lm"
names(lm_prostate)
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Inference for Linear Models

Function    Application
summary()   Summary of the linear model
coef()      Get the coefficients
confint()   Confidence intervals for model parameters
predict()   Make a prediction
fitted()    Get the fitted values (i.e., \(X\hat{\beta}\))
summary(lm_prostate)
## 
## Call:
## lm(formula = lpsa ~ age + lweight, data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2665 -0.6753 -0.0361  0.5317  2.9712 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.897709   1.119033  -1.696   0.0932 .  
## age          0.003318   0.015369   0.216   0.8295    
## lweight      1.147487   0.267094   4.296 4.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.051 on 94 degrees of freedom
## Multiple R-squared:  0.1882, Adjusted R-squared:  0.1709 
## F-statistic: 10.89 on 2 and 94 DF,  p-value: 5.558e-05
coef(lm_prostate)
##  (Intercept)          age      lweight 
## -1.897708604  0.003318361  1.147487156
confint(lm_prostate)
##                   2.5 %    97.5 %
## (Intercept) -4.11957499 0.3241578
## age         -0.02719768 0.0338344
## lweight      0.61716615 1.6778082
anew <- data.frame(age = 29, lweight = 100)
predict(lm_prostate, anew)
##        1 
## 112.9472
head(fitted(lm_prostate))
##        1        2        3        4        5        6 
## 1.446128 2.103985 1.436017 2.061715 2.246634 1.973246
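predict() can also return interval estimates via its interval argument. A minimal sketch, using a hypothetical new subject (the age and lweight values are made up for illustration):

anew2 <- data.frame(age = 65, lweight = 3.5)         ## hypothetical subject
predict(lm_prostate, anew2, interval = "confidence") ## CI for the mean response
predict(lm_prostate, anew2, interval = "prediction") ## PI for a new observation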

Formulas (continued)

Model                  Interpretation
y ~ 1                  Just the intercept
y ~ a                  One main effect
y ~ -1 + a             No intercept
y ~ a + b + c + a:b    Three main effects and an interaction between a and b
y ~ a*b                All main effects and interactions (same as a + b + a:b)
y ~ factor(a)          Create dummy variables for a (if not already a factor)
y ~ (a+b+c)^2          All main effects and second-order interactions
y ~ (a+b+c)^2 - a:b    All main effects and second-order interactions except a:b
y ~ I(a^2)             Transform a to a^2
log(y) ~ a             Transform y to log(y)
y ~ .                  Main effect for each column in the data frame

Demonstrate other formula of lm using prostate cancer data

lm(lpsa ~ 1, data = prostate)
lm(lpsa ~ age, data = prostate)
lm(lpsa ~ - 1 + age, data = prostate)
lm(lpsa ~ age + lweight + gleason + gleason:lweight, data = prostate)
lm(lpsa ~ gleason*lweight, data = prostate)
lm(lpsa ~ factor(gleason), data = prostate)
lm(lpsa ~ (age + lweight + gleason)^2, data = prostate)
lm(lpsa ~ (age + lweight + gleason)^2 - gleason:lweight, data = prostate)
lm(lpsa ~ I(age^2), data = prostate)
lm(log(age) ~ lpsa, data = prostate)
lm(lpsa ~ ., data = prostate)

Anova (1)

blm <- lm(lpsa ~ as.factor(gleason), data = prostate)
summary(blm)
## 
## Call:
## lm(formula = lpsa ~ as.factor(gleason), data = prostate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.07390 -0.57561 -0.02379  0.60166  2.67156 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.7384     0.1730  10.051  < 2e-16 ***
## as.factor(gleason)7   1.1730     0.2205   5.320 7.12e-07 ***
## as.factor(gleason)8   0.4192     1.0377   0.404   0.6872    
## as.factor(gleason)9   1.1348     0.4892   2.320   0.0225 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.023 on 93 degrees of freedom
## Multiple R-squared:  0.2388, Adjusted R-squared:  0.2143 
## F-statistic: 9.726 on 3 and 93 DF,  p-value: 1.208e-05
summary.aov(blm)
##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(gleason)  3  30.55  10.183   9.726 1.21e-05 ***
## Residuals          93  97.37   1.047                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova (2)

clm <- aov(lpsa ~ as.factor(gleason), data = prostate)
summary(clm)
##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(gleason)  3  30.55  10.183   9.726 1.21e-05 ***
## Residuals          93  97.37   1.047                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(clm) ##Multiple comparisons, Tukey’s Honest Significant Difference
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = lpsa ~ as.factor(gleason), data = prostate)
## 
## $`as.factor(gleason)`
##            diff        lwr      upr     p adj
## 7-6  1.17300730  0.5962186 1.749796 0.0000042
## 8-6  0.41918941 -2.2956321 3.134011 0.9775856
## 9-6  1.13480989 -0.1449692 2.414589 0.1008983
## 8-7 -0.75381789 -3.4544627 1.946827 0.8847334
## 9-7 -0.03819741 -1.2876217 1.211227 0.9998152
## 9-8  0.71562048 -2.2167219 3.647963 0.9192925
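The pairwise comparisons can also be visualized; a minimal sketch using the plot method for TukeyHSD objects:

plot(TukeyHSD(clm))  ## CIs for all pairwise differences of group means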

ANCOVA

dlm <- aov(lpsa ~ as.factor(gleason) + age, data = prostate)
summary(dlm)
##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(gleason)  3  30.55  10.183   9.646 1.34e-05 ***
## age                 1   0.25   0.246   0.233     0.63    
## Residuals          92  97.12   1.056                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Model Diagnostics

Function          Application
fitted.values()   Returns the fitted values
residuals()       Returns the residuals
rstandard()       Standardized residuals (variance one); residuals standardized using the overall error variance
rstudent()        Studentized residuals (variance one); residuals standardized using a leave-one-out estimate of the error variance
qqnorm()          Normal quantile plot
qqline()          Add a line to the normal quantile plot
plot()            Given an lm object, produces six diagnostic plots, selected with the which argument; the default is plots 1-3 and 5
vif()             Variance inflation factors, to check collinearity (from the car package)
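For example, the two residual scalings can be compared directly (a minimal sketch, using the lm_prostate model fit earlier):

head(rstandard(lm_prostate))  ## standardized residuals
head(rstudent(lm_prostate))   ## studentized (leave-one-out) residuals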

fitted value and residuals

lm_prostate <- lm(lpsa ~ age + lweight, data = prostate)
head(fitted(lm_prostate))
##        1        2        3        4        5        6 
## 1.446128 2.103985 1.436017 2.061715 2.246634 1.973246
head(residuals(lm_prostate))
##         1         2         3         4         5         6 
## -1.876911 -2.266503 -1.598536 -2.224233 -1.875070 -1.207778
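As a sanity check (not in the original slides), the residuals are simply the observed response minus the fitted values:

head(prostate$lpsa - fitted(lm_prostate))  ## matches residuals(lm_prostate)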

plot() on lm object options

  1. Residual versus fitted values
  2. Normal quantile-quantile plot
  3. sqrt(|Standardized residuals|) versus fitted values
  4. Cook’s distance versus row labels
  5. Standardized residuals versus leverage along with contours of Cook’s distance
  6. Cook’s distance versus leverage/(1-leverage) with sqrt(Standardized residuals) contours
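A minimal sketch drawing all six on one page:

par(mfrow = c(2, 3))
plot(lm_prostate, which = 1:6)  ## select any subset via the which argument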

Assess the Gaussian assumption

plot(lm_prostate, 1)

QQ-plot

QQ-plot 0

set.seed(32611)
a0 <- rnorm(1000)
a <- sort(a0)
quantile_a <- seq_along(a) / (length(a) + 1) ## empirical quantile levels of the sampled data
b <- qnorm(quantile_a)  ## theoretical normal quantiles at the same levels

plot(b, a) ## QQ-plot

QQ-plot 1

set.seed(32611)
a <- rnorm(1000)
qqnorm(a)
abline(a = 0,b = 1, col=2) ## x = y line

QQ-plot 2

set.seed(32611)
qqnorm(rnorm(1000, sd=10))
abline(a = 0,b = 1, col=2) ## x = y line

QQ-plot 3

set.seed(32611)
qqnorm(1:1000)
abline(a = 0,b = 1, col=2) ## x = y line

QQ-plot 4

set.seed(32611)
qqnorm(rexp(1000)) 
abline(a = 0,b = 1, col=2) ## x = y line

Prostate cancer example

lm_prostate <- lm(lpsa ~ age + lweight, data = prostate)
plot(lm_prostate, 2)

Formal normality test

set.seed(32611)
shapiro.test(rnorm(1000))
## 
##  Shapiro-Wilk normality test
## 
## data:  rnorm(1000)
## W = 0.99873, p-value = 0.7059
shapiro.test(1:1000)
## 
##  Shapiro-Wilk normality test
## 
## data:  1:1000
## W = 0.95481, p-value < 2.2e-16
shapiro.test(residuals(lm_prostate))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lm_prostate)
## W = 0.98133, p-value = 0.1836

Unusual and Influential Data

Outliers

plot(lm_prostate, 3)

Leverage

plot(lm_prostate, 5)

influential points: Cook’s distance

plot(lm_prostate, 4)
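Cook's distances can also be extracted numerically; a minimal sketch:

cd <- cooks.distance(lm_prostate)
head(sort(cd, decreasing = TRUE))  ## most influential observations first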

Fit a linear regression model with all variables

lm_all <- lm(lpsa ~ ., data=prostate) ## "." means all the other variables
summary(lm_all)
## 
## Call:
## lm(formula = lpsa ~ ., data = prostate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76644 -0.35510 -0.00328  0.38087  1.55770 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.181561   1.320568   0.137  0.89096    
## lcavol       0.564341   0.087833   6.425 6.55e-09 ***
## lweight      0.622020   0.200897   3.096  0.00263 ** 
## age         -0.021248   0.011084  -1.917  0.05848 .  
## lbph         0.096713   0.057913   1.670  0.09848 .  
## svi          0.761673   0.241176   3.158  0.00218 ** 
## lcp         -0.106051   0.089868  -1.180  0.24115    
## gleason      0.049228   0.155341   0.317  0.75207    
## pgg45        0.004458   0.004365   1.021  0.31000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared:  0.6634, Adjusted R-squared:  0.6328 
## F-statistic: 21.68 on 8 and 88 DF,  p-value: < 2.2e-16

Variance inflation factor (VIF)

lm_all <- lm(lpsa ~ ., data=prostate) ## "." means all the other variables
car::vif(lm_all)
##   lcavol  lweight      age     lbph      svi      lcp  gleason    pgg45 
## 2.102650 1.453325 1.336099 1.385040 1.955928 3.097954 2.468891 2.974075
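As a sanity check (not in the original slides), the VIF of predictor \(j\) is \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing predictor \(j\) on all the other predictors:

r2_lcavol <- summary(lm(lcavol ~ . - lpsa, data = prostate))$r.squared
1 / (1 - r2_lcavol)  ## matches car::vif(lm_all)["lcavol"], about 2.10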

Model selection

AIC, BIC

AIC(lm_all)
## [1] 216.4952
BIC(lm_all)
## [1] 242.2423
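For a model with \(k\) estimated parameters and \(n\) observations, \(\text{AIC} = -2\log L + 2k\) and \(\text{BIC} = -2\log L + k \log n\). A minimal sketch reproducing the values above:

ll <- logLik(lm_all)
k <- attr(ll, "df")   ## number of parameters, including sigma
n <- nobs(lm_all)
-2 * as.numeric(ll) + 2 * k       ## matches AIC(lm_all)
-2 * as.numeric(ll) + log(n) * k  ## matches BIC(lm_all)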

backward selection

step(lm(lpsa ~ ., data=prostate)) ## without a scope argument, step() can only drop terms (backward selection)
## Start:  AIC=-60.78
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + 
##     pgg45
## 
##           Df Sum of Sq    RSS     AIC
## - gleason  1    0.0491 43.108 -62.668
## - pgg45    1    0.5102 43.569 -61.636
## - lcp      1    0.6814 43.740 -61.256
## <none>                 43.058 -60.779
## - lbph     1    1.3646 44.423 -59.753
## - age      1    1.7981 44.857 -58.810
## - lweight  1    4.6907 47.749 -52.749
## - svi      1    4.8803 47.939 -52.364
## - lcavol   1   20.1994 63.258 -25.467
## 
## Step:  AIC=-62.67
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
## 
##           Df Sum of Sq    RSS     AIC
## - lcp      1    0.6684 43.776 -63.176
## <none>                 43.108 -62.668
## - pgg45    1    1.1987 44.306 -62.008
## - lbph     1    1.3844 44.492 -61.602
## - age      1    1.7579 44.865 -60.791
## - lweight  1    4.6429 47.751 -54.746
## - svi      1    4.8333 47.941 -54.360
## - lcavol   1   21.3191 64.427 -25.691
## 
## Step:  AIC=-63.18
## lpsa ~ lcavol + lweight + age + lbph + svi + pgg45
## 
##           Df Sum of Sq    RSS     AIC
## - pgg45    1    0.6607 44.437 -63.723
## <none>                 43.776 -63.176
## - lbph     1    1.3329 45.109 -62.266
## - age      1    1.4878 45.264 -61.934
## - svi      1    4.1766 47.953 -56.336
## - lweight  1    4.6553 48.431 -55.373
## - lcavol   1   22.7555 66.531 -24.572
## 
## Step:  AIC=-63.72
## lpsa ~ lcavol + lweight + age + lbph + svi
## 
##           Df Sum of Sq    RSS     AIC
## <none>                 44.437 -63.723
## - age      1    1.1588 45.595 -63.226
## - lbph     1    1.5087 45.945 -62.484
## - lweight  1    4.3140 48.751 -56.735
## - svi      1    5.8509 50.288 -53.724
## - lcavol   1   25.9427 70.379 -21.119
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
## 
## Coefficients:
## (Intercept)       lcavol      lweight          age         lbph          svi  
##     0.49473      0.54400      0.58821     -0.01644      0.10122      0.71490
stepDetails <- step(lm(lpsa ~ ., data=prostate), trace=0)$anova
stepDetails$AIC[2] - stepDetails$AIC[1]
## [1] -1.889365
AIC(lm(lpsa ~ . - gleason, data=prostate)) - AIC(lm(lpsa ~ ., data=prostate))
## [1] -1.889365
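Forward selection (not shown in the original slides) works similarly, starting from the intercept-only model and supplying the full model as the search scope; a minimal sketch:

null_model <- lm(lpsa ~ 1, data = prostate)
full_model <- lm(lpsa ~ ., data = prostate)
step(null_model, scope = formula(full_model), direction = "forward", trace = 0)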

non-linear predictor (1)

library(minpack.lm)

## simulate data from sin function
set.seed(32611)
n <- 100
x <- runif(n,0,2*pi)
y <- sin(x) + rnorm(n,sd=0.2)
plot(x,y)

fit \(y = A \sin(x + B) + C\)

getPred <- function(parS, xx) {
  ## parS: parameter list
  ## xx: x values (predictors)
  parS$A * sin(xx + parS$B) + parS$C
}

residFun <- function(p, observed, xx) observed - getPred(p, xx)

parStart <- list(A=1, B=3, C=0)
nls.out <- nls.lm(par = parStart, fn = residFun, observed = y, xx = x)
nls.out
## Nonlinear regression via the Levenberg-Marquardt algorithm
## parameter estimates: -1.03727642432862, 3.18141544265264, -0.0260329085154092 
## residual sum-of-squares: 3.783
## reason terminated: Conditions for `info = 1' and `info = 2' both hold.
aseq <- seq(0,2*pi,length.out = 1000)
plot(x,y)
## note the fit gives A close to -1 and B close to pi; since -sin(x + pi) = sin(x), it recovers sin(x)
with(nls.out$par, lines(aseq, A*sin(aseq + B) + C, col=2))

non-linear predictor (2)

set.seed(32611)
n <- 100
x <- runif(n,0,5)
y <- exp(x) + rnorm(n,sd=1)
plot(x,y)

fit \(y = A \exp(Bx)\)

getPred <- function(parS, xx) {
  parS$A * exp(parS$B * xx)
}

residFun <- function(p, observed, xx) observed - getPred(p, xx)

parStart <- list(A=3, B=3)
nls.out <- nls.lm(par = parStart, fn = residFun, observed = y, xx = x)
nls.out
## Nonlinear regression via the Levenberg-Marquardt algorithm
## parameter estimates: 0.963577134214599, 1.00769428203048 
## residual sum-of-squares: 95.74
## reason terminated: Conditions for `info = 1' and `info = 2' both hold.
aseq <- seq(0,5,length.out = 1000)
plot(x,y)
with(nls.out$par, lines(aseq, A*exp(B*aseq), col=2))

Splines (Optional)

Splines motivating example

(Figure: continuity at the knots, from the book The Elements of Statistical Learning.)

Splines basis

How to parametrize the set of splines with knots at \(t_1, \ldots, t_m\)?
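One standard answer (not spelled out in the text here) is the truncated power basis of order \(k\):

\[g_1(x) = 1,\ g_2(x) = x,\ \ldots,\ g_{k+1}(x) = x^k,\]
\[g_{k+1+j}(x) = (x - t_j)_+^k, \quad j = 1, \ldots, m,\]

so that any \(k\)-th order spline with knots \(t_1, \ldots, t_m\) is a linear combination of these \(m+k+1\) functions.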

Splines types

Regression splines

\[\hat{\beta} = \arg \min_{\beta \in \mathbb{R}^{m+k+1}} \|y - G\beta\|_2^2,\]

where \(G \in \mathbb{R}^{n \times (m+k+1)}\) is the basis matrix with entries \(G_{ij} = g_j(x_i)\).
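In R, a regression spline is therefore just lm() on a spline basis matrix. A minimal sketch using the B-spline basis bs() from the splines package (an equivalent but numerically more stable basis than the truncated powers), on the mtcars data used below:

library(splines)
spl_bs <- with(mtcars, lm(disp ~ bs(mpg, df = 5)))  ## cubic B-spline basis
coef(spl_bs)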

Natural splines

natural spline in R (data points)

plot(disp ~ mpg, data = mtcars, main = "data(mtcars)")

natural spline in R

library(splines)

plot(disp ~ mpg, data = mtcars, main = "data(mtcars)  &  natural splines")
#with(mtcars,  ns(mpg, df = 4))
spl_natural <- with(mtcars, lm(disp ~ ns(mpg, df = 4)))
mtcars_sort <- mtcars[order(mtcars$mpg), ] 
lines(mtcars_sort$mpg, predict(spl_natural, mtcars_sort), col = "blue")

Smoothing splines

\[\hat{f} = \arg\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b \left(f^{((k+1)/2)}(x)\right)^2 dx\]

- The first term is the least squares loss.
- The second term penalizes \(f\) if its derivatives are too wiggly.

For cubic smoothing splines (\(k = 3\)), the penalty involves the second derivative:

\[\hat{f} = \arg\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b (f''(x))^2 dx\]

cubic smoothing spline in R (data points)

plot(disp ~ mpg, data = mtcars, main = "data(mtcars)")

cubic smoothing spline in R

plot(disp ~ mpg, data = mtcars, main = "data(mtcars)  &  smoothing splines")
cars.spl <- with(mtcars, smooth.spline(mpg, disp))
cars.spl
## Call:
## smooth.spline(x = mpg, y = disp)
## 
## Smoothing Parameter  spar= 0.9009065  lambda= 0.02109193 (11 iterations)
## Equivalent Degrees of Freedom (Df): 3.211475
## Penalized Criterion (RSS): 61580.33
## GCV: 3828.877
lines(cars.spl, col = "blue")
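The fitted object can also be evaluated at new points with predict(); a minimal sketch:

predict(cars.spl, x = c(15, 20, 25))  ## fitted disp at chosen mpg values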