Zhiguang Huo (Caleb)
Monday October 18, 2021
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p + \varepsilon, \text{with }\varepsilon \sim N(0, \sigma^2)\]
Assumptions:

- Linearity: the mean of \(Y\) is a linear function of \(X_1, \ldots, X_p\)
- Independence: the errors are independent across observations
- Homoscedasticity: the errors have constant variance \(\sigma^2\)
- Normality: \(\varepsilon \sim N(0, \sigma^2)\)
\[Y = X\beta + \varepsilon\]
Assumptions:

- \(X \in \mathbb{R}^{n \times (p+1)}\) is the (fixed) design matrix and \(\beta \in \mathbb{R}^{p+1}\) the coefficient vector
- \(\varepsilon \sim N(0, \sigma^2 I_n)\), i.e., independent errors with common variance
Maximum likelihood estimator (MLE): \[\hat{\beta} = \arg \max_\beta L(\beta; X, Y)\]
For the linear regression model, the LS estimator coincides with the MLE: \[\hat{\beta} = (X^\top X)^{-1} X^\top Y,\] assuming \(X^\top X\) is invertible.
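As a quick sketch, the closed-form estimator can be checked against `lm()`; the built-in `mtcars` data are used here purely as an illustration:

```r
## closed-form LS estimate: beta_hat = (X^T X)^{-1} X^T Y
X <- cbind(1, mtcars$cyl)   ## design matrix with an intercept column
Y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X, t(X) %*% Y)  ## solve the normal equations
fit <- lm(mpg ~ cyl, data = mtcars)
## beta_hat agrees with coef(fit) up to numerical precision
```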
Basic syntax for lm()
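The output below was presumably generated by calls of this form (a sketch):

```r
fit <- lm(mpg ~ cyl, data = mtcars)  ## fit mpg on number of cylinders
fit           ## prints the call and coefficients
summary(fit)  ## coefficient table, R-squared, F-statistic
```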
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Coefficients:
## (Intercept)          cyl  
##      37.885       -2.876
## 
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9814 -2.1185  0.2217  1.0717  7.5186 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  37.8846     2.0738   18.27  < 2e-16 ***
## cyl          -2.8758     0.3224   -8.92 6.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared:  0.7262, Adjusted R-squared:  0.7171 
## F-statistic: 79.56 on 1 and 30 DF,  p-value: 6.113e-10
The data come from the book *The Elements of Statistical Learning*.
## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
Explanation of these variables:

- lcavol: log cancer volume
- lweight: log prostate weight
- age: age in years
- lbph: log of the amount of benign prostatic hyperplasia
- svi: seminal vesicle invasion
- lcp: log of capsular penetration
- gleason: Gleason score
- pgg45: percent of Gleason scores 4 or 5
- lpsa: log of prostate specific antigen (the response)
- train: indicator of the training set
##       lcavol  lweight age      lbph svi       lcp gleason pgg45       lpsa
## 1 -0.5798185 2.769459  50 -1.386294   0 -1.386294       6     0 -0.4307829
## 2 -0.9942523 3.319626  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 3 -0.5108256 2.691243  74 -1.386294   0 -1.386294       7    20 -0.1625189
## 4 -1.2039728 3.282789  58 -1.386294   0 -1.386294       6     0 -0.1625189
## 5  0.7514161 3.432373  62 -1.386294   0 -1.386294       6     0  0.3715636
## 6 -1.0498221 3.228826  50 -1.386294   0 -1.386294       6     0  0.7654678
par(mfrow=c(3,3))
for(i in 1:9){
  aname <- names(prostate)[i]
  hist(prostate[[i]], main=aname, col="steelblue", breaks=20)
}

What can we learn from the histograms?
cor_prostate <- cor(prostate[, 1:9])  ## correlation matrix of the numeric variables
max(cor_prostate[cor_prostate != 1])  ## largest off-diagonal correlation
## [1] 0.7519045
maxID <- arrayInd(which(max(cor_prostate[cor_prostate != 1]) == cor_prostate)[1], dim(cor_prostate))
names(prostate)[maxID]  ## which pair of variables achieves it
## [1] "pgg45"   "gleason"
| Syntax | Function | 
|---|---|
| + | Separate main effects | 
| : | Denote interactions | 
| * | Main effects and interactions | 
| ^n | Include all main effects and interactions up to order n | 
| - | Remove the specified terms | 
| I() | Brackets the portions of a formula where operators are used mathematically | 
| . | Main effect for each column in the data frame, except the response | 
## 
## Call:
## lm(formula = lpsa ~ age + lweight, data = prostate)
## 
## Coefficients:
## (Intercept)          age      lweight  
##   -1.897709     0.003318     1.147487
## [1] "lm"
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
| function | application | 
|---|---|
| summary() | summary of linear model | 
| coef() | get coefficient | 
| confint() | Confidence intervals for model parameters | 
| predict() | make a prediction | 
| fitted() | get fitted value, (e.g., \(X\hat{\beta}\)) | 
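A minimal sketch of these helpers, shown on the built-in `mtcars` data to keep it self-contained (the `newdata` values are illustrative):

```r
fit <- lm(mpg ~ cyl + wt, data = mtcars)
coef(fit)          ## estimated coefficients
confint(fit)       ## 95% confidence intervals for the coefficients
head(fitted(fit))  ## fitted values X %*% beta_hat
predict(fit, newdata = data.frame(cyl = 6, wt = 3))  ## prediction at a new point
```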
## 
## Call:
## lm(formula = lpsa ~ age + lweight, data = prostate)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2665 -0.6753 -0.0361  0.5317  2.9712 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.897709   1.119033  -1.696   0.0932 .  
## age          0.003318   0.015369   0.216   0.8295    
## lweight      1.147487   0.267094   4.296 4.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.051 on 94 degrees of freedom
## Multiple R-squared:  0.1882, Adjusted R-squared:  0.1709 
## F-statistic: 10.89 on 2 and 94 DF,  p-value: 5.558e-05
##  (Intercept)          age      lweight 
## -1.897708604  0.003318361  1.147487156
##                   2.5 %    97.5 %
## (Intercept) -4.11957499 0.3241578
## age         -0.02719768 0.0338344
## lweight      0.61716615 1.6778082
##                 Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -1.897708604 1.11903307 -1.6958468 0.0932246747
## age          0.003318361 0.01536927  0.2159089 0.8295265858
## lweight      1.147487156 0.26709380  4.2961954 0.0000423256
##        1 
## 112.9472
##        1        2        3        4        5        6 
## 1.446128 2.103985 1.436017 2.061715 2.246634 1.973246
| Model | Interpretation | 
|---|---|
| y ~ 1 | Just the intercept | 
| y ~ a | One main effect | 
| y ~ - 1 + a | No intercept | 
| y ~ a+b+c+a:b | Three main effects and an interaction between a and b | 
| y ~ a * b | All main effects and interactions (same as a+b+a:b) | 
| y ~ factor(a) | Create dummy variables for a (if not already a factor) | 
| y ~ (a+b+c)^2 | All main effects and second-order interactions | 
| y ~ (a+b+c)^2 - a:b | All main effects and second-order interactions except a:b | 
| y ~ I(a^2) | Transform a to a^2 | 
| log(y) ~ a | Transform y to log(y) | 
| y ~ . | Main effect for each column in the dataframe | 
lm(lpsa ~ 1, data = prostate)
lm(lpsa ~ age, data = prostate)
lm(lpsa ~ - 1 + age, data = prostate)
lm(lpsa ~ age + lweight + gleason + gleason:lweight, data = prostate)
lm(lpsa ~ gleason*lweight, data = prostate)
lm(lpsa ~ factor(gleason), data = prostate)
lm(lpsa ~ (age + lweight + gleason)^2, data = prostate)
lm(lpsa ~ (age + lweight + gleason)^2 - gleason:lweight, data = prostate)
lm(lpsa ~ I(age^2), data = prostate)
lm(log(age) ~ lpsa, data = prostate)
lm(lpsa ~ ., data = prostate)
## 
## Call:
## lm(formula = lpsa ~ as.factor(gleason), data = prostate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.07390 -0.57561 -0.02379  0.60166  2.67156 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           1.7384     0.1730  10.051  < 2e-16 ***
## as.factor(gleason)7   1.1730     0.2205   5.320 7.12e-07 ***
## as.factor(gleason)8   0.4192     1.0377   0.404   0.6872    
## as.factor(gleason)9   1.1348     0.4892   2.320   0.0225 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.023 on 93 degrees of freedom
## Multiple R-squared:  0.2388, Adjusted R-squared:  0.2143 
## F-statistic: 9.726 on 3 and 93 DF,  p-value: 1.208e-05
##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(gleason)  3  30.55  10.183   9.726 1.21e-05 ***
## Residuals          93  97.37   1.047                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(gleason)  3  30.55  10.183   9.726 1.21e-05 ***
## Residuals          93  97.37   1.047                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = lpsa ~ as.factor(gleason), data = prostate)
## 
## $`as.factor(gleason)`
##            diff        lwr      upr     p adj
## 7-6  1.17300730  0.5962186 1.749796 0.0000042
## 8-6  0.41918941 -2.2956321 3.134011 0.9775856
## 9-6  1.13480989 -0.1449692 2.414589 0.1008983
## 8-7 -0.75381789 -3.4544627 1.946827 0.8847334
## 9-7 -0.03819741 -1.2876217 1.211227 0.9998152
## 9-8  0.71562048 -2.2167219 3.647963 0.9192925
##                    Df Sum Sq Mean Sq F value   Pr(>F)    
## as.factor(gleason)  3  30.55  10.183   9.646 1.34e-05 ***
## age                 1   0.25   0.246   0.233     0.63    
## Residuals          92  97.12   1.056                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
| function | application | 
|---|---|
| fitted.values() | Returns fitted values | 
| residuals() | Returns residuals | 
| rstandard() | Standardized residuals, variance one; residual standardized using overall error variance | 
| rstudent() | Studentized residuals, variance one; residual standardized using leave-one-out measure of the error variance | 
| qqnorm() | Normal quantile plot | 
| qqline() | Add a line to the normal quantile plot | 
| plot() | Given a lm object produces six diagnostic plots, selected using the which argument; default is plots 1-3 and 5 | 
| vif() | Variance Inflation Factors, to check collinearity | 
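A sketch of these diagnostics on a generic `lm` fit; `vif()` lives in the `car` package, so it is only noted in a comment:

```r
fit <- lm(mpg ~ cyl + wt, data = mtcars)
r_std <- rstandard(fit)  ## standardized residuals (overall error variance)
r_stu <- rstudent(fit)   ## studentized residuals (leave-one-out error variance)
qqnorm(r_std); qqline(r_std)  ## normal quantile plot of the residuals
par(mfrow = c(2, 2))
plot(fit)  ## default diagnostic plots 1-3 and 5
## car::vif(fit)  ## variance inflation factors, if car is installed
```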
##        1        2        3        4        5        6 
## 1.446128 2.103985 1.436017 2.061715 2.246634 1.973246
##         1         2         3         4         5         6 
## -1.876911 -2.266503 -1.598536 -2.224233 -1.875070 -1.207778
set.seed(32611)
a0 <- rnorm(1000)
a <- sort(a0)
quantile_a <- seq_along(a) / (length(a) + 1) ## empirical quantile levels of the sorted sample
b <- qnorm(quantile_a)  ## theoretical normal quantiles at the same levels
plot(b, a) ## QQ-plot
## 
##  Shapiro-Wilk normality test
## 
## data:  rnorm(1000)
## W = 0.99873, p-value = 0.7059
## 
##  Shapiro-Wilk normality test
## 
## data:  1:1000
## W = 0.95481, p-value < 2.2e-16
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(lm_prostate)
## W = 0.98133, p-value = 0.1836
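The three Shapiro-Wilk outputs above likely come from calls of this form (a sketch; `lm_prostate` stands for the fitted prostate model):

```r
set.seed(32611)
shapiro.test(rnorm(1000))  ## normal data: large p-value, normality not rejected
shapiro.test(1:1000)       ## clearly non-normal data: p-value near zero
## shapiro.test(residuals(lm_prostate))  ## test normality of the model residuals
```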
## 
## Call:
## lm(formula = lpsa ~ ., data = prostate)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.76644 -0.35510 -0.00328  0.38087  1.55770 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.181561   1.320568   0.137  0.89096    
## lcavol       0.564341   0.087833   6.425 6.55e-09 ***
## lweight      0.622020   0.200897   3.096  0.00263 ** 
## age         -0.021248   0.011084  -1.917  0.05848 .  
## lbph         0.096713   0.057913   1.670  0.09848 .  
## svi          0.761673   0.241176   3.158  0.00218 ** 
## lcp         -0.106051   0.089868  -1.180  0.24115    
## gleason      0.049228   0.155341   0.317  0.75207    
## pgg45        0.004458   0.004365   1.021  0.31000    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared:  0.6634, Adjusted R-squared:  0.6328 
## F-statistic: 21.68 on 8 and 88 DF,  p-value: < 2.2e-16
##   lcavol  lweight      age     lbph      svi      lcp  gleason    pgg45 
## 2.102650 1.453325 1.336099 1.385040 1.955928 3.097954 2.468891 2.974075
## [1] 216.4952
## [1] 242.2423
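The two values above are presumably the AIC and BIC of the full prostate model; as a sketch on any fitted model (illustrated on `mtcars` for self-containedness):

```r
fit_full <- lm(mpg ~ ., data = mtcars)  ## the notes use lm(lpsa ~ ., data = prostate)
AIC(fit_full)  ## Akaike information criterion: -2 logLik + 2 * npar
BIC(fit_full)  ## Bayesian information criterion: -2 logLik + log(n) * npar
```

BIC penalizes model complexity more heavily than AIC whenever \(\log(n) > 2\), i.e., \(n > 7\).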
## Start:  AIC=-60.78
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason + 
##     pgg45
## 
##           Df Sum of Sq    RSS     AIC
## - gleason  1    0.0491 43.108 -62.668
## - pgg45    1    0.5102 43.569 -61.636
## - lcp      1    0.6814 43.740 -61.256
## <none>                 43.058 -60.779
## - lbph     1    1.3646 44.423 -59.753
## - age      1    1.7981 44.857 -58.810
## - lweight  1    4.6907 47.749 -52.749
## - svi      1    4.8803 47.939 -52.364
## - lcavol   1   20.1994 63.258 -25.467
## 
## Step:  AIC=-62.67
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
## 
##           Df Sum of Sq    RSS     AIC
## - lcp      1    0.6684 43.776 -63.176
## <none>                 43.108 -62.668
## - pgg45    1    1.1987 44.306 -62.008
## - lbph     1    1.3844 44.492 -61.602
## - age      1    1.7579 44.865 -60.791
## - lweight  1    4.6429 47.751 -54.746
## - svi      1    4.8333 47.941 -54.360
## - lcavol   1   21.3191 64.427 -25.691
## 
## Step:  AIC=-63.18
## lpsa ~ lcavol + lweight + age + lbph + svi + pgg45
## 
##           Df Sum of Sq    RSS     AIC
## - pgg45    1    0.6607 44.437 -63.723
## <none>                 43.776 -63.176
## - lbph     1    1.3329 45.109 -62.266
## - age      1    1.4878 45.264 -61.934
## - svi      1    4.1766 47.953 -56.336
## - lweight  1    4.6553 48.431 -55.373
## - lcavol   1   22.7555 66.531 -24.572
## 
## Step:  AIC=-63.72
## lpsa ~ lcavol + lweight + age + lbph + svi
## 
##           Df Sum of Sq    RSS     AIC
## <none>                 44.437 -63.723
## - age      1    1.1588 45.595 -63.226
## - lbph     1    1.5087 45.945 -62.484
## - lweight  1    4.3140 48.751 -56.735
## - svi      1    5.8509 50.288 -53.724
## - lcavol   1   25.9427 70.379 -21.119
## 
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
## 
## Coefficients:
## (Intercept)       lcavol      lweight          age         lbph          svi  
##     0.49473      0.54400      0.58821     -0.01644      0.10122      0.71490
stepDetails <- step(lm(lpsa ~ ., data=prostate), trace=0)$anova
stepDetails$AIC[2] - stepDetails$AIC[1]
## [1] -1.889365
library(minpack.lm)
## simulate data from sin function
set.seed(32611)
n <- 100
x <- runif(n,0,2*pi)
y <- sin(x) + rnorm(n,sd=0.2)
plot(x, y)

getPred <- function(parS, xx) {
  ## parS: parameter list
  ## xx: x values (predictors)
  parS$A * sin(xx + parS$B) + parS$C
}

residFun <- function(p, observed, xx) observed - getPred(p, xx)
parStart <- list(A=1, B=3, C=0)
nls.out <- nls.lm(par = parStart, fn = residFun, observed = y, xx = x)
nls.out
## Nonlinear regression via the Levenberg-Marquardt algorithm
## parameter estimates: -1.03727642432862, 3.18141544265264, -0.0260329085154092 
## residual sum-of-squares: 3.783
## reason terminated: Conditions for `info = 1' and `info = 2' both hold.
aseq <- seq(0,2*pi,length.out = 1000)
plot(x,y)
with(nls.out$par, lines(aseq, A*sin(aseq + B) + C, col=2))

getPred <- function(parS, xx) {
  parS$A * exp(parS$B * xx)
}
residFun <- function(p, observed, xx) observed - getPred(p,xx)
parStart <- list(A=3, B=3)
nls.out <- nls.lm(par = parStart, fn = residFun, observed = y, xx = x)
nls.out
## Nonlinear regression via the Levenberg-Marquardt algorithm
## parameter estimates: 0.963577134214599, 1.00769428203048 
## residual sum-of-squares: 95.74
## reason terminated: Conditions for `info = 1' and `info = 2' both hold.
Continuity at the knots, from the book *The Elements of Statistical Learning*.
How to parametrize the set of splines with knots at \(t_1, \ldots, t_m\)?
Given inputs \(x_1, \ldots, x_n\) and outputs \(y_1, \ldots, y_n\), we consider fitting a function \(f\) using \(k\)th-order splines with given knots at \(t_1, \ldots, t_m\): \[f(x) = \sum_{j=1}^{m+k+1} \beta_j g_j(x),\] where \(\beta_j, 1\le j \le m+k+1\) are coefficients and \(g_j, 1\le j \le m+k+1\) are basis functions for the order-\(k\) spline with knots \(t_1, \ldots, t_m\) (e.g., the truncated power basis or the B-spline basis).
Define \(G \in \mathbb{R}^{n \times (m+k+1)}\), where \(G_{ij} = g_j(x_i)\).
The estimated coefficients \(\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_{m+k+1})\) are given by
\[\hat{\beta} = \arg \min_{\beta \in \mathbb{R}^{m+k+1}} \|y - G\beta\|_2^2\]
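A minimal sketch of this least squares problem, using `splines::bs()` to build the basis matrix \(G\) (the knot locations and simulated data here are illustrative assumptions):

```r
library(splines)
set.seed(32611)
x <- runif(100, 0, 2*pi)
y <- sin(x) + rnorm(100, sd = 0.2)
G <- bs(x, knots = c(pi/2, pi, 3*pi/2), degree = 3)  ## cubic spline basis at chosen knots
M <- cbind(1, G)                                     ## add an intercept column
beta_hat <- solve(t(M) %*% M, t(M) %*% y)            ## least squares via normal equations
## lm() solves the same least squares problem
fit <- lm(y ~ bs(x, knots = c(pi/2, pi, 3*pi/2)))
```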
The problem with regression splines is that the estimates have high variance at the boundaries of the input domain.
library(splines)
plot(disp ~ mpg, data = mtcars, main = "data(mtcars)  &  natural splines")
#with(mtcars,  ns(mpg, df = 4))
spl_natural <- with(mtcars, lm(disp ~ ns(mpg, df = 4)))
mtcars_sort <- mtcars[order(mtcars$mpg), ] 
lines(mtcars_sort$mpg, predict(spl_natural, mtcars_sort), col = "blue")

\[\hat{f} = \arg\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b (f^{((k+1)/2)}(x))^2 dx\]
- the first term is the least squares loss
- the second term penalizes \(f\) if its derivative is too wiggly
\[\hat{f} = \arg\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b (f''(x))^2 dx\]
plot(disp ~ mpg, data = mtcars, main = "data(mtcars)  &  smoothing splines")
cars.spl <- with(mtcars, smooth.spline(mpg, disp))
cars.spl
## Call:
## smooth.spline(x = mpg, y = disp)
## 
## Smoothing Parameter  spar= 0.9009065  lambda= 0.02109193 (11 iterations)
## Equivalent Degrees of Freedom (Df): 3.211475
## Penalized Criterion (RSS): 61580.33
## GCV: 3828.877