Zhiguang Huo (Caleb)
Monday October 19, 2020
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p + \varepsilon, \text{with }\varepsilon \sim N(0, \sigma^2)\]
Assumptions:

- Linearity: the mean of \(Y\) is a linear function of \(X_1, \ldots, X_p\).
- Independence: the errors are independent across observations.
- Homoscedasticity: the errors have constant variance \(\sigma^2\).
- Normality: the errors are normally distributed.
\[Y = X\beta + \varepsilon\]
Assumptions:

- \(Y \in \mathbb{R}^n\) is the response vector; \(X \in \mathbb{R}^{n \times (p+1)}\) is the design matrix, whose first column is all ones for the intercept.
- \(\beta \in \mathbb{R}^{p+1}\) is the coefficient vector.
- \(\varepsilon \sim N(0, \sigma^2 I_n)\).
Maximum likelihood estimator (MLE): \[\hat{\beta} = \arg \max_\beta L(\beta; X, Y)\]
For the linear regression model, the least squares (LS) estimator \(\hat{\beta} = \arg \min_\beta \|Y - X\beta\|_2^2\) is the same as the MLE: \[\hat{\beta} = (X^\top X)^{-1} X^\top Y,\] assuming \(X^\top X\) is invertible.
Basic syntax for lm()
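The output below is produced by a call of the following form (reconstructed from the Call line in the output):

lm(mpg ~ cyl, data = mtcars)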
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Coefficients:
## (Intercept) cyl
## 37.885 -2.876
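Wrapping the fit in summary() gives the full inference table (again reconstructed from the Call line below):

summary(lm(mpg ~ cyl, data = mtcars))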
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9814 -2.1185 0.2217 1.0717 7.5186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
## cyl -2.8758 0.3224 -8.92 6.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
## F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
The data come from the book The Elements of Statistical Learning.
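One way to load the data is via the ElemStatLearn package, which ships the prostate dataset; this is a sketch assuming that package is installed, with str() producing the output below:

library(ElemStatLearn) ## assumed source of the prostate data
data(prostate)
str(prostate)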
## 'data.frame': 97 obs. of 10 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
Explanation of these variables:

- lcavol: log cancer volume
- lweight: log prostate weight
- age: age of the patient
- lbph: log of the amount of benign prostatic hyperplasia
- svi: seminal vesicle invasion
- lcp: log of capsular penetration
- gleason: Gleason score
- pgg45: percentage of Gleason scores 4 or 5
- lpsa: log prostate-specific antigen (the response)
- train: indicator of the training set
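The first few rows (the train indicator is excluded, matching the output below):

head(prostate[, 1:9])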
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
What can we learn from the histogram?
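The largest pairwise correlation (excluding the diagonal) can be computed as below; a minimal reconstruction assuming cor_prostate is the correlation matrix of the nine numeric columns:

cor_prostate <- cor(prostate[, 1:9]) ## pairwise correlations among the numeric variables
max(cor_prostate[cor_prostate != 1]) ## largest off-diagonal correlation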
## [1] 0.7519045
maxID <- arrayInd(which(max(cor_prostate[cor_prostate != 1]) == cor_prostate)[1], dim(cor_prostate)) ## (row, column) indices of the maximum correlation
names(prostate)[maxID] ## the pair of variables achieving that correlation
## [1] "pgg45" "gleason"
Syntax | Function |
---|---|
+ | Separate main effects |
: | Denote interactions |
* | Main effects and interactions |
^n | Include all main effects and interactions up to order n |
- | Remove the specified terms |
I() | Brackets the portions of a formula where operators are used mathematically |
. | Main effect for each column in the data frame, except the response |
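For example, the additive model below regresses lpsa on age and lweight; the object is named lm_prostate, matching the name used with residuals() later in these notes:

lm_prostate <- lm(lpsa ~ age + lweight, data = prostate)
lm_prostate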
##
## Call:
## lm(formula = lpsa ~ age + lweight, data = prostate)
##
## Coefficients:
## (Intercept) age lweight
## -1.897709 0.003318 1.147487
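An lm fit is a list-like object; its class and component names (the two outputs below) can be inspected as follows:

class(lm_prostate) ## class of the fitted object
names(lm_prostate) ## components stored inside it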
## [1] "lm"
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
function | application |
---|---|
summary() | Summary of the linear model |
coef() | Extract coefficients |
confint() | Confidence intervals for model parameters |
predict() | Make a prediction |
fitted() | Get fitted values (e.g., \(X\hat{\beta}\)) |
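Applied to lm_prostate, these functions produce the outputs below; the newdata used for predict() is not shown in the notes, so it is left unspecified here:

summary(lm_prostate) ## full inference table
coef(lm_prostate) ## coefficient estimates
confint(lm_prostate) ## 95% confidence intervals
## predict(lm_prostate, newdata = ...) ## prediction for new data (newdata not shown)
head(fitted(lm_prostate)) ## fitted values for the first six observations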
##
## Call:
## lm(formula = lpsa ~ age + lweight, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2665 -0.6753 -0.0361 0.5317 2.9712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.897709 1.119033 -1.696 0.0932 .
## age 0.003318 0.015369 0.216 0.8295
## lweight 1.147487 0.267094 4.296 4.23e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.051 on 94 degrees of freedom
## Multiple R-squared: 0.1882, Adjusted R-squared: 0.1709
## F-statistic: 10.89 on 2 and 94 DF, p-value: 5.558e-05
## (Intercept) age lweight
## -1.897708604 0.003318361 1.147487156
## 2.5 % 97.5 %
## (Intercept) -4.11957499 0.3241578
## age -0.02719768 0.0338344
## lweight 0.61716615 1.6778082
## 1
## 112.9472
## 1 2 3 4 5 6
## 1.446128 2.103985 1.436017 2.061715 2.246634 1.973246
Model | Interpretation |
---|---|
y ~ 1 | Just the intercept |
y ~ a | One main effect |
y ~ -1 + a | No intercept |
y ~ a + b + c + a:b | Three main effects and an interaction between a and b |
y ~ a * b | All main effects and interactions (same as a + b + a:b) |
y ~ factor(a) | Create dummy variables for a (if not already a factor) |
y ~ (a + b + c)^2 | All main effects and second-order interactions |
y ~ (a + b + c)^2 - a:b | All main effects and second-order interactions except a:b |
y ~ I(a^2) | Transform a to a^2 |
log(y) ~ a | Transform y to log(y) |
y ~ . | Main effect for each column in the data frame |
lm(lpsa ~ 1, data = prostate) ## intercept only
lm(lpsa ~ age, data = prostate) ## one main effect
lm(lpsa ~ -1 + age, data = prostate) ## no intercept
lm(lpsa ~ age + lweight + gleason + gleason:lweight, data = prostate) ## three main effects plus one interaction
lm(lpsa ~ gleason * lweight, data = prostate) ## main effects and their interaction
lm(lpsa ~ factor(gleason), data = prostate) ## dummy variables for gleason
lm(lpsa ~ (age + lweight + gleason)^2, data = prostate) ## all main effects and second-order interactions
lm(lpsa ~ (age + lweight + gleason)^2 - gleason:lweight, data = prostate) ## ... except gleason:lweight
lm(lpsa ~ I(age^2), data = prostate) ## age squared as the predictor
lm(log(age) ~ lpsa, data = prostate) ## log-transformed response
lm(lpsa ~ ., data = prostate) ## main effect for every other column
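Treating gleason as categorical, the summary below comes from a call of this form (reconstructed from the Call line):

summary(lm(lpsa ~ as.factor(gleason), data = prostate))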
##
## Call:
## lm(formula = lpsa ~ as.factor(gleason), data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.07390 -0.57561 -0.02379 0.60166 2.67156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7384 0.1730 10.051 < 2e-16 ***
## as.factor(gleason)7 1.1730 0.2205 5.320 7.12e-07 ***
## as.factor(gleason)8 0.4192 1.0377 0.404 0.6872
## as.factor(gleason)9 1.1348 0.4892 2.320 0.0225 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.023 on 93 degrees of freedom
## Multiple R-squared: 0.2388, Adjusted R-squared: 0.2143
## F-statistic: 9.726 on 3 and 93 DF, p-value: 1.208e-05
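The same model can be fit as a one-way ANOVA; the object name aov_prostate is taken from the Fit line of the Tukey output further below:

aov_prostate <- aov(lpsa ~ as.factor(gleason), data = prostate)
summary(aov_prostate)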
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(gleason) 3 30.55 10.183 9.726 1.21e-05 ***
## Residuals 93 97.37 1.047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
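Tukey's honest significant difference procedure compares all pairs of gleason levels while controlling the family-wise error rate:

TukeyHSD(aov_prostate)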
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = lpsa ~ as.factor(gleason), data = prostate)
##
## $`as.factor(gleason)`
## diff lwr upr p adj
## 7-6 1.17300730 0.5962186 1.749796 0.0000042
## 8-6 0.41918941 -2.2956321 3.134011 0.9775856
## 9-6 1.13480989 -0.1449692 2.414589 0.1008983
## 8-7 -0.75381789 -3.4544627 1.946827 0.8847334
## 9-7 -0.03819741 -1.2876217 1.211227 0.9998152
## 9-8 0.71562048 -2.2167219 3.647963 0.9192925
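Adding age as a covariate (reconstructed from the ANOVA table below):

summary(aov(lpsa ~ as.factor(gleason) + age, data = prostate))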
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(gleason) 3 30.55 10.183 9.646 1.34e-05 ***
## age 1 0.25 0.246 0.233 0.63
## Residuals 92 97.12 1.056
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
function | application |
---|---|
fitted.values() | Returns fitted values |
residuals() | Returns residuals |
rstandard() | Standardized residuals (variance one); residuals standardized using the overall error variance |
rstudent() | Studentized residuals (variance one); residuals standardized using a leave-one-out estimate of the error variance |
qqnorm() | Normal quantile plot |
qqline() | Add a line to the normal quantile plot |
plot() | Given an lm object, produces six diagnostic plots, selected using the which argument; the default is plots 1-3 and 5 |
vif() | Variance inflation factors, to check collinearity (from the car package) |
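The two outputs below are the first six fitted values and residuals of lm_prostate; a minimal reconstruction:

head(fitted(lm_prostate)) ## fitted values
head(residuals(lm_prostate)) ## residuals (observed minus fitted)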
## 1 2 3 4 5 6
## 1.446128 2.103985 1.436017 2.061715 2.246634 1.973246
## 1 2 3 4 5 6
## -1.876911 -2.266503 -1.598536 -2.224233 -1.875070 -1.207778
set.seed(32611)
a0 <- rnorm(1000)
a <- sort(a0)
quantile_a <- seq_along(a) / (length(a) + 1) ## empirical quantiles of the sampled data
b <- qnorm(quantile_a) ## theoretical normal quantiles at the same probabilities
plot(b, a) ## QQ-plot: theoretical vs. sample quantiles
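The Shapiro-Wilk test gives a formal check of normality; the three outputs below correspond to normal data, clearly non-normal data, and the residuals of lm_prostate:

shapiro.test(rnorm(1000)) ## normal data: large p-value
shapiro.test(1:1000) ## non-normal data: tiny p-value
shapiro.test(residuals(lm_prostate)) ## residuals are consistent with normality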
##
## Shapiro-Wilk normality test
##
## data: rnorm(1000)
## W = 0.99873, p-value = 0.7059
##
## Shapiro-Wilk normality test
##
## data: 1:1000
## W = 0.95481, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: residuals(lm_prostate)
## W = 0.98133, p-value = 0.1836
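The full model regresses lpsa on every other column; since the summary below shows no train coefficient, the train indicator has presumably been dropped from prostate beforehand:

lm_all <- lm(lpsa ~ ., data = prostate) ## assumes the train column was removed earlier
summary(lm_all)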
##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76644 -0.35510 -0.00328 0.38087 1.55770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.181561 1.320568 0.137 0.89096
## lcavol 0.564341 0.087833 6.425 6.55e-09 ***
## lweight 0.622020 0.200897 3.096 0.00263 **
## age -0.021248 0.011084 -1.917 0.05848 .
## lbph 0.096713 0.057913 1.670 0.09848 .
## svi 0.761673 0.241176 3.158 0.00218 **
## lcp -0.106051 0.089868 -1.180 0.24115
## gleason 0.049228 0.155341 0.317 0.75207
## pgg45 0.004458 0.004365 1.021 0.31000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6328
## F-statistic: 21.68 on 8 and 88 DF, p-value: < 2.2e-16
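Variance inflation factors for the full model; vif() is from the car package:

library(car)
vif(lm_all) ## VIF for each predictor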
## lcavol lweight age lbph svi lcp gleason pgg45
## 2.102650 1.453325 1.336099 1.385040 1.955928 3.097954 2.468891 2.974075
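The two numbers below are the AIC and BIC of the full model; BIC applies a heavier penalty per parameter (\(\log n\) instead of 2), hence the larger value:

AIC(lm_all)
BIC(lm_all)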
## [1] 216.4952
## [1] 242.2423
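Backward stepwise selection by AIC, starting from the full model; the trace below shows each elimination step:

step(lm_all)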
## Start: AIC=-60.78
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
##
## Df Sum of Sq RSS AIC
## - gleason 1 0.0491 43.108 -62.668
## - pgg45 1 0.5102 43.569 -61.636
## - lcp 1 0.6814 43.740 -61.256
## <none> 43.058 -60.779
## - lbph 1 1.3646 44.423 -59.753
## - age 1 1.7981 44.857 -58.810
## - lweight 1 4.6907 47.749 -52.749
## - svi 1 4.8803 47.939 -52.364
## - lcavol 1 20.1994 63.258 -25.467
##
## Step: AIC=-62.67
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
##
## Df Sum of Sq RSS AIC
## - lcp 1 0.6684 43.776 -63.176
## <none> 43.108 -62.668
## - pgg45 1 1.1987 44.306 -62.008
## - lbph 1 1.3844 44.492 -61.602
## - age 1 1.7579 44.865 -60.791
## - lweight 1 4.6429 47.751 -54.746
## - svi 1 4.8333 47.941 -54.360
## - lcavol 1 21.3191 64.427 -25.691
##
## Step: AIC=-63.18
## lpsa ~ lcavol + lweight + age + lbph + svi + pgg45
##
## Df Sum of Sq RSS AIC
## - pgg45 1 0.6607 44.437 -63.723
## <none> 43.776 -63.176
## - lbph 1 1.3329 45.109 -62.266
## - age 1 1.4878 45.264 -61.934
## - svi 1 4.1766 47.953 -56.336
## - lweight 1 4.6553 48.431 -55.373
## - lcavol 1 22.7555 66.531 -24.572
##
## Step: AIC=-63.72
## lpsa ~ lcavol + lweight + age + lbph + svi
##
## Df Sum of Sq RSS AIC
## <none> 44.437 -63.723
## - age 1 1.1588 45.595 -63.226
## - lbph 1 1.5087 45.945 -62.484
## - lweight 1 4.3140 48.751 -56.735
## - svi 1 5.8509 50.288 -53.724
## - lcavol 1 25.9427 70.379 -21.119
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph svi
## 0.49473 0.54400 0.58821 -0.01644 0.10122 0.71490
stepDetails <- step(lm(lpsa ~ ., data = prostate), trace = 0)$anova ## per-step AIC bookkeeping
stepDetails$AIC[2] - stepDetails$AIC[1] ## AIC change from the first elimination (dropping gleason)
## [1] -1.889365
library(minpack.lm) ## provides nls.lm, the Levenberg-Marquardt optimizer

getPred <- function(parS, xx) {
  ## parS: parameter list
  ## xx: x values (predictors)
  parS$A * sin(xx + parS$B) + parS$C ## model: y = A*sin(x + B) + C
}
residFun <- function(p, observed, xx) observed - getPred(p, xx) ## residual function
parStart <- list(A = 1, B = 3, C = 0) ## starting values
## x and y are the simulated predictor and noisy response (generation not shown)
nls.out <- nls.lm(par = parStart, fn = residFun, observed = y, xx = x)
nls.out
## Nonlinear regression via the Levenberg-Marquardt algorithm
## parameter estimates: -1.03727642432862, 3.18141544265264, -0.0260329085154092
## residual sum-of-squares: 3.783
## reason terminated: Conditions for `info = 1' and `info = 2' both hold.
getPred <- function(parS, xx) {
  parS$A * exp(parS$B * xx) ## model: y = A*exp(B*x)
}
residFun <- function(p, observed, xx) observed - getPred(p, xx)
parStart <- list(A = 3, B = 3) ## starting values
nls.out <- nls.lm(par = parStart, fn = residFun, observed = y, xx = x) ## x, y: simulated data (not shown)
nls.out
## Nonlinear regression via the Levenberg-Marquardt algorithm
## parameter estimates: 0.963577134214599, 1.00769428203048
## residual sum-of-squares: 95.74
## reason terminated: Conditions for `info = 1' and `info = 2' both hold.
How to parametrize the set of splines with knots at \(t_1, \ldots, t_m\)?
Given inputs \(x_1, \ldots, x_n\) and outputs \(y_1, \ldots, y_n\), we consider fitting a function \(f\) using \(k\)th-order splines with given knots at \(t_1, \ldots, t_m\): \[f(x) = \sum_{j=1}^{m+k+1} \beta_j g_j(x),\] where \(\beta_j, 1\le j \le m+k+1\) are coefficients and \(g_j, 1\le j \le m+k+1\) are basis functions for an order-\(k\) spline on the knots \(t_1, \ldots, t_m\) (e.g., the truncated power basis or the B-spline basis).
Define \(G \in \mathbb{R}^{n \times (m+k+1)}\), where \(G_{ij} = g_j(x_i)\).
The solution coefficients \(\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_{m+k+1})\) are obtained by least squares:
\[\hat{\beta} = \arg \min_{\beta \in \mathbb{R}^{m+k+1}} \|y - G\beta\|_2^2\]
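A minimal sketch in R using the B-spline basis from the splines package; the data and the knot locations here are illustrative choices, not from the notes:

library(splines)
x <- mtcars$mpg
y <- mtcars$disp
G <- bs(x, knots = quantile(x, c(0.25, 0.50, 0.75)), degree = 3) ## basis matrix, G[i, j] = g_j(x_i)
spline_fit <- lm(y ~ G) ## least squares estimate of the spline coefficients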
The problem with regression splines is that the estimates have high variance at the boundaries of the input domain.
\[\hat{f} = \arg\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b \big(f^{((k+1)/2)}(x)\big)^2 dx\]
- the first term is the least squares loss;
- the second term penalizes \(f\) if its derivative is too wiggly.
For the most common case, the cubic smoothing spline (\(k = 3\)): \[\hat{f} = \arg\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b (f''(x))^2 dx\]
plot(disp ~ mpg, data = mtcars, main = "data(mtcars) & smoothing splines")
cars.spl <- with(mtcars, smooth.spline(mpg, disp)) ## smoothing parameter chosen by generalized cross-validation (GCV)
cars.spl
## Call:
## smooth.spline(x = mpg, y = disp)
##
## Smoothing Parameter spar= 0.9009065 lambda= 0.02109193 (11 iterations)
## Equivalent Degrees of Freedom (Df): 3.211475
## Penalized Criterion (RSS): 61580.33
## GCV: 3828.877