Zhiguang Huo (Caleb)
Monday October 2, 2017
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p + \varepsilon, \text{with }\varepsilon \sim N(0, \sigma^2)\]
Assumptions: the mean of \(Y\) is linear in the predictors, and the errors are independent, normally distributed, with mean zero and constant variance \(\sigma^2\).
\[Y = X\beta + \varepsilon\]
Assumptions: \(\varepsilon \sim N(0, \sigma^2 I_n)\), i.e., independent, homoscedastic, normal errors.
Maximum likelihood estimator (MLE): \[\hat{\beta} = \arg \max_\beta L(\beta; X, Y)\]
For the linear regression model, the least squares (LS) estimator coincides with the MLE: \[\hat{\beta} = (X^\top X)^{-1} X^\top Y,\] assuming \(X^\top X\) is invertible.
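As a quick check, here is a minimal sketch (not from the original notes) computing the closed-form estimator directly on the built-in mtcars data; it reproduces the coefficients from the lm() example below.

X <- cbind(1, mtcars$cyl) ## design matrix with an intercept column
Y <- mtcars$mpg
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% Y ## (X^T X)^{-1} X^T Y
beta_hat ## matches coef(lm(mpg ~ cyl, data=mtcars))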
Basic syntax for lm()
lm(formula, data)
formula: A symbolic description of the model to be fitted
data: An optional data frame containing the variables in the model
The plot() function produces model diagnostic plots.
lmfit <- lm(mpg ~ cyl, data=mtcars)
lmfit
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Coefficients:
## (Intercept) cyl
## 37.885 -2.876
summary(lmfit)
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9814 -2.1185 0.2217 1.0717 7.5186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
## cyl -2.8758 0.3224 -8.92 6.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
## F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
plot(lmfit)
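plot() draws the diagnostic plots one at a time by default; a common idiom (a small sketch using only base graphics) puts the four default plots on one page:

par(mfrow=c(2,2)) ## 2 x 2 grid of panels
plot(lmfit)
par(mfrow=c(1,1)) ## reset the layout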
The data come from the book The Elements of Statistical Learning.
library(ElemStatLearn)
str(prostate)
## 'data.frame': 97 obs. of 10 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
prostate$train <- NULL
Explanation of these variables:

- lcavol: log cancer volume
- lweight: log prostate weight
- age: patient age
- lbph: log of the amount of benign prostatic hyperplasia
- svi: seminal vesicle invasion
- lcp: log of capsular penetration
- gleason: Gleason score
- pgg45: percent of Gleason scores 4 or 5
- lpsa: log prostate-specific antigen (the response)
head(prostate)
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
par(mfrow=c(3,3))
for(i in 1:9){
  aname <- names(prostate)[i]
  hist(prostate[[i]], main=aname, col="steelblue", breaks=20)
}
What can we learn from these histograms?
cor_prostate <- cor(prostate)
max(cor_prostate[cor_prostate!=1])
## [1] 0.7519045
maxID <- arrayInd(which(max(cor_prostate[cor_prostate!=1]) == cor_prostate), dim(cor_prostate)) ## row/column indices of the largest off-diagonal correlation
names(prostate[maxID[1,]])
## [1] "pgg45" "gleason"
heatmap(cor_prostate, scale="none") ## heatmap instead of image; use Rowv = NA, Colv = NA to suppress the hierarchical clustering dendrograms
pairs(prostate,pch=19)
Syntax | Function |
---|---|
+ | Separate main effects |
: | Denote interactions |
* | Main effects and interactions |
^n | Main effects and interactions up to order n |
- | Remove the specified terms |
I() | Brackets the portions of a formula where operators are used mathematically |
. | Main effect for each column in the data frame, except the response |
lm_prostate <- lm(lpsa ~ age + lweight, data = prostate)
lm_prostate
##
## Call:
## lm(formula = lpsa ~ age + lweight, data = prostate)
##
## Coefficients:
## (Intercept) age lweight
## -1.897709 0.003318 1.147487
class(lm_prostate)
## [1] "lm"
names(lm_prostate)
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
function | application |
---|---|
summary() | Summary of the linear model |
coef() | Get coefficients |
confint() | Confidence intervals for model parameters |
predict() | Make predictions |
fitted() | Get fitted values |
summary(lm_prostate)
##
## Call:
## lm(formula = lpsa ~ age + lweight, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2665 -0.6753 -0.0361 0.5317 2.9712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.897709 1.119033 -1.696 0.0932 .
## age 0.003318 0.015369 0.216 0.8295
## lweight 1.147487 0.267094 4.296 4.23e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.051 on 94 degrees of freedom
## Multiple R-squared: 0.1882, Adjusted R-squared: 0.1709
## F-statistic: 10.89 on 2 and 94 DF, p-value: 5.558e-05
coef(lm_prostate)
## (Intercept) age lweight
## -1.897708604 0.003318361 1.147487156
confint(lm_prostate)
## 2.5 % 97.5 %
## (Intercept) -4.11957499 0.3241578
## age -0.02719768 0.0338344
## lweight 0.61716615 1.6778082
anew <- data.frame(age = 29, lweight = 100)
predict(lm_prostate, anew)
## 1
## 112.9472
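predict() can also return interval estimates through its interval argument; a brief sketch for the same new data point:

predict(lm_prostate, anew, interval="confidence") ## CI for the mean response
predict(lm_prostate, anew, interval="prediction") ## PI for a new observation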
head(fitted(lm_prostate))
## 1 2 3 4 5 6
## 1.446128 2.103985 1.436017 2.061715 2.246634 1.973246
Model | Interpretation |
---|---|
y ~ 1 | Just the intercept |
y ~ a | One main effect |
y ~ - 1 + a | One main effect, no intercept |
y ~ a+b+c+a:b | Three main effects and an interaction between a and b |
y ~ a * b | All main effects and interactions (same as a+b+a:b) |
y ~ factor(a) | Create dummy variables for a (if not already a factor) |
y ~ (a+b+c)^2 | All main effects and second-order interactions |
y ~ (a+b+c)^2 - a:b | All main effects and second-order interactions except a:b |
y ~ I(a^2) | Transform a to a^2 |
log(y) ~ a | Transform the response to log(y) |
y ~ a/b/c | Factor c nested within factor b within factor a |
y ~ . | Main effect for each column in the data frame |
lm(lpsa ~ 1, data = prostate)
lm(lpsa ~ age, data = prostate)
lm(lpsa ~ - 1 + age, data = prostate)
lm(lpsa ~ age + lweight + gleason + gleason:lweight, data = prostate)
lm(lpsa ~ gleason*lweight, data = prostate)
lm(lpsa ~ factor(gleason), data = prostate)
lm(lpsa ~ (age + lweight + gleason)^2, data = prostate)
lm(lpsa ~ (age + lweight + gleason)^2 - gleason:lweight, data = prostate)
lm(lpsa ~ I(age^2), data = prostate)
lm(log(age) ~ lpsa, data = prostate)
lm(lpsa ~ ., data = prostate)
lm(lpsa ~ as.factor(svi)/as.factor(gleason), data = prostate)
blm <- lm(lpsa ~ as.factor(gleason), data = prostate)
summary(blm)
##
## Call:
## lm(formula = lpsa ~ as.factor(gleason), data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.07390 -0.57561 -0.02379 0.60166 2.67156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7384 0.1730 10.051 < 2e-16 ***
## as.factor(gleason)7 1.1730 0.2205 5.320 7.12e-07 ***
## as.factor(gleason)8 0.4192 1.0377 0.404 0.6872
## as.factor(gleason)9 1.1348 0.4892 2.320 0.0225 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.023 on 93 degrees of freedom
## Multiple R-squared: 0.2388, Adjusted R-squared: 0.2143
## F-statistic: 9.726 on 3 and 93 DF, p-value: 1.208e-05
summary.aov(blm)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(gleason) 3 30.55 10.183 9.726 1.21e-05 ***
## Residuals 93 97.37 1.047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
clm <- aov(lpsa ~ as.factor(gleason), data = prostate)
summary(clm)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(gleason) 3 30.55 10.183 9.726 1.21e-05 ***
## Residuals 93 97.37 1.047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(clm) ## multiple comparisons, Tukey's Honest Significant Difference
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = lpsa ~ as.factor(gleason), data = prostate)
##
## $`as.factor(gleason)`
## diff lwr upr p adj
## 7-6 1.17300730 0.5962186 1.749796 0.0000042
## 8-6 0.41918941 -2.2956321 3.134011 0.9775856
## 9-6 1.13480989 -0.1449692 2.414589 0.1008983
## 8-7 -0.75381789 -3.4544627 1.946827 0.8847334
## 9-7 -0.03819741 -1.2876217 1.211227 0.9998152
## 9-8 0.71562048 -2.2167219 3.647963 0.9192925
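The Tukey intervals can also be visualized with the associated plot method (las=1 just rotates the axis labels):

plot(TukeyHSD(clm), las=1) ## one interval per pairwise comparison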
dlm <- aov(lpsa ~ as.factor(gleason) + age, data = prostate)
summary(dlm)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(gleason) 3 30.55 10.183 9.646 1.34e-05 ***
## age 1 0.25 0.246 0.233 0.63
## Residuals 92 97.12 1.056
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
function | application |
---|---|
fitted.values() | Returns fitted values |
residuals() | Returns residuals |
rstandard() | Standardized residuals, variance one; residual standardized using overall error variance |
rstudent() | Studentized residuals, variance one; residual standardized using leave-one-out measure of the error variance |
qqnorm() | Normal quantile plot |
qqline() | Add a line to the normal quantile plot |
plot() | Given an lm object, produces six diagnostic plots selected with the which argument; the default is plots 1-3 and 5 |
lm_prostate <- lm(lpsa ~ age + lweight, data = prostate)
head(fitted(lm_prostate))
## 1 2 3 4 5 6
## 1.446128 2.103985 1.436017 2.061715 2.246634 1.973246
head(residuals(lm_prostate))
## 1 2 3 4 5 6
## -1.876911 -2.266503 -1.598536 -2.224233 -1.875070 -1.207778
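The standardized and studentized residuals from the table above can be inspected the same way; a short sketch, including a manual normal quantile plot of the raw residuals:

head(rstandard(lm_prostate)) ## standardized: overall error variance
head(rstudent(lm_prostate)) ## studentized: leave-one-out error variance
qqnorm(residuals(lm_prostate))
qqline(residuals(lm_prostate))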
plot(lm_prostate, 1)
set.seed(32611)
qqnorm(rnorm(1000))
set.seed(32611)
qqnorm(rnorm(1000, sd=10))
set.seed(32611)
qqnorm(1:1000)
set.seed(32611)
qqnorm(rexp(1000))
lm_prostate <- lm(lpsa ~ age + lweight, data = prostate)
plot(lm_prostate, 2)
set.seed(32611)
shapiro.test(rnorm(1000))
##
## Shapiro-Wilk normality test
##
## data: rnorm(1000)
## W = 0.99873, p-value = 0.7059
shapiro.test(1:1000)
##
## Shapiro-Wilk normality test
##
## data: 1:1000
## W = 0.95481, p-value < 2.2e-16
shapiro.test(residuals(lm_prostate))
##
## Shapiro-Wilk normality test
##
## data: residuals(lm_prostate)
## W = 0.98133, p-value = 0.1836
plot(lm_prostate, 3)
plot(lm_prostate, 5)
plot(lm_prostate, 4)
lm_all <- lm(lpsa ~ ., data=prostate) ## '.' means all of the other variables
summary(lm_all)
##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76644 -0.35510 -0.00328 0.38087 1.55770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.181561 1.320568 0.137 0.89096
## lcavol 0.564341 0.087833 6.425 6.55e-09 ***
## lweight 0.622020 0.200897 3.096 0.00263 **
## age -0.021248 0.011084 -1.917 0.05848 .
## lbph 0.096713 0.057913 1.670 0.09848 .
## svi 0.761673 0.241176 3.158 0.00218 **
## lcp -0.106051 0.089868 -1.180 0.24115
## gleason 0.049228 0.155341 0.317 0.75207
## pgg45 0.004458 0.004365 1.021 0.31000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6328
## F-statistic: 21.68 on 8 and 88 DF, p-value: < 2.2e-16
AIC(lm_all)
## [1] 216.4952
BIC(lm_all)
## [1] 242.2423
step(lm(lpsa ~ ., data=prostate)) ## with no scope argument, step() performs backward selection
## Start: AIC=-60.78
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
##
## Df Sum of Sq RSS AIC
## - gleason 1 0.0491 43.108 -62.668
## - pgg45 1 0.5102 43.569 -61.636
## - lcp 1 0.6814 43.740 -61.256
## <none> 43.058 -60.779
## - lbph 1 1.3646 44.423 -59.753
## - age 1 1.7981 44.857 -58.810
## - lweight 1 4.6907 47.749 -52.749
## - svi 1 4.8803 47.939 -52.364
## - lcavol 1 20.1994 63.258 -25.467
##
## Step: AIC=-62.67
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
##
## Df Sum of Sq RSS AIC
## - lcp 1 0.6684 43.776 -63.176
## <none> 43.108 -62.668
## - pgg45 1 1.1987 44.306 -62.008
## - lbph 1 1.3844 44.492 -61.602
## - age 1 1.7579 44.865 -60.791
## - lweight 1 4.6429 47.751 -54.746
## - svi 1 4.8333 47.941 -54.360
## - lcavol 1 21.3191 64.427 -25.691
##
## Step: AIC=-63.18
## lpsa ~ lcavol + lweight + age + lbph + svi + pgg45
##
## Df Sum of Sq RSS AIC
## - pgg45 1 0.6607 44.437 -63.723
## <none> 43.776 -63.176
## - lbph 1 1.3329 45.109 -62.266
## - age 1 1.4878 45.264 -61.934
## - svi 1 4.1766 47.953 -56.336
## - lweight 1 4.6553 48.431 -55.373
## - lcavol 1 22.7555 66.531 -24.572
##
## Step: AIC=-63.72
## lpsa ~ lcavol + lweight + age + lbph + svi
##
## Df Sum of Sq RSS AIC
## <none> 44.437 -63.723
## - age 1 1.1588 45.595 -63.226
## - lbph 1 1.5087 45.945 -62.484
## - lweight 1 4.3140 48.751 -56.735
## - svi 1 5.8509 50.288 -53.724
## - lcavol 1 25.9427 70.379 -21.119
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph
## 0.49473 0.54400 0.58821 -0.01644 0.10122
## svi
## 0.71490
stepDetails <- step(lm(lpsa ~ ., data=prostate), trace=0)$anova
stepDetails$AIC[2] - stepDetails$AIC[1]
## [1] -1.889365
AIC(lm(lpsa ~ . - gleason, data=prostate)) - AIC(lm(lpsa ~ ., data=prostate))
## [1] -1.889365
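step() uses the AIC penalty k = 2 by default; setting k = log(n) performs BIC-based selection instead. A minimal sketch:

n <- nrow(prostate)
step(lm(lpsa ~ ., data=prostate), k=log(n), trace=0) ## BIC penalty per parameter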
library(minpack.lm)
## simulate data from sin function
set.seed(32611)
n <- 100
x <- runif(n,0,2*pi)
y <- sin(x) + rnorm(n,sd=0.2)
plot(x,y)
getPred <- function(parS, xx) {
  parS$A * sin(xx + parS$B) + parS$C
}
residFun <- function(p, observed, xx) observed - getPred(p,xx)
parStart <- list(A=1, B=3, C=0)
nls.out <- nls.lm(par=parStart, fn = residFun, observed = y, xx = x)
nls.out
## Nonlinear regression via the Levenberg-Marquardt algorithm
## parameter estimates: -1.03727642432862, 3.18141544265264, -0.0260329085154092
## residual sum-of-squares: 3.783
## reason terminated: Conditions for `info = 1' and `info = 2' both hold.
aseq <- seq(0,2*pi,length.out = 1000)
plot(x,y)
with(nls.out$par, lines(aseq, A*sin(aseq + B) + C, col=2))
set.seed(32611)
n <- 100
x <- runif(n,0,5)
y <- exp(x) + rnorm(n,sd=1)
plot(x,y)
getPred <- function(parS, xx) {
  parS$A * exp(parS$B * xx)
}
residFun <- function(p, observed, xx) observed - getPred(p,xx)
parStart <- list(A=3, B=3)
nls.out <- nls.lm(par=parStart, fn = residFun, observed = y, xx = x)
nls.out
## Nonlinear regression via the Levenberg-Marquardt algorithm
## parameter estimates: 0.963577134214599, 1.00769428203048
## residual sum-of-squares: 95.74
## reason terminated: Conditions for `info = 1' and `info = 2' both hold.
aseq <- seq(0,5,length.out = 1000)
plot(x,y)
with(nls.out$par, lines(aseq, A*exp(B*aseq), col=2))
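minpack.lm also supplies a summary() method for nls.lm fits, which reports standard errors for the parameter estimates:

summary(nls.out)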
How to parametrize the set of splines with knots at \(t_1, \ldots, t_m\)?
Given inputs \(x_1, \ldots, x_n\) and outputs \(y_1, \ldots, y_n\), we consider fitting a function \(f\) using \(k^{th}\)-order splines with given knots at \(t_1, \ldots, t_m\): \[f(x) = \sum_{j=1}^{m+k+1} \beta_j g_j(x),\] where \(\beta_j, 1\le j \le m+k+1\) are coefficients and \(g_j, 1\le j \le m+k+1\) are basis functions for an order-\(k\) spline on the knots \(t_1, \ldots, t_m\) (e.g., the truncated power basis or the B-spline basis).
The solution coefficients \(\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_{m+k+1})\) solve an ordinary least squares problem
\[\hat{\beta} = \arg \min_{\beta \in \mathbb{R}^{m+k+1}} \|y - G\beta\|_2^2,\]
where \(G \in \mathbb{R}^{n \times (m+k+1)}\) is the basis matrix with entries \(G_{ij} = g_j(x_i)\).
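Because the basis expansion reduces spline fitting to ordinary least squares, lm() can solve it directly. A minimal sketch using the B-spline basis bs() from the base splines package (the knot locations below are illustrative assumptions); with m = 3 interior knots and degree k = 3, the basis plus intercept has m + k + 1 = 7 columns, matching the formula above:

library(splines) ## provides bs(), the B-spline basis
set.seed(32611)
x <- runif(100, 0, 2*pi)
y <- sin(x) + rnorm(100, sd=0.2)
fit_bs <- lm(y ~ bs(x, knots=c(pi/2, pi, 3*pi/2), degree=3)) ## G is built by bs()
aseq <- seq(0, 2*pi, length.out=200)
plot(x, y)
lines(aseq, predict(fit_bs, data.frame(x=aseq)), col=2) ## fitted regression spline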
The problem with regression splines is that the estimates have high variance at the boundaries of the input domain.
\[\hat{f} = \arg\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b \left(f^{\left(\frac{k+1}{2}\right)}(x)\right)^2 dx\]

- The first term is the least squares loss.
- The second term penalizes \(f\) when its \(\frac{k+1}{2}\)-th derivative is too wiggly.
For the cubic case (\(k = 3\)), \[\hat{f} = \arg\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b (f''(x))^2 dx\]

- Remarkably, the minimizer of the general smoothing spline problem turns out to be a natural \(k^{th}\)-order spline with knots at the input points \(x_1, \ldots, x_n\).
plot(disp ~ mpg, data = mtcars, main = "data(mtcars)")
plot(disp ~ mpg, data = mtcars, main = "data(mtcars) & smoothing splines")
cars.spl <- with(mtcars, smooth.spline(mpg, disp))
cars.spl
## Call:
## smooth.spline(x = mpg, y = disp)
##
## Smoothing Parameter spar= 0.9009065 lambda= 0.02109193 (11 iterations)
## Equivalent Degrees of Freedom (Df): 3.211475
## Penalized Criterion: 61580.33
## GCV: 3828.877
lines(cars.spl, col = "blue")
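The amount of smoothing can also be fixed directly through the df argument instead of being chosen by GCV; a small sketch overlaying a rougher fit:

cars.spl10 <- with(mtcars, smooth.spline(mpg, disp, df=10)) ## 10 equivalent df
lines(cars.spl10, lty=2, col="darkgreen")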
knitr::purl("linearModel.Rmd", output = "linearModel.R", documentation = 2)
##
##
## processing file: linearModel.Rmd
##
## output file: linearModel.R
## [1] "linearModel.R "