Zhiguang Huo (Caleb)
Monday October 19, 2020
\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p + \varepsilon, \text{with }\varepsilon \sim N(0, \sigma^2)\]
Assumptions:

- Linearity: the mean of \(Y\) is a linear function of \(X_1, \ldots, X_p\).
- Independence: the errors are independent across observations.
- Homoscedasticity: the errors have constant variance \(\sigma^2\).
- Normality: the errors are normally distributed.
\[Y = X\beta + \varepsilon\]
Assumptions:

- \(Y \in \mathbb{R}^n\) is the response vector; \(X \in \mathbb{R}^{n \times (p+1)}\) is the design matrix, whose first column is all ones for the intercept.
- \(\beta \in \mathbb{R}^{p+1}\) is the coefficient vector.
- \(\varepsilon \sim N(0, \sigma^2 I_n)\).
Maximum likelihood estimator (MLE): \[\hat{\beta} = \arg \max_\beta L(\beta; X, Y)\]
For the linear regression model, the least squares (LS) estimator \(\hat{\beta} = \arg \min_\beta \|Y - X\beta\|_2^2\) is the same as the MLE: \[\hat{\beta} = (X^\top X)^{-1} X^\top Y,\] assuming \(X^\top X\) is invertible.
Basic syntax for lm()
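The output below is produced by a call of the following form (reconstructed from the Call line in the output):

lm(mpg ~ cyl, data = mtcars)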
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Coefficients:
## (Intercept) cyl
## 37.885 -2.876
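Wrapping the fit in summary() gives the full inference table (again reconstructed from the Call line below):

summary(lm(mpg ~ cyl, data = mtcars))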
##
## Call:
## lm(formula = mpg ~ cyl, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.9814 -2.1185 0.2217 1.0717 7.5186
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
## cyl -2.8758 0.3224 -8.92 6.11e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.206 on 30 degrees of freedom
## Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
## F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
The data come from the book The Elements of Statistical Learning.
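One way to load the data is via the ElemStatLearn package, which ships the prostate dataset; this is a sketch assuming that package is installed, with str() producing the output below:

library(ElemStatLearn) ## assumed source of the prostate data
data(prostate)
str(prostate)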
## 'data.frame': 97 obs. of 10 variables:
## $ lcavol : num -0.58 -0.994 -0.511 -1.204 0.751 ...
## $ lweight: num 2.77 3.32 2.69 3.28 3.43 ...
## $ age : int 50 58 74 58 62 50 64 58 47 63 ...
## $ lbph : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ svi : int 0 0 0 0 0 0 0 0 0 0 ...
## $ lcp : num -1.39 -1.39 -1.39 -1.39 -1.39 ...
## $ gleason: int 6 6 7 6 6 6 6 6 6 6 ...
## $ pgg45 : int 0 0 20 0 0 0 0 0 0 0 ...
## $ lpsa : num -0.431 -0.163 -0.163 -0.163 0.372 ...
## $ train : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
Explanation of these variables:

- lcavol: log cancer volume
- lweight: log prostate weight
- age: age of the patient
- lbph: log of the amount of benign prostatic hyperplasia
- svi: seminal vesicle invasion
- lcp: log of capsular penetration
- gleason: Gleason score
- pgg45: percentage of Gleason scores 4 or 5
- lpsa: log prostate-specific antigen (the response)
- train: indicator of the training set
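The first few rows (the train indicator is excluded, matching the output below):

head(prostate[, 1:9])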
## lcavol lweight age lbph svi lcp gleason pgg45 lpsa
## 1 -0.5798185 2.769459 50 -1.386294 0 -1.386294 6 0 -0.4307829
## 2 -0.9942523 3.319626 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 3 -0.5108256 2.691243 74 -1.386294 0 -1.386294 7 20 -0.1625189
## 4 -1.2039728 3.282789 58 -1.386294 0 -1.386294 6 0 -0.1625189
## 5 0.7514161 3.432373 62 -1.386294 0 -1.386294 6 0 0.3715636
## 6 -1.0498221 3.228826 50 -1.386294 0 -1.386294 6 0 0.7654678
What can we learn from the histogram?
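The largest pairwise correlation (excluding the diagonal) can be computed as below; a minimal reconstruction assuming cor_prostate is the correlation matrix of the nine numeric columns:

cor_prostate <- cor(prostate[, 1:9]) ## pairwise correlations among the numeric variables
max(cor_prostate[cor_prostate != 1]) ## largest off-diagonal correlation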
## [1] 0.7519045
maxID <- arrayInd(which(max(cor_prostate[cor_prostate != 1]) == cor_prostate)[1], dim(cor_prostate)) ## (row, column) indices of the maximum correlation
names(prostate)[maxID] ## the pair of variables achieving that correlation
## [1] "pgg45" "gleason"
Syntax | Function |
---|---|
+ | Separate main effects |
: | Denote interactions |
* | Main effects and interactions |
^n | Include all main effects and interactions up to order n |
- | Remove the specified terms |
I() | Brackets the portions of a formula where operators are used mathematically |
. | Main effect for each column in the data frame, except the response |
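For example, the additive model below regresses lpsa on age and lweight; the object is named lm_prostate, matching the name used with residuals() later in these notes:

lm_prostate <- lm(lpsa ~ age + lweight, data = prostate)
lm_prostate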
##
## Call:
## lm(formula = lpsa ~ age + lweight, data = prostate)
##
## Coefficients:
## (Intercept) age lweight
## -1.897709 0.003318 1.147487
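An lm fit is a list-like object; its class and component names (the two outputs below) can be inspected as follows:

class(lm_prostate) ## class of the fitted object
names(lm_prostate) ## components stored inside it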
## [1] "lm"
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
function | application |
---|---|
summary() | Summary of the linear model |
coef() | Extract coefficients |
confint() | Confidence intervals for model parameters |
predict() | Make a prediction |
fitted() | Get fitted values (e.g., \(X\hat{\beta}\)) |
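Applied to lm_prostate, these functions produce the outputs below; the newdata used for predict() is not shown in the notes, so it is left unspecified here:

summary(lm_prostate) ## full inference table
coef(lm_prostate) ## coefficient estimates
confint(lm_prostate) ## 95% confidence intervals
## predict(lm_prostate, newdata = ...) ## prediction for new data (newdata not shown)
head(fitted(lm_prostate)) ## fitted values for the first six observations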
##
## Call:
## lm(formula = lpsa ~ age + lweight, data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2665 -0.6753 -0.0361 0.5317 2.9712
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.897709 1.119033 -1.696 0.0932 .
## age 0.003318 0.015369 0.216 0.8295
## lweight 1.147487 0.267094 4.296 4.23e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.051 on 94 degrees of freedom
## Multiple R-squared: 0.1882, Adjusted R-squared: 0.1709
## F-statistic: 10.89 on 2 and 94 DF, p-value: 5.558e-05
## (Intercept) age lweight
## -1.897708604 0.003318361 1.147487156
## 2.5 % 97.5 %
## (Intercept) -4.11957499 0.3241578
## age -0.02719768 0.0338344
## lweight 0.61716615 1.6778082
## 1
## 112.9472
## 1 2 3 4 5 6
## 1.446128 2.103985 1.436017 2.061715 2.246634 1.973246
Model | Interpretation |
---|---|
y ~ 1 | Just the intercept |
y ~ a | One main effect |
y ~ -1 + a | No intercept |
y ~ a + b + c + a:b | Three main effects and an interaction between a and b |
y ~ a * b | All main effects and interactions (same as a + b + a:b) |
y ~ factor(a) | Create dummy variables for a (if not already a factor) |
y ~ (a + b + c)^2 | All main effects and second-order interactions |
y ~ (a + b + c)^2 - a:b | All main effects and second-order interactions except a:b |
y ~ I(a^2) | Transform a to a^2 |
log(y) ~ a | Transform y to log(y) |
y ~ . | Main effect for each column in the data frame |
lm(lpsa ~ 1, data = prostate) ## intercept only
lm(lpsa ~ age, data = prostate) ## one main effect
lm(lpsa ~ -1 + age, data = prostate) ## no intercept
lm(lpsa ~ age + lweight + gleason + gleason:lweight, data = prostate) ## three main effects plus one interaction
lm(lpsa ~ gleason * lweight, data = prostate) ## main effects and their interaction
lm(lpsa ~ factor(gleason), data = prostate) ## dummy variables for gleason
lm(lpsa ~ (age + lweight + gleason)^2, data = prostate) ## all main effects and second-order interactions
lm(lpsa ~ (age + lweight + gleason)^2 - gleason:lweight, data = prostate) ## ... except gleason:lweight
lm(lpsa ~ I(age^2), data = prostate) ## age squared as the predictor
lm(log(age) ~ lpsa, data = prostate) ## log-transformed response
lm(lpsa ~ ., data = prostate) ## main effect for every other column
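Treating gleason as categorical, the summary below comes from a call of this form (reconstructed from the Call line):

summary(lm(lpsa ~ as.factor(gleason), data = prostate))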
##
## Call:
## lm(formula = lpsa ~ as.factor(gleason), data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.07390 -0.57561 -0.02379 0.60166 2.67156
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.7384 0.1730 10.051 < 2e-16 ***
## as.factor(gleason)7 1.1730 0.2205 5.320 7.12e-07 ***
## as.factor(gleason)8 0.4192 1.0377 0.404 0.6872
## as.factor(gleason)9 1.1348 0.4892 2.320 0.0225 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.023 on 93 degrees of freedom
## Multiple R-squared: 0.2388, Adjusted R-squared: 0.2143
## F-statistic: 9.726 on 3 and 93 DF, p-value: 1.208e-05
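The same model can be fit as a one-way ANOVA; the object name aov_prostate is taken from the Fit line of the Tukey output further below:

aov_prostate <- aov(lpsa ~ as.factor(gleason), data = prostate)
summary(aov_prostate)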
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(gleason) 3 30.55 10.183 9.726 1.21e-05 ***
## Residuals 93 97.37 1.047
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
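Tukey's honest significant difference procedure compares all pairs of gleason levels while controlling the family-wise error rate:

TukeyHSD(aov_prostate)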
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = lpsa ~ as.factor(gleason), data = prostate)
##
## $`as.factor(gleason)`
## diff lwr upr p adj
## 7-6 1.17300730 0.5962186 1.749796 0.0000042
## 8-6 0.41918941 -2.2956321 3.134011 0.9775856
## 9-6 1.13480989 -0.1449692 2.414589 0.1008983
## 8-7 -0.75381789 -3.4544627 1.946827 0.8847334
## 9-7 -0.03819741 -1.2876217 1.211227 0.9998152
## 9-8 0.71562048 -2.2167219 3.647963 0.9192925
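Adding age as a covariate (reconstructed from the ANOVA table below):

summary(aov(lpsa ~ as.factor(gleason) + age, data = prostate))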
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(gleason) 3 30.55 10.183 9.646 1.34e-05 ***
## age 1 0.25 0.246 0.233 0.63
## Residuals 92 97.12 1.056
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
function | application |
---|---|
fitted.values() | Returns fitted values |
residuals() | Returns residuals |
rstandard() | Standardized residuals (variance one); residuals standardized using the overall error variance |
rstudent() | Studentized residuals (variance one); residuals standardized using a leave-one-out estimate of the error variance |
qqnorm() | Normal quantile plot |
qqline() | Add a line to the normal quantile plot |
plot() | Given an lm object, produces six diagnostic plots, selected using the which argument; the default is plots 1-3 and 5 |
vif() | Variance inflation factors, to check collinearity (from the car package) |
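The two outputs below are the first six fitted values and residuals of lm_prostate; a minimal reconstruction:

head(fitted(lm_prostate)) ## fitted values
head(residuals(lm_prostate)) ## residuals (observed minus fitted)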
## 1 2 3 4 5 6
## 1.446128 2.103985 1.436017 2.061715 2.246634 1.973246
## 1 2 3 4 5 6
## -1.876911 -2.266503 -1.598536 -2.224233 -1.875070 -1.207778
set.seed(32611)
a0 <- rnorm(1000)
a <- sort(a0)
quantile_a <- seq_along(a) / (length(a) + 1) ## empirical quantiles of the sampled data
b <- qnorm(quantile_a) ## theoretical normal quantiles at the same probabilities
plot(b, a) ## QQ-plot: theoretical vs. sample quantiles
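The Shapiro-Wilk test gives a formal check of normality; the three outputs below correspond to normal data, clearly non-normal data, and the residuals of lm_prostate:

shapiro.test(rnorm(1000)) ## normal data: large p-value
shapiro.test(1:1000) ## non-normal data: tiny p-value
shapiro.test(residuals(lm_prostate)) ## residuals are consistent with normality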
##
## Shapiro-Wilk normality test
##
## data: rnorm(1000)
## W = 0.99873, p-value = 0.7059
##
## Shapiro-Wilk normality test
##
## data: 1:1000
## W = 0.95481, p-value < 2.2e-16
##
## Shapiro-Wilk normality test
##
## data: residuals(lm_prostate)
## W = 0.98133, p-value = 0.1836
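The full model regresses lpsa on every other column; since the summary below shows no train coefficient, the train indicator has presumably been dropped from prostate beforehand:

lm_all <- lm(lpsa ~ ., data = prostate) ## assumes the train column was removed earlier
summary(lm_all)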
##
## Call:
## lm(formula = lpsa ~ ., data = prostate)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.76644 -0.35510 -0.00328 0.38087 1.55770
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.181561 1.320568 0.137 0.89096
## lcavol 0.564341 0.087833 6.425 6.55e-09 ***
## lweight 0.622020 0.200897 3.096 0.00263 **
## age -0.021248 0.011084 -1.917 0.05848 .
## lbph 0.096713 0.057913 1.670 0.09848 .
## svi 0.761673 0.241176 3.158 0.00218 **
## lcp -0.106051 0.089868 -1.180 0.24115
## gleason 0.049228 0.155341 0.317 0.75207
## pgg45 0.004458 0.004365 1.021 0.31000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6995 on 88 degrees of freedom
## Multiple R-squared: 0.6634, Adjusted R-squared: 0.6328
## F-statistic: 21.68 on 8 and 88 DF, p-value: < 2.2e-16
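Variance inflation factors for the full model; vif() is from the car package:

library(car)
vif(lm_all) ## VIF for each predictor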
## lcavol lweight age lbph svi lcp gleason pgg45
## 2.102650 1.453325 1.336099 1.385040 1.955928 3.097954 2.468891 2.974075
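The two numbers below are the AIC and BIC of the full model; BIC applies a heavier penalty per parameter (\(\log n\) instead of 2), hence the larger value:

AIC(lm_all)
BIC(lm_all)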
## [1] 216.4952
## [1] 242.2423
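Backward stepwise selection by AIC, starting from the full model; the trace below shows each elimination step:

step(lm_all)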
## Start: AIC=-60.78
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + gleason +
## pgg45
##
## Df Sum of Sq RSS AIC
## - gleason 1 0.0491 43.108 -62.668
## - pgg45 1 0.5102 43.569 -61.636
## - lcp 1 0.6814 43.740 -61.256
## <none> 43.058 -60.779
## - lbph 1 1.3646 44.423 -59.753
## - age 1 1.7981 44.857 -58.810
## - lweight 1 4.6907 47.749 -52.749
## - svi 1 4.8803 47.939 -52.364
## - lcavol 1 20.1994 63.258 -25.467
##
## Step: AIC=-62.67
## lpsa ~ lcavol + lweight + age + lbph + svi + lcp + pgg45
##
## Df Sum of Sq RSS AIC
## - lcp 1 0.6684 43.776 -63.176
## <none> 43.108 -62.668
## - pgg45 1 1.1987 44.306 -62.008
## - lbph 1 1.3844 44.492 -61.602
## - age 1 1.7579 44.865 -60.791
## - lweight 1 4.6429 47.751 -54.746
## - svi 1 4.8333 47.941 -54.360
## - lcavol 1 21.3191 64.427 -25.691
##
## Step: AIC=-63.18
## lpsa ~ lcavol + lweight + age + lbph + svi + pgg45
##
## Df Sum of Sq RSS AIC
## - pgg45 1 0.6607 44.437 -63.723
## <none> 43.776 -63.176
## - lbph 1 1.3329 45.109 -62.266
## - age 1 1.4878 45.264 -61.934
## - svi 1 4.1766 47.953 -56.336
## - lweight 1 4.6553 48.431 -55.373
## - lcavol 1 22.7555 66.531 -24.572
##
## Step: AIC=-63.72
## lpsa ~ lcavol + lweight + age + lbph + svi
##
## Df Sum of Sq RSS AIC
## <none> 44.437 -63.723
## - age 1 1.1588 45.595 -63.226
## - lbph 1 1.5087 45.945 -62.484
## - lweight 1 4.3140 48.751 -56.735
## - svi 1 5.8509 50.288 -53.724
## - lcavol 1 25.9427 70.379 -21.119
##
## Call:
## lm(formula = lpsa ~ lcavol + lweight + age + lbph + svi, data = prostate)
##
## Coefficients:
## (Intercept) lcavol lweight age lbph svi
## 0.49473 0.54400 0.58821 -0.01644 0.10122 0.71490
stepDetails <- step(lm(lpsa ~ ., data = prostate), trace = 0)$anova ## per-step AIC bookkeeping
stepDetails$AIC[2] - stepDetails$AIC[1] ## AIC change from the first elimination (dropping gleason)
## [1] -1.889365
library(minpack.lm) ## provides nls.lm, the Levenberg-Marquardt optimizer

getPred <- function(parS, xx) {
  ## parS: parameter list
  ## xx: x values (predictors)
  parS$A * sin(xx + parS$B) + parS$C ## model: y = A*sin(x + B) + C
}
residFun <- function(p, observed, xx) observed - getPred(p, xx) ## residual function
parStart <- list(A = 1, B = 3, C = 0) ## starting values
## x and y are the simulated predictor and noisy response (generation not shown)
nls.out <- nls.lm(par = parStart, fn = residFun, observed = y, xx = x)
nls.out
## Nonlinear regression via the Levenberg-Marquardt algorithm
## parameter estimates: -1.03727642432862, 3.18141544265264, -0.0260329085154092
## residual sum-of-squares: 3.783
## reason terminated: Conditions for `info = 1' and `info = 2' both hold.
getPred <- function(parS, xx) {
  parS$A * exp(parS$B * xx) ## model: y = A*exp(B*x)
}
residFun <- function(p, observed, xx) observed - getPred(p, xx)
parStart <- list(A = 3, B = 3) ## starting values
nls.out <- nls.lm(par = parStart, fn = residFun, observed = y, xx = x) ## x, y: simulated data (not shown)
nls.out
## Nonlinear regression via the Levenberg-Marquardt algorithm
## parameter estimates: 0.963577134214599, 1.00769428203048
## residual sum-of-squares: 95.74
## reason terminated: Conditions for `info = 1' and `info = 2' both hold.
How to parametrize the set of splines with knots at \(t_1, \ldots, t_m\)?
Given inputs \(x_1, \ldots, x_n\) and outputs \(y_1, \ldots, y_n\), we consider fitting a function \(f\) using \(k\)th-order splines with given knots at \(t_1, \ldots, t_m\): \[f(x) = \sum_{j=1}^{m+k+1} \beta_j g_j(x),\] where \(\beta_j, 1\le j \le m+k+1\) are coefficients and \(g_j, 1\le j \le m+k+1\) are basis functions for an order-\(k\) spline on the knots \(t_1, \ldots, t_m\) (e.g., the truncated power basis or the B-spline basis).
Define \(G \in \mathbb{R}^{n \times (m+k+1)}\), where \(G_{ij} = g_j(x_i)\).
The solution coefficients \(\hat{\beta} = (\hat{\beta}_1, \ldots, \hat{\beta}_{m+k+1})\) are obtained by least squares:
\[\hat{\beta} = \arg \min_{\beta \in \mathbb{R}^{m+k+1}} \|y - G\beta\|_2^2\]
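A minimal sketch in R using the B-spline basis from the splines package; the data and the knot locations here are illustrative choices, not from the notes:

library(splines)
x <- mtcars$mpg
y <- mtcars$disp
G <- bs(x, knots = quantile(x, c(0.25, 0.50, 0.75)), degree = 3) ## basis matrix, G[i, j] = g_j(x_i)
spline_fit <- lm(y ~ G) ## least squares estimate of the spline coefficients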
The problem with regression splines is that the estimates have high variance at the boundaries of the input domain.
\[\hat{f} = \arg\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b \big(f^{((k+1)/2)}(x)\big)^2 dx\]
- the first term is the least squares loss;
- the second term penalizes \(f\) if its derivative is too wiggly.
For the most common case, the cubic smoothing spline (\(k = 3\)): \[\hat{f} = \arg\min_f \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \int_a^b (f''(x))^2 dx\]
plot(disp ~ mpg, data = mtcars, main = "data(mtcars) & smoothing splines")
cars.spl <- with(mtcars, smooth.spline(mpg, disp)) ## smoothing parameter chosen by generalized cross-validation (GCV)
cars.spl
## Call:
## smooth.spline(x = mpg, y = disp)
##
## Smoothing Parameter spar= 0.9009065 lambda= 0.02109193 (11 iterations)
## Equivalent Degrees of Freedom (Df): 3.211475
## Penalized Criterion (RSS): 61580.33
## GCV: 3828.877