Biostatistical Computing, PHC 6068

Linear Mixed Model

Zhiguang Huo (Caleb)

Wednesday October 28, 2020

From linear model to linear mixed model

\[Y = X^\top\beta + \varepsilon\]

voice pitch data

d = read.csv("https://caleb-huo.github.io/teaching/data/politeness/politeness_data.csv")
str(d)
## 'data.frame':    84 obs. of  5 variables:
##  $ subject  : Factor w/ 6 levels "F1","F2","F3",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ gender   : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
##  $ scenario : int  1 1 2 2 3 3 4 4 5 5 ...
##  $ attitude : Factor w/ 2 levels "inf","pol": 2 1 2 1 2 1 2 1 2 1 ...
##  $ frequency: num  213 204 285 260 204 ...
d$attitude <- as.factor(d$attitude)
d$scenario0 <- as.factor(d$scenario)

voice pitch data

head(d)
##   subject gender scenario attitude frequency scenario0
## 1      F1      F        1      pol     213.3         1
## 2      F1      F        1      inf     204.5         1
## 3      F1      F        2      pol     285.1         2
## 4      F1      F        2      inf     259.7         2
## 5      F1      F        3      pol     203.9         3
## 6      F1      F        3      inf     286.9         3
with(d, table(subject, gender))
##        gender
## subject  F  M
##      F1 14  0
##      F2 14  0
##      F3 14  0
##      M3  0 14
##      M4  0 14
##      M7  0 14
which(apply(d,1,function(x) any(is.na(x)))) ## one subject contails missing data (missing outcome variable)
## [1] 39

visualize the data

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   1.0.1
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
ggplot(aes(attitude, frequency, color = subject), data=d) + facet_grid(~subject) + geom_boxplot() + geom_jitter() + theme_bw()
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 1 rows containing missing values (geom_point).

Fit linear model

\[frequency \sim \mu + politeness + \varepsilon\]

Linear model in R

lmFit <- lm(frequency ~ attitude, data = d)
summary(lmFit)
## 
## Call:
## lm(formula = frequency ~ attitude, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.488  -62.122    9.044   51.178  105.044 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   202.59      10.08  20.107   <2e-16 ***
## attitudepol   -18.23      14.34  -1.272    0.207    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.3 on 81 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01958,    Adjusted R-squared:  0.007475 
## F-statistic: 1.618 on 1 and 81 DF,  p-value: 0.2071

Note:

Problem with linear model

\[frequency_{ijk} \sim \mu + politeness_{k} + \varepsilon_{ijk},\] where \(i\) is a subject index, \(j\) is a scenario index, and \(k\) is the politeness index.

Independence:

data independence

subj_spread <- d %>% spread(attitude, frequency)
subj_spread$scenario0 <- as.factor(subj_spread$scenario)
head(subj_spread)
##   subject gender scenario scenario0   inf   pol
## 1      F1      F        1         1 204.5 213.3
## 2      F1      F        2         2 259.7 285.1
## 3      F1      F        3         3 286.9 203.9
## 4      F1      F        4         4 276.8 250.8
## 5      F1      F        5         5 252.4 231.9
## 6      F1      F        6         6 230.7 181.2

Subject dependence

ggplot(data=subj_spread, aes(pol, inf)) + theme_bw() +
  geom_point(aes(col=subject)) + 
  stat_ellipse(aes(col=subject)) +
  geom_smooth(method="lm")
## Warning: Removed 1 rows containing non-finite values (stat_ellipse).
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

scenario independence

ggplot(data=subj_spread, aes(pol, inf)) + theme_bw() +
  geom_point(aes(col=scenario0)) + 
  stat_ellipse(aes(col=scenario0)) +
  geom_smooth(method="lm")
## Warning: Removed 1 rows containing non-finite values (stat_ellipse).
## Warning: Removed 1 rows containing non-finite values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).

Random effects:

When to use random effects?

Mixed model

Benefit of random effect models

Fit linear mixed model in R

alm <- lm(frequency ~ attitude, data = d)
suppressMessages(library(lme4))
suppressMessages(library(lmerTest)) ## this package helps provide a p-value
tryCatch({lmer(frequency ~ attitude, data = d)}, error=function(e) print(e))
## <simpleError: No random effects terms specified in formula>

Fit linear mixed model in R

rs_subj_reml = lmer(frequency ~ attitude + (1 | subject), data = d)
summary(rs_subj_reml)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: frequency ~ attitude + (1 | subject)
##    Data: d
## 
## REML criterion at convergence: 804.7
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2953 -0.6018 -0.2005  0.4774  3.1772 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 3982     63.10   
##  Residual              851     29.17   
## Number of obs: 83, groups:  subject, 6
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  202.588     26.151   5.150   7.747 0.000501 ***
## attitudepol  -19.376      6.407  76.003  -3.024 0.003399 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## attitudepol -0.121

How to write the linear mixed model mathmatically

\[Y_{ijk} = \mu + \gamma \mathbb{I}(k=pol) + \alpha_i + \varepsilon_{ijk}\]

Further explore rs_subj_reml

coef(alm)
## (Intercept) attitudepol 
##    202.5881    -18.2320
coef(rs_subj_reml)
## $subject
##    (Intercept) attitudepol
## F1    241.1352   -19.37584
## F2    266.8920   -19.37584
## F3    259.5540   -19.37584
## M3    179.0262   -19.37584
## M4    155.6906   -19.37584
## M7    113.2306   -19.37584
## 
## attr(,"class")
## [1] "coef.mer"
AIC(rs_subj_reml)
## [1] 812.7023

Further explore rs_subj_reml

Random Intercepts

Visualization for the random intercept

## Warning: Removed 1 rows containing missing values (geom_point).

Fit linear mixed model in R (REML = FALSE)

rs_subj_ml = lmer(frequency ~ attitude + (1 | subject), REML = FALSE, data = d)
summary(rs_subj_ml)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: frequency ~ attitude + (1 | subject)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##    826.5    836.2   -409.3    818.5       79 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3067 -0.6051 -0.2042  0.4793  3.1931 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 3308.7   57.52   
##  Residual              839.9   28.98   
## Number of obs: 83, groups:  subject, 6
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  202.588     23.905   6.214   8.475 0.000122 ***
## attitudepol  -19.372      6.365  77.003  -3.043 0.003199 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr)
## attitudepol -0.131

REML option

REML represents restricted maximum likelihood

Compare to linear model

summary(lm(frequency ~ attitude, data = d))
## 
## Call:
## lm(formula = frequency ~ attitude, data = d)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -103.488  -62.122    9.044   51.178  105.044 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   202.59      10.08  20.107   <2e-16 ***
## attitudepol   -18.23      14.34  -1.272    0.207    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 65.3 on 81 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01958,    Adjusted R-squared:  0.007475 
## F-statistic: 1.618 on 1 and 81 DF,  p-value: 0.2071

Without adjusting for random effect, attitude won’t be a significant predictor.

Alternative ways to get p-values (likelihood ratio test)

rs_subj_ml = lmer(frequency ~ attitude + (1 | subject), REML = FALSE, data = d)
rs_null_ml = lmer(frequency ~ 1 + (1 | subject), REML = FALSE, data = d)

anova(rs_null_ml, rs_subj_ml)
## Data: d
## Models:
## rs_null_ml: frequency ~ 1 + (1 | subject)
## rs_subj_ml: frequency ~ attitude + (1 | subject)
##            Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)   
## rs_null_ml  3 833.25 840.51 -413.62   827.25                            
## rs_subj_ml  4 826.51 836.18 -409.25   818.51 8.7444      1   0.003106 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

add another covariate gender and its interaction with attitude

summary(lmer(frequency ~ attitude * gender +  (1 | subject), data = d))
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: frequency ~ attitude * gender + (1 | subject)
##    Data: d
## 
## REML criterion at convergence: 778.2
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2334 -0.5105 -0.1551  0.4738  3.0875 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept) 605.4    24.61   
##  Residual             844.1    29.05   
## Number of obs: 83, groups:  subject, 6
## 
## Fixed effects:
##                     Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)          260.686     15.557    4.754  16.757 2.07e-05 ***
## attitudepol          -27.400      8.966   75.009  -3.056  0.00311 ** 
## genderM             -116.195     22.000    4.754  -5.282  0.00376 ** 
## attitudepol:genderM   16.190     12.763   75.018   1.269  0.20852    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) atttdp gendrM
## attitudepol -0.288              
## genderM     -0.707  0.204       
## atttdpl:gnM  0.202 -0.703 -0.286

Two random effects (two random intercepts)

rs_intergen_subjscene_ml = lmer(frequency ~ attitude + gender + (1 | subject) + 
    (1 | scenario), REML = FALSE, data = d)
summary(rs_intergen_subjscene_ml)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: frequency ~ attitude + gender + (1 | subject) + (1 | scenario)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##    807.1    821.6   -397.6    795.1       77 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.2958 -0.6456 -0.0777  0.5449  3.5121 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  scenario (Intercept) 205.2    14.32   
##  subject  (Intercept) 417.1    20.42   
##  Residual             637.4    25.25   
## Number of obs: 83, groups:  scenario, 7; subject, 6
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  256.847     13.828    8.497  18.575 3.55e-08 ***
## attitudepol  -19.722      5.547   70.920  -3.555 0.000677 ***
## genderM     -108.517     17.573    5.928  -6.175 0.000868 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) atttdp
## attitudepol -0.201       
## genderM     -0.635  0.004

Two random effects (intercepts)

coef(rs_intergen_subjscene_ml)
## $scenario
##   (Intercept) attitudepol   genderM
## 1    243.4865   -19.72206 -108.5173
## 2    263.3589   -19.72206 -108.5173
## 3    268.1317   -19.72206 -108.5173
## 4    277.2536   -19.72206 -108.5173
## 5    254.9320   -19.72206 -108.5173
## 6    244.8021   -19.72206 -108.5173
## 7    245.9623   -19.72206 -108.5173
## 
## $subject
##    (Intercept) attitudepol   genderM
## F1    243.3681   -19.72206 -108.5173
## F2    266.9444   -19.72206 -108.5173
## F3    260.2277   -19.72206 -108.5173
## M3    284.3541   -19.72206 -108.5173
## M4    262.0576   -19.72206 -108.5173
## M7    224.1286   -19.72206 -108.5173
## 
## attr(,"class")
## [1] "coef.mer"

Randon intercept and random slope

\[Y_{ijk} = (\mu + \alpha_i) + \gamma \mathbb{I}(k=pol) + \varepsilon_{ijk}\]

\[Y_{ijk} = (\mu + \alpha_i) + (\gamma + \eta_i) \mathbb{I}(k=pol) + \varepsilon_{ijk}\]

Visualization for the random slope

## boundary (singular) fit: see ?isSingular
## Warning: Removed 1 rows containing missing values (geom_point).

Random Slopes in R

politeness.model.rs = lmer(frequency ~  attitude + gender + (1 + attitude | subject), REML = FALSE, data = d)
## boundary (singular) fit: see ?isSingular
summary(politeness.model.rs)
## Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
##   method [lmerModLmerTest]
## Formula: frequency ~ attitude + gender + (1 + attitude | subject)
##    Data: d
## 
##      AIC      BIC   logLik deviance df.resid 
##    820.3    837.2   -403.2    806.3       76 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.3848 -0.5270 -0.2221  0.5179  3.2462 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  subject  (Intercept) 359.959  18.97        
##           attitudepol   1.564   1.25    1.00
##  Residual             839.418  28.97        
## Number of obs: 83, groups:  subject, 6
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  257.501     12.396    6.275  20.774 5.06e-07 ***
## attitudepol  -19.408      6.384   63.996  -3.040 0.003422 ** 
## genderM     -109.826     17.167    6.022  -6.398 0.000677 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) atttdp
## attitudepol -0.205       
## genderM     -0.692  0.004
## convergence code: 0
## boundary (singular) fit: see ?isSingular

Random Slopes in R

coef(politeness.model.rs)
## $subject
##    (Intercept) attitudepol   genderM
## F1    244.1800   -20.28634 -109.8262
## F2    266.1361   -18.83927 -109.8262
## F3    259.8312   -19.25481 -109.8262
## M3    283.4803   -17.69615 -109.8262
## M4    263.7697   -18.99523 -109.8262
## M7    227.6097   -21.37845 -109.8262
## 
## attr(,"class")
## [1] "coef.mer"

Is random slope necessary

rs_gen_subjscene_con_reml = lmer(frequency ~ attitude + gender + (1 + attitude | 
    subject), REML = TRUE, data = d)
## boundary (singular) fit: see ?isSingular
rs_gen_subjscene_reml = lmer(frequency ~ attitude + gender + (1 | subject), REML = TRUE, data = d)

anova(rs_gen_subjscene_reml, rs_gen_subjscene_con_reml)
## refitting model(s) with ML (instead of REML)
## Data: d
## Models:
## rs_gen_subjscene_reml: frequency ~ attitude + gender + (1 | subject)
## rs_gen_subjscene_con_reml: frequency ~ attitude + gender + (1 + attitude | subject)
##                           Df    AIC    BIC  logLik deviance  Chisq Chi Df
## rs_gen_subjscene_reml      5 816.34 828.43 -403.17   806.34              
## rs_gen_subjscene_con_reml  7 820.31 837.24 -403.15   806.31 0.0282      2
##                           Pr(>Chisq)
## rs_gen_subjscene_reml               
## rs_gen_subjscene_con_reml      0.986

Beyond linear mixed model

If the outcome variable is

Function lmer is used to fit linear mixed models, function glmer is used to fit generalized (non-Gaussian) linear mixed models.

Another example sleep study

sleepStudy <- read.table("https://caleb-huo.github.io/teaching/data/sleep/sleepstudy.txt", header=T, row.names = 1)
sleepStudy$Subject <- as.factor(sleepStudy$Subject)

head(sleepStudy, n=10)
##    Reaction Days Subject
## 1  249.5600    0     308
## 2  258.7047    1     308
## 3  250.8006    2     308
## 4  321.4398    3     308
## 5  356.8519    4     308
## 6  414.6901    5     308
## 7  382.2038    6     308
## 8  290.1486    7     308
## 9  430.5853    8     308
## 10 466.3535    9     308
str(sleepStudy)
## 'data.frame':    180 obs. of  3 variables:
##  $ Reaction: num  250 259 251 321 357 ...
##  $ Days    : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ Subject : Factor w/ 18 levels "308","309","310",..: 1 1 1 1 1 1 1 1 1 1 ...

Sleep deprivation data

Longitudinal data

Data exploratory

ggplot(aes(Days, Reaction, col=Subject), data = sleepStudy) + geom_point() + geom_smooth(method="lm") + facet_wrap(~Subject)

Fit linear mixed model in R

fm1 <- lmer(Reaction ~ Days + (1|Subject), sleepstudy)
summary(fm1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Reaction ~ Days + (1 | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1786.5
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.2257 -0.5529  0.0109  0.5188  4.2506 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  Subject  (Intercept) 1378.2   37.12   
##  Residual              960.5   30.99   
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept) 251.4051     9.7467  22.8102   25.79   <2e-16 ***
## Days         10.4673     0.8042 161.0000   13.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.371

Fit linear mixed model in R

fm1 <- lmer(Reaction ~ Days + (1 + Days|Subject), sleepstudy)
#fm1 <- lmer(Reaction ~ Days + (Days|Subject), sleepstudy) ## same as this one
summary(fm1)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: Reaction ~ Days + (1 + Days | Subject)
##    Data: sleepstudy
## 
## REML criterion at convergence: 1743.6
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.9536 -0.4634  0.0231  0.4633  5.1793 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr
##  Subject  (Intercept) 611.90   24.737       
##           Days         35.08    5.923   0.07
##  Residual             654.94   25.592       
## Number of obs: 180, groups:  Subject, 18
## 
## Fixed effects:
##             Estimate Std. Error      df t value Pr(>|t|)    
## (Intercept)  251.405      6.824  17.005  36.843  < 2e-16 ***
## Days          10.467      1.546  16.995   6.771 3.27e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##      (Intr)
## Days -0.138

coef

coef(fm1)
## $Subject
##     (Intercept)       Days
## 308    253.6626 19.6665597
## 309    211.0108  1.8467699
## 310    212.4488  5.0177063
## 330    275.0940  5.6531411
## 331    273.6636  7.3976093
## 332    260.4439 10.1952325
## 333    268.2441 10.2438881
## 334    244.1731 11.5417935
## 335    251.0724 -0.2851939
## 337    286.2916 19.0963068
## 349    226.1971 11.6403856
## 350    238.3357 17.0815045
## 351    255.9828  7.4520035
## 352    272.2666 14.0036922
## 369    254.6802 11.3395736
## 370    225.7940 15.2895377
## 371    252.2122  9.4791130
## 372    263.7185 11.7515240
## 
## attr(,"class")
## [1] "coef.mer"

References:

Credits give to: