Biostatistical Computing, PHC 6068

Generalized Linear Model

Zhiguang Huo (Caleb)

Monday, October 25, 2021

Outline

Review: linear regression model

\[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + \ldots + \beta_pX_p + \varepsilon, \text{with }\varepsilon \sim N(0, \sigma^2)\]

Interpretation of linear model coefficients:

Limitation of linear model:

What about other data types?

Generalize the “linear regression model” (1)

\[Y = X^\top\beta + \varepsilon, \text{with }\varepsilon \sim N(0, \sigma^2)\]

Generalize the “linear regression model” (2)

We can generalize these conditions of the linear regression model

Commonly used distribution \(f\) in GLM

logistic regression motivating example

X <- matrix(c(189, 104, 10845, 10933), nrow=2,
            dimnames=list(Treatment=c("Placebo","Aspirin"), 
                          "Myocardial Infarction"=c("Yes", "No")))
X
##          Myocardial Infarction
## Treatment Yes    No
##   Placebo 189 10845
##   Aspirin 104 10933
odds.ratio0 <- function(X){
  # odds ratio of a 2x2 table: (X[1,1] * X[2,2]) / (X[1,2] * X[2,1])
  result <- X[1,1]*X[2,2]/(X[1,2]*X[2,1])
  return(result)
}
odds.ratio0(X)
## [1] 1.832054
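As a side note (a minimal sketch, not from the original slides), a Wald 95% confidence interval for the odds ratio can be obtained from the usual standard error of the log odds ratio:

se_logOR <- sqrt(sum(1 / X))  # sqrt(1/a + 1/b + 1/c + 1/d)
exp(log(odds.ratio0(X)) + c(-1, 1) * 1.96 * se_logOR)  # roughly (1.44, 2.33)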

logistic regression motivating example

| ID | Myocardial Infarction (Yes/No) | Treatment (1/0) |
|----|--------------------------------|-----------------|
| 1  | 1                              | 0               |
| 2  | 0                              | 1               |
| 3  | 1                              | 1               |
| 4  | 1                              | 0               |
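Fitting a logistic regression to the grouped 2x2 table from the previous slide recovers the same odds ratio; a hedged sketch (the treatment factor and its reference level below are illustrative choices, not from the slides):

treatment <- factor(c("Placebo", "Aspirin"), levels = c("Aspirin", "Placebo"))
fit <- glm(cbind(X[, "Yes"], X[, "No"]) ~ treatment, family = binomial())
exp(coef(fit)["treatmentPlacebo"])  # matches odds.ratio0(X), about 1.83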

logistic regression

Prostate cancer data

The data are from the book The Elements of Statistical Learning

library(ElemStatLearn)
str(prostate)
## 'data.frame':    97 obs. of  10 variables:
##  $ lcavol : num  -0.58 -0.994 -0.511 -1.204 0.751 ...
##  $ lweight: num  2.77 3.32 2.69 3.28 3.43 ...
##  $ age    : int  50 58 74 58 62 50 64 58 47 63 ...
##  $ lbph   : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ svi    : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ lcp    : num  -1.39 -1.39 -1.39 -1.39 -1.39 ...
##  $ gleason: int  6 6 7 6 6 6 6 6 6 6 ...
##  $ pgg45  : int  0 0 20 0 0 0 0 0 0 0 ...
##  $ lpsa   : num  -0.431 -0.163 -0.163 -0.163 0.372 ...
##  $ train  : logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
prostate$train <- NULL

Exponential family (1)

Exponential family (2)

The role of \(b(\theta)\) in the exponential family

Comments:

Prove \(E(Y|X) = b'(\theta)\)

Prove \(Var(Y|X) = a(\phi) b''(\theta)\)
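Sketch of the standard argument, assuming the exponential family density \(f(y; \theta, \phi) = \exp\big\{\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\big\}\) and that we may differentiate under the integral sign. Differentiating \(\int f(y; \theta, \phi) dy = 1\) with respect to \(\theta\):

\[\int \frac{y - b'(\theta)}{a(\phi)} f(y; \theta, \phi) dy = 0 \Rightarrow E(Y|X) = b'(\theta)\]

Differentiating once more:

\[\int \Big[ \Big(\frac{y - b'(\theta)}{a(\phi)}\Big)^2 - \frac{b''(\theta)}{a(\phi)} \Big] f(y; \theta, \phi) dy = 0 \Rightarrow Var(Y|X) = a(\phi) b''(\theta)\]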

interpret logistic regression coefficient \(\beta\)

\[\hat{OR} = \frac{\hat{odds}|_{x_i=1}}{\hat{odds}|_{x_i=0}} = \frac{\frac{P(y_i=1)}{P(y_i=0)}|_{x_i=1}}{\frac{P(y_i=1)}{P(y_i=0)}|_{x_i=0}} = \frac{\frac{\hat{\mu}_i}{1 - \hat{\mu}_i}|_{x_i=1}}{\frac{\hat{\mu}_i}{1 - \hat{\mu}_i}|_{x_i=0}} = \frac{\exp(\hat{\beta}_0 + 1 \times \hat{\beta}_1)}{\exp(\hat{\beta}_0 + 0 \times \hat{\beta}_1)} = \exp(\hat{\beta}_1) \]

interpret logistic regression coefficient \(\beta\) adjusting for other covariates

\[\hat{OR} = \frac{\hat{odds}|_{x_i=1}}{\hat{odds}|_{x_i=0}} = \frac{\frac{\hat{\mu}_i}{1 - \hat{\mu}_i}|_{x_i=1}}{\frac{\hat{\mu}_i}{1 - \hat{\mu}_i}|_{x_i=0}} = \frac{\exp(\hat{\beta}_0 + 1 \times \hat{\beta}_1 + Z_i^\top\gamma)}{\exp(\hat{\beta}_0 + 0 \times \hat{\beta}_1 + Z_i^\top\gamma)} = \exp(\hat{\beta}_1) \]

logistic regression in R

glm_binomial3 <- glm(svi ~ lcavol + lweight + age, data = prostate,  family = binomial())
summary(glm_binomial3) # display results
## 
## Call:
## glm(formula = svi ~ lcavol + lweight + age, family = binomial(), 
##     data = prostate)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.80253  -0.51277  -0.20249  -0.03868   2.29319  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -6.26605    3.85789  -1.624    0.104    
## lcavol       1.99773    0.46970   4.253 2.11e-05 ***
## lweight     -0.24193    0.91732  -0.264    0.792    
## age          0.03203    0.05097   0.628    0.530    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 101.353  on 96  degrees of freedom
## Residual deviance:  63.742  on 93  degrees of freedom
## AIC: 71.742
## 
## Number of Fisher Scoring iterations: 6
confint(glm_binomial3) # 95% CI for the coefficients
## Waiting for profiling to be done...
##                    2.5 %    97.5 %
## (Intercept) -14.28625935 1.0304960
## lcavol        1.18463130 3.0485710
## lweight      -2.09281024 1.5785799
## age          -0.06780234 0.1336198
exp(coef(glm_binomial3)) # exponentiated coefficients
## (Intercept)      lcavol     lweight         age 
## 0.001899712 7.372329880 0.785110058 1.032546950
exp(confint(glm_binomial3)) # 95% CI for exponentiated coefficients
## Waiting for profiling to be done...
##                    2.5 %    97.5 %
## (Intercept) 6.245344e-07  2.802456
## lcavol      3.269481e+00 21.085191
## lweight     1.233400e-01  4.848066
## age         9.344452e-01  1.142958
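Once fitted, the model yields predicted probabilities on the response scale; a minimal sketch (not part of the original output):

pred_prob <- predict(glm_binomial3, type = "response")  # P(svi = 1) for each patient
head(round(pred_prob, 3))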

Model selection for logistic regression (also applies to other GLMs)

step(glm(svi ~ ., data = prostate,  family = binomial()))
## Start:  AIC=55.72
## svi ~ lcavol + lweight + age + lbph + lcp + gleason + pgg45 + 
##     lpsa
## 
##           Df Deviance    AIC
## - lcavol   1   37.760 53.760
## - pgg45    1   37.772 53.772
## - lweight  1   37.782 53.782
## - gleason  1   37.794 53.794
## - lbph     1   38.248 54.248
## - age      1   38.907 54.907
## <none>         37.721 55.721
## - lcp      1   48.013 64.013
## - lpsa     1   49.249 65.249
## 
## Step:  AIC=53.76
## svi ~ lweight + age + lbph + lcp + gleason + pgg45 + lpsa
## 
##           Df Deviance    AIC
## - pgg45    1   37.809 51.809
## - gleason  1   37.839 51.839
## - lweight  1   37.857 51.857
## - lbph     1   38.248 52.248
## - age      1   38.912 52.912
## <none>         37.760 53.760
## - lcp      1   50.549 64.549
## - lpsa     1   51.558 65.558
## 
## Step:  AIC=51.81
## svi ~ lweight + age + lbph + lcp + gleason + lpsa
## 
##           Df Deviance    AIC
## - lweight  1   37.898 49.898
## - gleason  1   38.076 50.076
## - lbph     1   38.398 50.398
## - age      1   38.920 50.920
## <none>         37.809 51.809
## - lpsa     1   52.945 64.945
## - lcp      1   54.318 66.318
## 
## Step:  AIC=49.9
## svi ~ age + lbph + lcp + gleason + lpsa
## 
##           Df Deviance    AIC
## - gleason  1   38.102 48.102
## - lbph     1   38.758 48.758
## - age      1   39.002 49.002
## <none>         37.898 49.898
## - lpsa     1   53.462 63.462
## - lcp      1   54.379 64.379
## 
## Step:  AIC=48.1
## svi ~ age + lbph + lcp + lpsa
## 
##        Df Deviance    AIC
## - lbph  1   38.959 46.959
## - age   1   39.066 47.066
## <none>      38.102 48.102
## - lpsa  1   53.509 61.509
## - lcp   1   57.700 65.700
## 
## Step:  AIC=46.96
## svi ~ age + lcp + lpsa
## 
##        Df Deviance    AIC
## - age   1   39.653 45.653
## <none>      38.959 46.959
## - lpsa  1   54.840 60.840
## - lcp   1   59.815 65.815
## 
## Step:  AIC=45.65
## svi ~ lcp + lpsa
## 
##        Df Deviance    AIC
## <none>      39.653 45.653
## - lpsa  1   55.355 59.355
## - lcp   1   60.781 64.781
## 
## Call:  glm(formula = svi ~ lcp + lpsa, family = binomial(), data = prostate)
## 
## Coefficients:
## (Intercept)          lcp         lpsa  
##      -8.532        1.343        2.198  
## 
## Degrees of Freedom: 96 Total (i.e. Null);  94 Residual
## Null Deviance:       101.4 
## Residual Deviance: 39.65     AIC: 45.65
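step() uses AIC (penalty k = 2) by default; as a variation (a sketch, not from the slides), setting k = log(n) performs BIC-based backward selection, which penalizes model size more heavily:

step(glm(svi ~ ., data = prostate, family = binomial()), k = log(nrow(prostate)), trace = 0)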

Estimation for \(\beta\)
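The derivation is not reproduced here; in brief, \(\beta\) is estimated by maximum likelihood via Fisher scoring / iteratively reweighted least squares (IRLS), which is what the "Number of Fisher Scoring iterations" line in the glm() output refers to. A minimal hand-rolled sketch for the logistic case (irls_logistic is our illustrative name, not part of any package; glm() does this internally with more care):

irls_logistic <- function(X, y, n_iter = 25, tol = 1e-8) {
  beta <- rep(0, ncol(X))
  for (i in seq_len(n_iter)) {
    eta <- drop(X %*% beta)        # linear predictor
    mu  <- 1 / (1 + exp(-eta))     # mean via the logit link
    W   <- mu * (1 - mu)           # working weights
    z   <- eta + (y - mu) / W      # working response
    beta_new <- solve(t(X) %*% (W * X), t(X) %*% (W * z))  # weighted LS step
    if (max(abs(beta_new - beta)) < tol) {beta <- beta_new; break}
    beta <- beta_new
  }
  drop(beta)
}
Xmat <- model.matrix(~ lcavol + lweight + age, data = prostate)
irls_logistic(Xmat, prostate$svi)  # should match coef(glm_binomial3)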

Multinomial Logistic Regression

When the outcome categories \(1,2,\ldots,J\) are unordered, a multinomial logit model defines \(J - 1\) logits with the form (treating \(j=1\) as the reference category):

\[\log (odds_{j1}) = \log(\frac{\mu_{ij}}{\mu_{i1}}) = \log(\frac{P(Y_i = j)}{P(Y_i = 1)}) = \beta_{0j} + \sum_k \beta_{kj} X_{ik}\]

Multinomial Logistic Regression example

ml <- read.csv("https://caleb-huo.github.io/teaching/data/mlogistic_program/hsbdemo.csv",row.names=1)
knitr::kable(head(ml))
| id  | female | ses    | schtyp | prog     | read | write | math | science | socst | honors       | awards | cid |
|-----|--------|--------|--------|----------|------|-------|------|---------|-------|--------------|--------|-----|
| 45  | female | low    | public | vocation | 34   | 35    | 41   | 29      | 26    | not enrolled | 0      | 1   |
| 108 | male   | middle | public | general  | 34   | 33    | 41   | 36      | 36    | not enrolled | 0      | 1   |
| 15  | male   | high   | public | vocation | 39   | 39    | 44   | 26      | 42    | not enrolled | 0      | 1   |
| 67  | male   | low    | public | vocation | 37   | 37    | 42   | 33      | 32    | not enrolled | 0      | 1   |
| 153 | male   | middle | public | vocation | 39   | 31    | 40   | 39      | 51    | not enrolled | 0      | 1   |
| 51  | female | high   | public | general  | 42   | 36    | 42   | 31      | 39    | not enrolled | 0      | 1   |

Multinomial Logistic Regression result

ml$prog2 <- relevel(factor(ml$prog), ref = "vocation")
ml$ses2 <- relevel(factor(ml$ses), ref = "low")

library("nnet")
test <- multinom(prog2 ~ ses2 + write, data = ml)
## # weights:  15 (8 variable)
## initial  value 219.722458 
## iter  10 value 179.993001
## final  value 179.981726 
## converged
summary(test)
## Call:
## multinom(formula = prog2 ~ ses2 + write, data = ml)
## 
## Coefficients:
##          (Intercept)   ses2high ses2middle      write
## academic   -5.218216  0.9826776 -0.2913902 0.11360290
## general    -2.366021 -0.1801588 -0.8246874 0.05567435
## 
## Std. Errors:
##          (Intercept)  ses2high ses2middle      write
## academic    1.163550 0.5955674  0.4763737 0.02221991
## general     1.174249 0.6484555  0.4901231 0.02333137
## 
## Residual Deviance: 359.9635 
## AIC: 375.9635
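Exponentiated coefficients are relative risk ratios against the reference outcome, and fitted class probabilities are available via predict(); a short sketch (not part of the original output):

exp(coef(test))                      # relative risk ratios vs. "vocation"
head(predict(test, type = "probs"))  # fitted probabilities; rows sum to 1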

Multinomial Logistic Regression interpretation

  1. A one-unit increase in write increases the log odds of being in the academic program vs. the vocation program by 0.11
  2. A one-unit increase in write increases the log odds of being in the general program vs. the vocation program by 0.056
  3. The log odds of being in the academic program rather than the vocation program increase by 0.98 when moving from ses = "low" to ses = "high".

ordinal logistic regression

\[\log (\frac{P(Y > j | X)}{P(Y \le j | X)}) = \log (\frac{P(Y > j | X)}{1 - P(Y > j | X)}) = \alpha_j + \beta^T X\] For all \(j = 1, \ldots, J - 1\).

\[\log (\frac{P(Y > 1 | X)}{P(Y \le 1 | X)}) = \log (\frac{P(Y = 2 | X) + P(Y = 3 | X)}{P(Y = 1 | X)}) = \alpha_1 + \beta^T X\]

\[\log (\frac{P(Y > 2 | X)}{P(Y \le 2 | X)}) = \log (\frac{P(Y = 3 | X)}{P(Y = 1 | X) + P(Y = 2 | X)}) = \alpha_2 + \beta^T X\]

Ordinal Logistic Regression example

dat <- read.csv("https://caleb-huo.github.io/teaching/data/ologit/ologit.csv")
knitr::kable(head(dat))
| apply           | pared | public | gpa  |
|-----------------|-------|--------|------|
| very likely     | 0     | 0      | 3.26 |
| somewhat likely | 1     | 0      | 3.21 |
| unlikely        | 1     | 1      | 3.94 |
| somewhat likely | 0     | 0      | 2.81 |
| somewhat likely | 0     | 0      | 2.53 |
| unlikely        | 0     | 1      | 2.59 |

Ordinal Logistic Regression result

library(MASS)
dat$apply2 <- relevel(factor(dat$apply), ref = "unlikely")
m <- polr(apply2 ~ pared + public + gpa, data = dat, Hess=TRUE)
summary(m)
## Call:
## polr(formula = apply2 ~ pared + public + gpa, data = dat, Hess = TRUE)
## 
## Coefficients:
##           Value Std. Error t value
## pared   1.04769     0.2658  3.9418
## public -0.05879     0.2979 -0.1974
## gpa     0.61594     0.2606  2.3632
## 
## Intercepts:
##                             Value   Std. Error t value
## unlikely|somewhat likely     2.2039  0.7795     2.8272
## somewhat likely|very likely  4.2994  0.8043     5.3453
## 
## Residual Deviance: 717.0249 
## AIC: 727.0249
exp(coef(m))
##     pared    public       gpa 
## 2.8510579 0.9429088 1.8513972
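Fitted category probabilities can be extracted directly; a short sketch (not part of the original output):

head(predict(m, type = "probs"))  # P(unlikely), P(somewhat likely), P(very likely)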

Ordinal Logistic Regression interpretation

  1. For a one-unit increase in parental education, from 0 (low) to 1 (high), the odds of "very likely" applying versus "somewhat likely" or "unlikely" applying combined are 2.85 times greater, holding other variables constant.

  2. By the proportional odds assumption, the odds of "very likely" or "somewhat likely" applying versus "unlikely" applying are also 2.85 times greater.

  3. For gpa, when a student's gpa increases by 1 unit, the odds of moving from "unlikely" applying to "somewhat likely" or "very likely" applying (or from the lower and middle categories to the high category) are multiplied by 1.85.

GLM for count data

Poisson Model (Components of GLM for count data)

Poisson distribution

Poisson Model

fit Poisson Model in R

library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
knitr::kable(head(bioChemists))
| art | fem   | mar     | kid5 | phd  | ment |
|-----|-------|---------|------|------|------|
| 0   | Men   | Married | 0    | 2.52 | 7    |
| 0   | Women | Single  | 0    | 2.05 | 6    |
| 0   | Women | Single  | 0    | 3.75 | 6    |
| 0   | Men   | Married | 1    | 1.18 | 3    |
| 0   | Women | Single  | 0    | 3.75 | 26   |
| 0   | Women | Married | 2    | 3.59 | 2    |

fit Poisson Model in R

glm_poisson <- glm(art ~ . , data = bioChemists,  family = poisson())
summary(glm_poisson) # display results
## 
## Call:
## glm(formula = art ~ ., family = poisson(), data = bioChemists)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5672  -1.5398  -0.3660   0.5722   5.4467  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.304617   0.102981   2.958   0.0031 ** 
## femWomen    -0.224594   0.054613  -4.112 3.92e-05 ***
## marMarried   0.155243   0.061374   2.529   0.0114 *  
## kid5        -0.184883   0.040127  -4.607 4.08e-06 ***
## phd          0.012823   0.026397   0.486   0.6271    
## ment         0.025543   0.002006  12.733  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1817.4  on 914  degrees of freedom
## Residual deviance: 1634.4  on 909  degrees of freedom
## AIC: 3314.1
## 
## Number of Fisher Scoring iterations: 5
confint(glm_poisson) # 95% CI for the coefficients
## Waiting for profiling to be done...
##                   2.5 %      97.5 %
## (Intercept)  0.10155469  0.50526581
## femWomen    -0.33193995 -0.11781821
## marMarried   0.03520230  0.27584836
## kid5        -0.26422746 -0.10689876
## phd         -0.03881122  0.06467552
## ment         0.02154163  0.02940716
exp(coef(glm_poisson)) # exponentiated coefficients
## (Intercept)    femWomen  marMarried        kid5         phd        ment 
##   1.3561053   0.7988403   1.1679422   0.8312018   1.0129051   1.0258718
exp(confint(glm_poisson)) # 95% CI for exponentiated coefficients
## Waiting for profiling to be done...
##                 2.5 %    97.5 %
## (Intercept) 1.1068905 1.6574260
## femWomen    0.7175304 0.8888576
## marMarried  1.0358292 1.3176480
## kid5        0.7677989 0.8986166
## phd         0.9619323 1.0668128
## ment        1.0217753 1.0298438

Interpretation of \(\beta\) for Poisson model

\(\beta_1\) is the difference in the log of expected counts when \(X\) increases by 1 unit.

\(\exp (\beta_1)\) is the multiplicative effect on the mean of \(Y\) when \(X\) increases by 1 unit.
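As a concrete check using the Poisson fit above (the ment coefficient from the earlier summary):

exp(coef(glm_poisson)["ment"])  # about 1.026: each additional mentor article multiplies the expected count by ~2.6%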

Model selection for Poisson regression (also applies to other GLMs)

step(glm_poisson)
## Start:  AIC=3314.11
## art ~ fem + mar + kid5 + phd + ment
## 
##        Df Deviance    AIC
## - phd   1   1634.6 3312.3
## <none>      1634.4 3314.1
## - mar   1   1640.8 3318.5
## - fem   1   1651.5 3329.2
## - kid5  1   1656.5 3334.2
## - ment  1   1766.2 3444.0
## 
## Step:  AIC=3312.35
## art ~ fem + mar + kid5 + ment
## 
##        Df Deviance    AIC
## <none>      1634.6 3312.3
## - mar   1   1640.8 3316.6
## - fem   1   1651.8 3327.5
## - kid5  1   1656.7 3332.4
## - ment  1   1776.7 3452.5
## 
## Call:  glm(formula = art ~ fem + mar + kid5 + ment, family = poisson(), 
##     data = bioChemists)
## 
## Coefficients:
## (Intercept)     femWomen   marMarried         kid5         ment  
##     0.34517     -0.22530      0.15218     -0.18499      0.02576  
## 
## Degrees of Freedom: 914 Total (i.e. Null);  910 Residual
## Null Deviance:       1817 
## Residual Deviance: 1635  AIC: 3312

Poisson regression for rate data

Modeling the rate

burnData <- read.csv("https://caleb-huo.github.io/teaching/data/Burn/burn.csv", row.names = 1)
knitr::kable(head(burnData))
| treatment | gender | race | percent | head | buttock | trunk | upper_leg | lower_leg | resp_tract | type | numfup | infection |
|-----------|--------|------|---------|------|---------|-------|-----------|-----------|------------|------|--------|-----------|
| 0         | 0      | 0    | 15      | 0    | 0       | 1     | 1         | 0         | 0          | 2    | 12     | 0         |
| 0         | 0      | 1    | 20      | 0    | 0       | 1     | 0         | 0         | 0          | 4    | 9      | 0         |
| 0         | 0      | 1    | 15      | 0    | 0       | 0     | 1         | 1         | 0          | 2    | 7      | 1         |
| 0         | 0      | 0    | 20      | 1    | 0       | 1     | 0         | 0         | 0          | 2    | 29     | 0         |
| 0         | 0      | 1    | 70      | 1    | 1       | 1     | 1         | 0         | 0          | 2    | 4      | 1         |
| 0         | 0      | 1    | 20      | 1    | 0       | 1     | 0         | 0         | 0          | 4    | 8      | 1         |

Modeling the rate

glm_poisson_rate <- glm(infection ~ . - numfup + offset(log(numfup)), data = burnData,  family = poisson())
summary(glm_poisson_rate) # display results
## 
## Call:
## glm(formula = infection ~ . - numfup + offset(log(numfup)), family = poisson(), 
##     data = burnData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0008  -0.8023  -0.5350   1.0419   2.5011  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.603081   1.201833  -4.662 3.13e-06 ***
## treatment   -0.622350   0.316669  -1.965   0.0494 *  
## gender      -0.724283   0.396879  -1.825   0.0680 .  
## race         2.155901   1.016616   2.121   0.0340 *  
## percent      0.003402   0.009854   0.345   0.7299    
## head        -0.023392   0.353237  -0.066   0.9472    
## buttock      0.612703   0.421929   1.452   0.1465    
## trunk        0.058203   0.500184   0.116   0.9074    
## upper_leg   -0.472179   0.373669  -1.264   0.2064    
## lower_leg   -0.299164   0.387843  -0.771   0.4405    
## resp_tract  -0.042332   0.364401  -0.116   0.9075    
## type        -0.045439   0.171668  -0.265   0.7912    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 207.46  on 153  degrees of freedom
## Residual deviance: 184.46  on 142  degrees of freedom
## AIC: 304.46
## 
## Number of Fisher Scoring iterations: 7

Poisson Overdispersion

bioChemists data

mean(bioChemists$art)
## [1] 1.692896
var(bioChemists$art)
## [1] 3.709742
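A quick diagnostic (a sketch, not from the slides) is the Pearson dispersion statistic from the Poisson fit; values well above 1 suggest the Poisson mean-variance assumption is violated:

sum(residuals(glm_poisson, type = "pearson")^2) / df.residual(glm_poisson)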

Negative Binomial Regression

\[Y|\lambda \sim Poisson(\lambda)\] \[\lambda \sim gamma(\alpha, \beta)\] \[Y \sim NB(\alpha, 1/(\beta + 1))\]

Negative Binomial Distribution

The canonical link for the negative binomial distribution is rather complicated and hard to interpret, so it is rarely used. Instead, to facilitate comparisons with the Poisson generalized linear model, a log link is typically used.

Negative Binomial Distribution

\[b''(\theta) = \frac{\partial^2 b}{\partial p^2} \times (\frac{\partial p}{\partial \theta})^2 + \frac{\partial b}{\partial p} \times \frac{\partial^2 p}{\partial \theta^2} = \frac{r(1-p)}{p^2} \]

Equivalently, in terms of the mean \(\mu = r(1-p)/p\) and the dispersion \(\alpha = 1/r\),

\[b''(\theta) = \frac{\partial^2 b}{\partial p^2} \times (\frac{\partial p}{\partial \theta})^2 + \frac{\partial b}{\partial p} \times \frac{\partial^2 p}{\partial \theta^2} = \mu + \alpha \mu^2 \] The variance is always larger than the mean for the negative binomial.

NB reduces to Poisson when \(\alpha = 0\)

Negative binomial example 1

library(MASS)
glm_nb <- glm.nb(art ~ . , data = bioChemists)
summary(glm_nb) # display results
## 
## Call:
## glm.nb(formula = art ~ ., data = bioChemists, init.theta = 2.264387695, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1678  -1.3617  -0.2806   0.4476   3.4524  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.256144   0.137348   1.865 0.062191 .  
## femWomen    -0.216418   0.072636  -2.979 0.002887 ** 
## marMarried   0.150489   0.082097   1.833 0.066791 .  
## kid5        -0.176415   0.052813  -3.340 0.000837 ***
## phd          0.015271   0.035873   0.426 0.670326    
## ment         0.029082   0.003214   9.048  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.2644) family taken to be 1)
## 
##     Null deviance: 1109.0  on 914  degrees of freedom
## Residual deviance: 1004.3  on 909  degrees of freedom
## AIC: 3135.9
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.264 
##           Std. Err.:  0.271 
## 
##  2 x log-likelihood:  -3121.917
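To compare the Poisson and negative binomial fits formally, a likelihood-ratio test of \(H_0: \alpha = 0\) can be used (a sketch, not from the slides; since the null lies on the boundary of the parameter space, the p-value is halved):

lr_stat <- as.numeric(2 * (logLik(glm_nb) - logLik(glm_poisson)))
pchisq(lr_stat, df = 1, lower.tail = FALSE) / 2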

Negative binomial example 2

glm_nb_rate <- glm.nb(infection ~ . - numfup + offset(log(numfup)), data = burnData)
## Warning in glm.nb(infection ~ . - numfup + offset(log(numfup)), data =
## burnData): alternation limit reached
summary(glm_nb_rate)
## 
## Call:
## glm.nb(formula = infection ~ . - numfup + offset(log(numfup)), 
##     data = burnData, init.theta = 0.6478173549, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7731  -0.8564  -0.5674   0.5065   2.3192  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.285673   1.300638  -4.064 4.83e-05 ***
## treatment   -0.840907   0.371826  -2.262   0.0237 *  
## gender      -1.037906   0.464514  -2.234   0.0255 *  
## race         2.620297   1.081204   2.423   0.0154 *  
## percent      0.005769   0.011903   0.485   0.6279    
## head        -0.208180   0.410433  -0.507   0.6120    
## buttock      0.899329   0.500276   1.798   0.0722 .  
## trunk        0.339320   0.565532   0.600   0.5485    
## upper_leg   -0.692250   0.446461  -1.551   0.1210    
## lower_leg   -0.180517   0.428152  -0.422   0.6733    
## resp_tract  -0.191947   0.437243  -0.439   0.6607    
## type        -0.149369   0.201242  -0.742   0.4579    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.6478) family taken to be 1)
## 
##     Null deviance: 149.78  on 153  degrees of freedom
## Residual deviance: 126.24  on 142  degrees of freedom
## AIC: 299.93
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.648 
##           Std. Err.:  0.246 
## Warning while fitting theta: alternation limit reached 
## 
##  2 x log-likelihood:  -273.928

Interpretation Using the Count (or Rate)

For a change of \(\Delta\) in \(x\), the expected count (or rate) is multiplied by \(\exp(\gamma \Delta)\), holding the other covariates \(Z = z\) constant.
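For example, using the rate model fitted above:

exp(coef(glm_poisson_rate)["treatment"])  # about 0.54: the treatment roughly halves the expected infection rate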

zero inflated Poisson regression

\[ P(y_i = j)= \begin{cases} \pi_i + (1 - \pi_i) \exp (-\mu_i), & \mbox{if } j=0\\ (1 - \pi_i) \frac{\mu_i^{j} \exp(-\mu_i)}{j!}, & \mbox{if } j>0 \end{cases} \]

- The Poisson component, with design matrix \(X\) and coefficient \(\beta\):

\[\log(\mu_i) = \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip}\]

- The zero proportion component, with design matrix \(Z\) and coefficient \(\gamma\):

\[\pi_i = \frac{\exp(\gamma_0 + \gamma_1 z_{i1} + \ldots + \gamma_p z_{ip})}{1 + \exp(\gamma_0 + \gamma_1 z_{i1} + \ldots + \gamma_p z_{ip})}\]

ZIP in R

fm_pois <- glm(art ~ ., data = bioChemists, family = poisson) 
## without inflation, regular Poisson model

fm_zip <- zeroinfl(art ~ . | 1, data = bioChemists) 
## with simple inflation (intercept only for zero component)

fm_zip2 <- zeroinfl(art ~ . | ., data = bioChemists) 
## inflation with all regressors
## ("art ~ . | ." is "art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment")

ZINB in R

fm_nb <- MASS::glm.nb(art ~ ., data = bioChemists)
## without inflation

fm_zinb <- zeroinfl(art ~ . | 1, data = bioChemists, dist = "negbin")
## with simple inflation (no regressors for zero component)

fm_zinb2 <- zeroinfl(art ~ . | ., data = bioChemists, dist = "negbin")
## inflation with regressors
## ("art ~ . | ." is "art ~ fem + mar + kid5 + phd + ment | fem + mar + kid5 + phd + ment")

Further extensions

We can extend these conditions of the linear regression model

What about a nonlinear systematic component?