Chapter 8 Binomial GLM

A common response variable in ecological data sets is the binary variable: we observe a phenomenon \(Y\) or its “absence”. For example, species presence/absence is frequently recorded in ecological monitoring studies. We usually wish to determine whether a species’ presence is affected by environmental variables. Other examples include the presence/absence of a disease within a wild population, the success/failure to record a specific behaviour, and the survival/death of organisms. Usually, we are interested in questions such as: how do species occurrences vary as a function of the environment?

\[Occurrences = f(Environment)\]

Under a linear model fitted with lm(), expected values can fall outside the [0, 1] range:

# set up some binary data
set.seed(42) # set a seed for reproducibility (any seed works)
Pres <- c(rep(1, 40), rep(0, 40)) # 40 presences and 40 absences
# simulate an explanatory variable whose mean differs between the two groups
rnor <- function(x) rnorm(1, mean = ifelse(x == 1, 12.5, 7.5), sd = 2)
ExpVar <- sapply(Pres, rnor)

# linear model with binary data...
lm(Pres ~ ExpVar)

When we predict the probability of observing a phenomenon \(Y\) from a binary variable, the predicted values must lie between 0 and 1: that is the only possible range for a probability!
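We can see the problem concretely by inspecting the fitted values of the linear model above (a quick check, with lm.pres as an illustrative object name; the exact range depends on the random draw):

# fitted values from lm() are not constrained to [0, 1]
lm.pres <- lm(Pres ~ ExpVar)
range(fitted(lm.pres)) # typically extends below 0 and/or above 1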

8.2 Example

Let’s build our first generalized linear model! Here, we want to fit a logistic regression model using the mites data.

# Exercise 1 - our first GLM!
# setwd('...')

mites <- read.csv("data/mites.csv", header = TRUE)
str(mites)
## 'data.frame':    70 obs. of  9 variables:
##  $ Galumna   : int  8 3 1 1 2 1 1 1 2 5 ...
##  $ pa        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ totalabund: int  140 268 186 286 199 209 162 126 123 166 ...
##  $ prop      : num  0.05714 0.01119 0.00538 0.0035 0.01005 ...
##  $ SubsDens  : num  39.2 55 46.1 48.2 23.6 ...
##  $ WatrCont  : num  350 435 372 360 204 ...
##  $ Substrate : chr  "Sphagn1" "Litter" "Interface" "Sphagn1" ...
##  $ Shrub     : chr  "Few" "Few" "Few" "Few" ...
##  $ Topo      : chr  "Hummock" "Hummock" "Hummock" "Hummock" ...

We can fit a logistic regression model of the presence of Galumna sp. as a function of water content and topography using the glm() function and its family argument:

logit.reg <- glm(pa ~ WatrCont + Topo,
                 data = mites,
                 family = binomial(link = "logit"))
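Note that the logit link is the default for the binomial family, so family = binomial alone would fit the same model.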

To see the model output, we run:

summary(logit.reg)
## 
## Call:
## glm(formula = pa ~ WatrCont + Topo, family = binomial(link = "logit"), 
##     data = mites)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0387  -0.5589  -0.1594   0.4112   2.0252  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.464402   1.670622   2.672 0.007533 ** 
## WatrCont    -0.015813   0.004535  -3.487 0.000489 ***
## TopoHummock  2.090757   0.735348   2.843 0.004466 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 91.246  on 69  degrees of freedom
## Residual deviance: 48.762  on 67  degrees of freedom
## AIC: 54.762
## 
## Number of Fisher Scoring iterations: 6

Doesn’t this structure resemble the output of summary.lm()? It does, right? However, you might notice some differences (e.g. the dispersion parameter), which we will discuss later in this book!

8.3 Challenge 1

Using the bacteria dataset (from the MASS package), model the presence of H. influenzae as a function of treatment and week of test. Start with a full model and reduce it to the most parsimonious model.

Load the MASS package and the bacteria dataset:
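# load the package and the dataset (this code produces the output below)
library(MASS)
data(bacteria)
str(bacteria)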

## 'data.frame':    220 obs. of  6 variables:
##  $ y   : Factor w/ 2 levels "n","y": 2 2 2 2 2 2 1 2 2 2 ...
##  $ ap  : Factor w/ 2 levels "a","p": 2 2 2 2 1 1 1 1 1 1 ...
##  $ hilo: Factor w/ 2 levels "hi","lo": 1 1 1 1 1 1 1 1 2 2 ...
##  $ week: int  0 2 4 11 0 2 6 11 0 2 ...
##  $ ID  : Factor w/ 50 levels "X01","X02","X03",..: 1 1 1 1 2 2 2 2 3 3 ...
##  $ trt : Factor w/ 3 levels "placebo","drug",..: 1 1 1 1 3 3 3 3 2 2 ...

These data test for the presence of the bacterium H. influenzae in children with otitis media in the Northern Territory of Australia. Dr A. Leach tested the effects of a drug on 50 children with a history of otitis media; the children were randomized to the drug or a placebo. The presence of H. influenzae was checked at weeks 0, 2, 4, 6 and 11: 30 of the checks were missing and are not included in this data frame.

Click here to see the solution!
# Challenge 1 - Solution

# Fit models (full to most parsimonious)
model.bact1 <- glm(y ~ trt * week, data = bacteria, family = binomial)
model.bact2 <- glm(y ~ trt + week, data = bacteria, family = binomial)
model.bact3 <- glm(y ~ week, data = bacteria, family = binomial)

# Let's compare these models using a likelihood ratio test (LRT).
anova(model.bact1, model.bact2, model.bact3, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: y ~ trt * week
## Model 2: y ~ trt + week
## Model 3: y ~ week
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)  
## 1       214     203.12                       
## 2       216     203.81 -2  -0.6854  0.70984  
## 3       218     210.91 -2  -7.1026  0.02869 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Which model is the best candidate?
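An equivalent shortcut is drop1(), which performs a likelihood ratio test for each term that can be dropped from a given model (a standard alternative to fitting the nested models by hand):

# test dropping the interaction term from the full model
drop1(model.bact1, test = "Chisq")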

Based on these results, we select model #2 as the best candidate to model these data: the interaction term adds no explanatory power (p = 0.71), whereas removing the treatment term significantly worsens the fit (p = 0.029).

8.4 Interpreting the output of a logistic regression

The output of our logistic regression indicates that both water content (WatrCont) and topography (Topo) are significant:

# Extracting model coefficients
summary(logit.reg)$coefficients
##                Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept)  4.46440199 1.670622482  2.672299 0.0075333598
## WatrCont    -0.01581255 0.004535069 -3.486728 0.0004889684
## TopoHummock  2.09075654 0.735348234  2.843220 0.0044660283

But how do we interpret the slope coefficients? Remember that we applied a transformation to our expected values (i.e. the probability that \(Y = 1\)), so we have to use the inverse of that transformation to properly interpret the results.
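For example, exponentiating the coefficients converts them from the log-odds scale to odds ratios (a quick illustration using the model above):

# back-transform the coefficients from log-odds to odds ratios
exp(coef(logit.reg))
# exp(-0.0158) is roughly 0.984: each additional unit of water content
# multiplies the odds of observing Galumna sp. by ~0.984 (a ~1.6% decrease)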

8.5 Predictive power and goodness-of-fit

An easy and intuitive way to evaluate the predictive power of your model is to compare its deviance to the deviance of a null model. Deviance can be understood as a generalisation of the residual sum of squares for models estimated by maximum likelihood (i.e. the way parameters are estimated in GLMs). This allows us to compute a pseudo-R2 statistic, analogous to the coefficient of determination R2 in ordinary least squares regression (i.e. the basic method for linear models). The generic formula to compute a pseudo-R2 is:

\[\text{pseudo-R}^2 = \dfrac{\text{null deviance} - \text{residual deviance}}{\text{null deviance}}\]

where “null deviance” is the deviance of the null model and “residual deviance” is the deviance of the model of interest. The difference is divided by the null deviance so that the result is bound between 0 and 1.

A null model is a model without any predictor. In R, we create a null model this way: null.model <- glm(Response.variable ~ 1, family = binomial).

The unit deviance is a measure of distance between \(y\) and \(\mu\):

\[d(y, y) = 0\]
\[d(y, \mu) > 0 \quad \forall\, y \neq \mu\]

The total deviance \(D(\mathbf{y}, \hat{\boldsymbol{\mu}})\) of a model with predictions \(\hat{\boldsymbol{\mu}}\) of the observations \(\mathbf{y}\) is the sum of its unit deviances:

\[D(\mathbf{y}, \hat{\boldsymbol{\mu}}) = \sum_i d(y_i, \hat{\mu}_i)\]

Now, the deviance of a model with estimates \(\hat{\mu} = E[Y \mid \hat{\theta}_0]\) can be defined through its likelihood:

\[D(y, \hat{\mu}) = 2\Big(\log\big(p(y \mid \hat{\theta}_s)\big) - \log\big(p(y \mid \hat{\theta}_0)\big)\Big)\]

where \(\hat{\theta}_0\) denotes the fitted parameters of the reduced model and \(\hat{\theta}_s\) denotes the fitted parameters of the saturated model.

The residual deviance is defined as twice the log-likelihood ratio of the saturated model to the reduced model (i.e. the model of interest): \[D(y, \hat{\mu}) = 2\Big(\log\big(p(\text{saturated model})\big) - \log\big(p(\text{reduced model})\big)\Big)\]

And the null deviance is defined as twice the log-likelihood ratio of the saturated model to the null model (a model whose linear predictor contains only the intercept, i.e. response ~ 1).

\[D(y, \hat{\mu}) = 2\Big(\log\big(p(\text{saturated model})\big) - \log\big(p(\text{null model})\big)\Big)\]

Now we can run this in R. Let us compare the deviance of our model (the residual deviance) to the deviance of a null model (the null deviance). The null model is a model without any explanatory variable, and it looks like this:

null.model <- glm(response.variable ~ 1, family = binomial)

The full model is a model that includes all available explanatory variables (the . notation stands for all remaining columns in the data):

full.model <- glm(response.variable ~ ., family = binomial)
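With the mites model, we can verify these definitions directly. For a binary (0/1) response, the saturated model predicts every observation perfectly, so its log-likelihood is 0 and each deviance reduces to \(-2 \log L\) of the corresponding model (a quick check, with null.reg as an illustrative object name):

# residual deviance = -2 * log-likelihood of the fitted model
-2 * as.numeric(logLik(logit.reg)) # 48.762
# null deviance = -2 * log-likelihood of the intercept-only model
null.reg <- glm(pa ~ 1, data = mites, family = binomial)
-2 * as.numeric(logLik(null.reg)) # 91.246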

Residual and null deviances are already stored in the glm object:

# extract residual and null deviances
objects(logit.reg)
##  [1] "aic"               "boundary"          "call"             
##  [4] "coefficients"      "contrasts"         "control"          
##  [7] "converged"         "data"              "deviance"         
## [10] "df.null"           "df.residual"       "effects"          
## [13] "family"            "fitted.values"     "formula"          
## [16] "iter"              "linear.predictors" "method"           
## [19] "model"             "null.deviance"     "offset"           
## [22] "prior.weights"     "qr"                "R"                
## [25] "rank"              "residuals"         "terms"            
## [28] "weights"           "xlevels"           "y"

We can then use these deviance values to calculate the pseudo-R2 value:

# calculate the pseudo-R2
pseudoR2 <- (logit.reg$null.deviance - logit.reg$deviance) / logit.reg$null.deviance
pseudoR2
## [1] 0.4655937

Hence, the model explains 46.6% of the variability in the data.

An adjusted McFadden’s pseudo-R2, which penalizes for the number of predictors, can be calculated as below:

\[ R^2_{\text{adj}} = 1 - \frac{\log L(M) - K}{\log L(M_{\text{null}})} \]

where \(K\) corresponds to the number of additional parameters in the model relative to the null model.
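As a sketch, the adjusted value can also be computed by hand from the model log-likelihoods. Note that conventions for \(K\) vary: DescTools::PseudoR2() (used below) counts all estimated parameters of the model, which is the convention reproduced here:

# adjusted McFadden's pseudo-R2 computed by hand
ll.M <- as.numeric(logLik(logit.reg)) # log-likelihood of the model
ll.0 <- as.numeric(logLik(glm(pa ~ 1, data = mites, family = binomial))) # null model
K <- attr(logLik(logit.reg), "df") # number of estimated parameters (3 here)
1 - (ll.M - K) / ll.0 # ~0.40, matching McFaddenAdj below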

The goodness-of-fit of logistic regression models can be expressed by variants of \(\text{pseudo-}R^2\) statistics, such as the Maddala (1983) or Cragg and Uhler (1970) measures.

Note that low \(\text{pseudo-}R^2\) values are common in logistic regression.

The R function DescTools::PseudoR2() makes it possible to calculate many types of \(\text{pseudo-}R^2\). By specifying which = "all", we can calculate all of them at once.

# Calculate many pseudo-R2!
logit.reg <- glm(pa ~ WatrCont + Topo,
                 data = mites,
                 family = binomial(link = "logit"))
DescTools::PseudoR2(logit.reg, which = "all")
##        McFadden     McFaddenAdj        CoxSnell      Nagelkerke   AldrichNelson 
##       0.4655937       0.3998373       0.4549662       0.6245898       0.3776866 
## VeallZimmermann           Efron McKelveyZavoina            Tjur             AIC 
##       0.6674318       0.5024101       0.7064093       0.5114661      54.7623962 
##             BIC          logLik         logLik0              G2 
##      61.5078819     -24.3811981     -45.6229593      42.4835224

8.5.1 Challenge 2

Assess goodness-of-fit and predictive power of the model.bact2 model. How can you improve the predictive power of this model?

Click here to see the solution!
# Challenge 2 - Solution

# Extract null and residual deviance
null.d <- model.bact2$null.deviance
resid.d <- model.bact2$deviance

# Calculate pseudo-R2
bact.pseudoR2 <- (null.d - resid.d) / null.d
bact.pseudoR2
## [1] 0.0624257

This is very low!
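The same value can be recovered with DescTools, since for a binary (0/1) response McFadden's pseudo-R2 coincides with this deviance-based measure:

# equivalent computation via DescTools
DescTools::PseudoR2(model.bact2, which = "McFadden")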

Adding informative explanatory variables could increase the explanatory power of the model.

But do not be afraid of non-significant results!

8.6 Visual representation of results

Once we have validated our model, we will likely want to represent our results visually. Here is an example with ggplot2. See Workshop 3 for a reminder about this package.

library(ggplot2)

ggplot(mites, aes(x = WatrCont, y = pa)) +
  geom_point() +
  stat_smooth(method = "glm", method.args = list(family = binomial), se = TRUE) +
  xlab("Water content") +
  ylab("Probability of presence") +
  ggtitle("Probability of presence of Galumna sp. against the water content") +
  theme_classic()