class: center, middle, inverse, title-slide

# Workshop 8: Generalized additive models

## QCBS R Workshop Series

### Québec Centre for Biodiversity Science

---
class: inverse, center, middle

# About this workshop

[](https://github.com/QCBSRworkshops/workshop08)
[](https://wiki.qcbs.ca/r_workshop8)
[](https://qcbsrworkshops.github.io/workshop08/workshop08-en/workshop08-en.html)
[](https://qcbsrworkshops.github.io/workshop08/workshop08-en/workshop08-en.pdf)
[](https://qcbsrworkshops.github.io/workshop08/workshop08-en/workshop08-en.R)

---
# Required packages

* [ggplot2](https://cran.r-project.org/package=ggplot2)
* [itsadug](https://cran.r-project.org/package=itsadug)
* [mgcv](https://cran.r-project.org/package=mgcv)

<br>

```r
install.packages(c('ggplot2', 'itsadug', 'mgcv'))
```

---
## Workshop overview

1. The linear model... and where it fails
2. Introduction to GAM
3. Multiple smooth terms
4. Interactions
5. Changing basis
6. Other distributions
7. Quick intro to GAMM
8. GAM behind the scenes

---
# Learning objectives

1. Use the mgcv package to fit non-linear relationships,
2. Understand the output of a GAM to help you understand your data,
3. Use tests to determine if a non-linear model fits better than a linear one,
4. Include smooth interactions between variables,
5. Understand the idea of a basis function, and why it makes GAMs so powerful,
6. Account for dependence in data (autocorrelation, hierarchical structure) using GAMMs.

---
# Prerequisites

> Some experience in R (enough to be able to run a script and examine data and R objects)

> A basic knowledge of regression (you should know what we mean by linear regression and ANOVA).

---
class: inverse, center, middle

# 1. The linear model

## ...and where it fails

---
# Linear regression

Regression is the workhorse of statistics. It allows us to model a response variable as a function of predictors plus error.

--

As we saw in the [linear models workshop](http://qcbs.ca/wiki/r_workshop4), regression makes 4 major assumptions:

1. Normally distributed error
2. Homogeneity of the variance
3. Independence of the errors
4. The response is linear: `\(y = β_0 + β_1x\)`

---
# Linear regression

There's only one way for the linear model to be right:

.center[  ]

---
# Linear regression

And yet so many ways for it to fail:

.center[  ]

---
# Linear regression

**What's the problem, and how do we fix it?**

A **linear model** tries to fit the best **straight line** that passes through the data, so it doesn't work well for all datasets.

In contrast, a **GAM** can capture complex relationships by fitting a **non-linear smooth function** through the data, while controlling how wiggly the smooth can get (more on this later).

---
class: inverse, center, middle

## 2. Introduction to GAM

---
# Generalized Additive Models (GAM)

Let's look at an example. First, we'll generate some data, and plot it.

```r
library(ggplot2)
set.seed(10)
n <- 250
x <- runif(n, 0, 5)
y_model <- 3 * x / (1 + 2 * x)
y_obs <- rnorm(n, y_model, 0.1)
data_plot <- qplot(x, y_obs) +
  geom_line(aes(y = y_model)) +
  theme_bw()
data_plot
```

---
# GAM

<img src="workshop08-en_files/figure-html/unnamed-chunk-2-1.png" width="432" style="display: block; margin: auto;" />

---
# GAM

Trying to fit these data with a linear regression model, we would violate the assumptions listed above.
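As a minimal sketch of this linear fit (reusing the `x`, `y_obs`, and `data_plot` objects simulated above; the deck's own figure code is not shown, so this is illustrative):

```r
# Overlay an ordinary least-squares fit on the simulated data
lm_fit <- lm(y_obs ~ x)
data_plot + geom_line(colour = "red", aes(y = fitted(lm_fit)))
```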
<img src="workshop08-en_files/figure-html/unnamed-chunk-3-1.png" width="432" style="display: block; margin: auto;" /> --- # GAM In GAM, the relationship between the response variable and the predictors is: `$$y = \alpha + s(x_1) + s(x_2) + ... + \epsilon$$` One big advantage of using GAM over a manual specification of the model is that the optimal shape, i.e. the degree of smoothness of `s(x)`, is determined automatically using a generalized cross-validation --- # GAM Let's try to fit the data using a smooth function with the function `mgcv::gam()` ```r library(mgcv) gam_model <- gam(y_obs ~ s(x)) summary(gam_model) data_plot <- data_plot + geom_line(colour = "blue", size = 1.2, aes(y = fitted(gam_model))) data_plot ``` --- # GAM ``` # # Family: gaussian # Link function: identity # # Formula: # y_obs ~ s(x) # # Parametric coefficients: # Estimate Std. Error t value Pr(>|t|) # (Intercept) 1.154422 0.006444 179.1 <2e-16 *** # --- # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # # Approximate significance of smooth terms: # edf Ref.df F p-value # s(x) 8.317 8.872 171.3 <2e-16 *** # --- # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 # # R-sq.(adj) = 0.859 Deviance explained = 86.3% # GCV = 0.010784 Scale est. = 0.010382 n = 250 ``` --- # GAM <img src="workshop08-en_files/figure-html/unnamed-chunk-6-1.png" width="432" style="display: block; margin: auto;" /> .comment[Note: as opposed to one fixed coefficient, \beta in linear regression, the smooth function can continually change over the range of the predictor x] --- # GAM The `mgcv` package also includes a default plot to look at the smooths: ```r plot(gam_model) ``` <img src="workshop08-en_files/figure-html/unnamed-chunk-7-1.png" width="432" style="display: block; margin: auto;" /> --- # Test for linearity using GAM We can use `gam()` and `anova()` to test whether an assumption of linearity is justified. To do so, we must simply set our smoothed model so that it is nested in our linear model. ```r linear_model <- gam(y_obs ~ x) # fit a regular linear model using gam() nested_gam_model <- gam(y_obs ~ s(x) + x) anova(linear_model, nested_gam_model, test = "Chisq") # Analysis of Deviance Table # # Model 1: y_obs ~ x # Model 2: y_obs ~ s(x) + x # Resid. Df Resid. Dev Df Deviance Pr(>Chi) # 1 248.00 6.5846 # 2 240.13 2.4988 7.8721 4.0858 < 2.2e-16 *** # --- # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` .comment[Note that the model `y_obs~s(x)` gives exactly the same results as `y_obs~s(x)+x`. We used the s(x)+x to illustrate the nestedness of the model, but the +x can be omitted.] --- # Challenge 1 ![:cube]() We will now try this comparison test with some new simulated data, just to get a handle on it. ```r n <- 250 x_test <- runif(n, -5, 5) y_test_fit <- 4 * dnorm(x_test) y_test_obs <- rnorm(n, y_test_fit, 0.2) ``` 1. Fit a linear and smoothed GAM model to the relation between `x_test` and `y_test_obs`. 2. Determine if linearity is justified for this data. 3. What is the estimated degrees of freedom of the smoothed term? <!-- we didn't talk about edf before... --> --- # Challenge 1 - Solution ![:cube]() ```r linear_model_test <- gam(y_test_obs ~ x_test) nested_gam_model_test <- gam(y_test_obs ~ s(x_test) + x_test) anova(linear_model_test, nested_gam_model_test, test="Chisq") # Analysis of Deviance Table # # Model 1: y_test_obs ~ x_test # Model 2: y_test_obs ~ s(x_test) + x_test # Resid. Df Resid. 
#   Resid. Df Resid. Dev     Df Deviance  Pr(>Chi)
# 1     248.0     78.995
# 2     240.1     10.420 7.8988   68.574 < 2.2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

---
# Challenge 1 - Solution ![:cube]()

```r
qplot(x_test, y_test_obs) + geom_line(aes(y = y_test_fit)) + theme_bw()
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-11-1.png" width="432" style="display: block; margin: auto;" />

---
# Challenge 1 - Solution ![:cube]()

```r
nested_gam_model_test

#
# Family: gaussian
# Link function: identity
#
# Formula:
# y_test_obs ~ s(x_test) + x_test
#
# Estimated degrees of freedom:
# 7.51  total = 9.4
#
# GCV score: 0.04500348  rank: 10/11
```

**Answer** Yes, non-linearity is justified. The estimated degrees of freedom (edf) are >> 1 (we'll get back to this soon).

---
class: inverse, center, middle

## 3. Multiple smooth terms

---
# GAM with multiple variables

GAMs make it easy to include both smooth and linear terms, multiple smooth terms, and smooth interactions.

For this section, we will use simulated data generated using `mgcv::gamSim()`.

```r
# ?gamSim
gam_data <- gamSim(eg = 5) # Additive model + factor
head(gam_data)

#           y x0         x1         x2         x3
# 1  4.723147  1 0.02573032 0.70706571 0.69248543
# 2  8.886671  2 0.83272144 0.84997218 0.88974095
# 3 11.196905  3 0.66302652 0.88025265 0.08469529
# 4 10.886068  4 0.11126873 0.80087554 0.15109792
# 5 12.270534  1 0.87969756 0.37692184 0.51467778
# 6  9.020910  2 0.12441532 0.05154493 0.86526950
```

We will try to model the response `y` using the predictors `x0` to `x3`.

---
# GAM with multiple variables

Let's start with a basic model, with one smooth term (`x1`) and one categorical predictor (`x0`, which has 4 levels).

```r
basic_model <- gam(y ~ x0 + s(x1), data = gam_data)
basic_summary <- summary(basic_model)
basic_summary$p.table

#             Estimate Std. Error   t value     Pr(>|t|)
# (Intercept) 8.550030  0.3655849 23.387258 1.717989e-76
# x02         2.418682  0.5165515  4.682364 3.908046e-06
# x03         4.486193  0.5156501  8.700072 9.124666e-17
# x04         6.528518  0.5204234 12.544629 1.322632e-30

basic_summary$s.table

#            edf   Ref.df        F      p-value
# s(x1) 1.923913 2.406719 42.43242 1.338683e-19
```

.comment[The `p.table` provides the significance table for each linear term. The `s.table` provides the significance table for each smooth term.]

---
# Note on estimated degrees of freedom

```r
basic_summary$s.table

#            edf   Ref.df        F      p-value
# s(x1) 1.923913 2.406719 42.43242 1.338683e-19
```

The `edf` shown in the `s.table` is the estimated degrees of freedom – essentially, a larger edf value implies a more complex, wiggly spline.

- A value close to 1 indicates an approximately linear term.
- A high value (8–10 or higher) means that the spline is highly non-linear.

> In our basic model the edf of the smooth function `s(x1)` is ~2, which suggests a non-linear curve.

---
# Note on estimated degrees of freedom

The edf in GAM is different from the degrees of freedom in a linear regression. In linear regression, the *model* degrees of freedom are equivalent to the number of non-redundant free parameters, p, in the model (and the *residual* degrees of freedom are given by n-p).

We will revisit the edf later in this workshop.
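As a minimal sketch of this difference (assuming the `basic_model` and `gam_data` objects from above), the two notions of model complexity can be put side by side:

```r
sum(basic_model$edf) # total estimated degrees of freedom of the GAM
length(coef(lm(y ~ x0 + x1, data = gam_data))) # free parameters, p, of a linear fit
```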
---
# GAM with multiple variables

```r
plot(basic_model)
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-16-1.png" width="432" style="display: block; margin: auto;" />

---
# GAM with multiple variables

We can add a second term, `x2`, but specify a linear relationship with `y`.

```r
two_term_model <- gam(y ~ x0 + s(x1) + x2, data = gam_data)
two_term_summary <- summary(two_term_model)
two_term_summary$p.table

#              Estimate Std. Error    t value     Pr(>|t|)
# (Intercept) 11.400658  0.4177614  27.289879 9.396207e-93
# x02          2.314405  0.4552138   5.084216 5.723467e-07
# x03          4.487653  0.4543299   9.877520 1.063008e-20
# x04          6.596149  0.4585778  14.383925 5.468771e-38
# x2          -5.825948  0.5436671 -10.716021 1.114046e-23

two_term_summary$s.table

#            edf   Ref.df        F      p-value
# s(x1) 1.900864 2.377544 49.85908 2.287393e-22
```

---
# GAM with multiple variables

We can add a second term, `x2`, but specify a linear relationship with `y`.

```r
plot(two_term_model)
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-18-1.png" width="432" style="display: block; margin: auto;" />

---
# GAM with multiple variables

We can also explore whether the relationship between `y` and `x2` is non-linear.

```r
two_smooth_model <- gam(y ~ x0 + s(x1) + s(x2), data = gam_data)
two_smooth_summary <- summary(two_smooth_model)
two_smooth_summary$p.table

#             Estimate Std. Error   t value      Pr(>|t|)
# (Intercept) 8.937862  0.2217506 40.305927 2.373755e-140
# x02         2.008045  0.3137690  6.399756  4.518133e-10
# x03         3.832496  0.3143049 12.193562  3.758930e-29
# x04         6.041521  0.3145299 19.208098  3.520507e-58

two_smooth_summary$s.table

#            edf   Ref.df        F       p-value
# s(x1) 2.546757 3.175726 68.10051  9.199287e-40
# s(x2) 7.726989 8.582003 81.55441 2.326028e-120
```

---
# GAM with multiple variables

We can also explore whether the relationship between `y` and `x2` is non-linear.

```r
plot(two_smooth_model, page = 1)
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-20-1.png" width="720" style="display: block; margin: auto;" />

---
# GAM with multiple variables

As before, we can perform an ANOVA to test if the smooth term is necessary.

```r
anova(basic_model, two_term_model, two_smooth_model, test = "Chisq")

# Analysis of Deviance Table
#
# Model 1: y ~ x0 + s(x1)
# Model 2: y ~ x0 + s(x1) + x2
# Model 3: y ~ x0 + s(x1) + s(x2)
#   Resid. Df Resid. Dev      Df Deviance  Pr(>Chi)
# 1    393.59     5231.6
# 2    392.62     4051.3 0.97082   1180.2 < 2.2e-16 ***
# 3    384.24     1839.5 8.38019   2211.8 < 2.2e-16 ***
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

.alert[The best-fit model is the model with smooth terms for both `x1` and `x2`.]

---
# Challenge 2 ![:cube]()

<br>

1. Create 2 new models, with `x3` as a linear and then as a smooth term.
2. Determine if `x3` is an important term to include using plots, coefficient tables and the anova function.
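Hint: besides `anova()`, candidate models can also be compared with `AIC()`. A quick sketch using the models fitted above (lower AIC indicates better support):

```r
AIC(basic_model, two_term_model, two_smooth_model)
```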
---
# Challenge 2 - Solution ![:cube]()

<br>

```r
three_term_model <- gam(y ~ x0 + s(x1) + s(x2) + x3, data = gam_data)
three_smooth_model <- gam(y ~ x0 + s(x1) + s(x2) + s(x3), data = gam_data)
three_smooth_summary <- summary(three_smooth_model)
```

---
# Challenge 2 - Solution ![:cube]()

```r
plot(three_smooth_model, page = 1)
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-24-1.png" width="576" style="display: block; margin: auto;" />

---
# Challenge 2 - Solution ![:cube]()

```r
three_smooth_summary$s.table

#            edf   Ref.df           F       p-value
# s(x1) 2.542296 3.170441 67.75922314  1.577435e-39
# s(x2) 7.731424 8.584355 80.90188472 2.006819e-119
# s(x3) 1.000000 1.000000  0.02039697  8.865087e-01

# edf = 1, therefore the term is linear.

anova(two_smooth_model, three_term_model, test = "Chisq")

# Analysis of Deviance Table
#
# Model 1: y ~ x0 + s(x1) + s(x2)
# Model 2: y ~ x0 + s(x1) + s(x2) + x3
#   Resid. Df Resid. Dev      Df Deviance Pr(>Chi)
# 1    384.24     1839.5
# 2    383.25     1839.3 0.99707  0.18818   0.8418

# the term x3 is not significant, it should be dropped!
```

---
class: inverse, center, middle

## 4. Interactions

---
# GAM with interaction terms

There are 2 ways to include interactions between variables:

- for 2 smooth variables: `s(x1, x2)`
- for one smooth variable and one linear variable (either factor or continuous): use the `by` argument, `s(x1, by = x2)`
  - When `x2` is a factor, you have a smooth term that varies between the different levels of `x2`
  - When `x2` is continuous, the linear effect of `x2` varies smoothly with `x1`
  - When `x2` is a factor, the factor also needs to be added as a main effect in the model

---
# GAM with interaction terms

We will examine the interaction effect using our categorical variable `x0` and ask whether the non-linear smoother `s(x2)` varies across the different levels of `x0`.

```r
factor_interact <- gam(y ~ x0 + s(x1) + s(x2, by = x0), data = gam_data)
summary(factor_interact)$s.table

#                edf   Ref.df        F      p-value
# s(x1)     2.401350 2.996664 67.18567 6.863193e-38
# s(x2):x01 6.606775 7.708292 21.40363 3.077275e-27
# s(x2):x02 6.436261 7.571201 19.96801 4.916484e-25
# s(x2):x03 5.467172 6.618716 28.75426 2.871338e-32
# s(x2):x04 6.422564 7.574388 26.43680 2.075813e-33
```

---
# GAM with interaction terms

```r
plot(factor_interact, page = 1)
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-27-1.png" width="576" style="display: block; margin: auto;" />

---
# GAM with interaction terms

We can also visualise our model in 3D using `vis.gam`, where `theta` is the degree of rotation on the x-y plane.

```r
vis.gam(factor_interact, view = c("x2", "x0"), theta = 40, n.grid = 500, border = NA)
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-28-1.png" width="432" style="display: block; margin: auto;" />

---
# GAM with interaction terms

Let's perform a model comparison using ANOVA to determine if the interaction term is necessary.

```r
anova(two_smooth_model, factor_interact, test = "Chisq")

# Analysis of Deviance Table
#
# Model 1: y ~ x0 + s(x1) + s(x2)
# Model 2: y ~ x0 + s(x1) + s(x2, by = x0)
#   Resid. Df Resid. Dev     Df Deviance Pr(>Chi)
# 1    384.24     1839.5
# 2    363.53     1740.9 20.712   98.608   0.4482
```

From the plots, we saw that the shapes of the smooth terms were comparable among the 4 levels of `x0`. The anova test confirms this as well (p > 0.05).
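---
# GAM with interaction terms

For completeness, here is a minimal sketch of the continuous `by` case (a varying-coefficient model), in which the linear effect of `x3` varies smoothly along `x1`. This model is illustrative only and is not part of the comparisons in this section:

```r
# The coefficient of x3 is allowed to change smoothly with x1
continuous_interact <- gam(y ~ x0 + s(x2) + s(x1, by = x3), data = gam_data)
summary(continuous_interact)$s.table
```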
---
# GAM with interaction terms

Finally, we'll look at the interaction between 2 smooth terms, `x1` and `x2`.

```r
smooth_interact <- gam(y ~ x0 + s(x1, x2), data = gam_data)
summary(smooth_interact)$s.table

#               edf   Ref.df        F       p-value
# s(x1,x2) 25.91803 28.42892 36.40735 8.635007e-165
```

---
# GAM with interaction terms

```r
plot(smooth_interact, page = 1, scheme = 3)
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-31-1.png" width="504" style="display: block; margin: auto;" />

---
# GAM with interaction terms

```r
vis.gam(smooth_interact, view = c("x1", "x2"), theta = 40, n.grid = 500, border = NA)
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-32-1.png" width="576" style="display: block; margin: auto;" />

```r
# similar to plot(smooth_interact, page = 1, scheme = 1)
```

---
# GAM with interaction terms

```r
anova(two_smooth_model, smooth_interact, test = "Chisq")

# Analysis of Deviance Table
#
# Model 1: y ~ x0 + s(x1) + s(x2)
# Model 2: y ~ x0 + s(x1, x2)
#   Resid. Df Resid. Dev     Df Deviance Pr(>Chi)
# 1    384.24     1839.5
# 2    367.57     1710.3 16.671    129.2  0.04063 *
# ---
# Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The interaction between `s(x1)` and `s(x2)` is significant, and the 2D plot nicely illustrates this non-linear interaction, where `y` is large at high values of `x1` but low to mid values of `x2`.

---
class: inverse, center, middle

## 5. Changing basis

---
# Expanding on the basic GAM

It is possible to expand further on the basic GAM model with:

1. more complicated smooths, by changing the basis,
2. other distributions: anything you can do with a GLM, using the `family` argument,
3. mixed effect models, using `gamm` or the `gamm4` package.

We will now go over these 3 extensions.

---
# Other smooth functions

To model a non-linear smooth variable or surface, 3 different smooth functions are available:

`s()`  for modeling a 1-dimensional smooth, or for modeling interactions among variables measured using the same unit and the same scale

`te()`  for modeling 2- or n-dimensional interaction surfaces of variables that are not on the same scale. Includes main effects.

`ti()`  for modeling 2- or n-dimensional interaction surfaces that do not include the main effects.

---
# Parameters of smooth functions

The smooth functions have several parameters that can be set to change their behavior. The most often-used parameters are:

`k`  number of 'knots'

- determines the upper bound of the number of basis functions used to build the curve.
- constrains the wiggliness of a smooth.
- the number of basis functions is reflected in the edf value.
- the default k for `s()` is ~9, and for `te()` and `ti()` is 5 per dimension.
- k should be < the number of unique data points.

---
# Parameters of smooth functions

The smooth functions have several parameters that can be set to change their behavior. The most often-used parameters are:

`d`  specifies that predictors in the interaction are on the same scale or dimension (only used in `te()` and `ti()`).

- For example, `te(Time, width, height, d = c(1, 2))` indicates that `width` and `height` are on the same scale, but `Time` is not.

`bs`  specifies the type of underlying basis functions.

- the default for `s()` is `tp` (thin plate regression spline) and for `te()` and `ti()` is `cr` (cubic regression spline).
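---
# Parameters of smooth functions

As a minimal sketch of how these arguments fit together (reusing the simulated `gam_data`; the model itself is illustrative only):

```r
# Tensor product smooth for an interaction between predictors that may be on
# different scales, with basis size and type set explicitly for each margin:
te_model <- gam(y ~ te(x1, x2, k = c(5, 5), bs = c("tp", "tp")), data = gam_data)
summary(te_model)$s.table
```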
---
# Example smooth for cyclical data

Cyclical data is a good example where changing the basis is useful: you want the predictor to match at the ends.

Let's use a time series of climate data, with monthly measurements, and see if there's a temporal trend in yearly temperature.

```r
data(nottem) # Nottingham temperature time series
n_years <- length(nottem) / 12
nottem_month <- rep(1:12, times = n_years)
nottem_year <- rep(1920:(1920 + n_years - 1), each = 12)
qplot(nottem_month, nottem,
      colour = factor(nottem_year), geom = "line") + theme_bw()
```

---
# Example smooth for cyclical data

<img src="workshop08-en_files/figure-html/unnamed-chunk-35-1.png" width="864" style="display: block; margin: auto;" />

---
# Example smooth for cyclical data

We can model both the cyclic change of temperature across months and the non-linear trend through years, using a cyclic cubic spline, or `cc`, for the month variable and a regular smooth for the year variable.

```r
year_gam <- gam(nottem ~ s(nottem_year) + s(nottem_month, bs = "cc"))
summary(year_gam)$s.table

#                      edf   Ref.df          F    p-value
# s(nottem_year)  2.333879 2.906998   2.528043 0.07266502
# s(nottem_month) 4.923943 8.000000 390.029032 0.00000000
```

---
# Example smooth for cyclical data

```r
plot(year_gam, page = 1, scale = 0)
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-37-1.png" width="576" style="display: block; margin: auto;" />

There is about a 1–1.5 degree rise in temperature over the period, but within a given year there is about 20 degrees of variation in temperature, on average. The actual data vary around these values, and that is the unexplained variance.

---
class: inverse, center, middle

## 6. Other distributions

---
# GAM using other distributions

Let's now take a look at how to use GAMs when the response variable does not follow a normal distribution, for instance when it is count or proportion data (e.g., binomial, Poisson, negative binomial, Gamma).

We will use an example dataset where a binomial distribution is needed; the response variable represents the number of successes vs failures over the course of an experiment.

```r
gam_data3 <- read.csv("data/other_dist.csv")
str(gam_data3)

# 'data.frame': 514 obs. of 4 variables:
#  $ prop : num 1 1 1 1 0 1 1 1 1 1 ...
#  $ total: int 4 20 20 18 18 18 20 20 20 20 ...
#  $ x1   : int 550 650 750 850 950 650 750 850 950 550 ...
#  $ fac  : chr "f1" "f1" "f1" "f1" ...
```

---
# GAM using other distributions

```r
plot(range(gam_data3$x1), c(0, 1), type = "n",
     main = "Probability of successes over time",
     ylab = "Probability", xlab = "x1 (time)")
abline(h = 0.5)
avg <- aggregate(prop ~ x1, data = gam_data3, mean)
lines(avg$x1, avg$prop, col = "orange", lwd = 2)
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-39-1.png" width="360" style="display: block; margin: auto;" />

---
# GAM using other distributions

We will test if this trend is linear or not using a logistic GAM (we use a binomial family distribution given that our response is proportion data).

```r
prop_model <- gam(prop ~ s(x1), data = gam_data3,
                  weights = total, family = "binomial")
prop_summary <- summary(prop_model)
prop_summary$p.table

#             Estimate Std. Error  z value Pr(>|z|)
# (Intercept) 1.173978 0.02709613 43.32641        0
prop_summary$s.table

#            edf   Ref.df   Chi.sq       p-value
# s(x1) 4.591542 5.615235 798.9407 1.677701e-164
```

.comment[What does the intercept represent in this model? What does the smooth term indicate?]

---
# GAM using other distributions

```
#             Estimate Std. Error  z value Pr(>|z|)
# (Intercept) 1.173978 0.02709613 43.32641        0
```

.comment[What does the intercept represent in this model?]

Recall that the model uses the count data to calculate the logit, which is the log odds ratio between successes and failures:

.small[
- If successes = failures, the ratio is 1 and the logit is 0 (log(1) = 0).
- If successes > failures, the ratio is > 1 and the logit has a positive value (log(2) = 0.69).
- If successes < failures, the ratio is < 1 and the logit has a negative value (log(0.5) = -0.69).
]

--

> Here, the estimated intercept coefficient is positive, which means that there are more successes than failures overall.

---
# GAM using other distributions

```
#            edf   Ref.df   Chi.sq       p-value
# s(x1) 4.591542 5.615235 798.9407 1.677701e-164
```

.comment[What does the smooth term indicate?]

This represents how the log odds of successes vs failures changes over time (`x1`).

--

> As the edf > 1, the trend is non-linear: the proportion of successes increases faster and faster over time.

<img src="workshop08-en_files/figure-html/unnamed-chunk-43-1.png" width="309.6" style="display: block; margin: auto;" />

---
# Visualizing the trend over time

There are different ways this relationship can be represented graphically:

- **partial effects** are the isolated effects of one particular predictor or interaction. If you visualize your GAM model with `plot()`, you get the partial effects.
- **summed effects** are the predicted response measures for a given value or level of the predictors. If you visualize your GAM model with `itsadug::plot_smooth()`, you get the summed effects.

---
# Visualizing the trend over time

What do these plots tell us about successes vs failures?

<img src="workshop08-en_files/figure-html/unnamed-chunk-44-1.png" width="648" style="display: block; margin: auto;" />

.pull-left[
**Contribution / partial effect**

Over time the log odds increases, so over time successes increase and failures decrease.]

.pull-right[
**Fitted values, summed effect, intercept included**

Equal amounts of successes and failures up to x1 = 400.
]

---
# Visualizing the trend over time

Lastly, to help interpret the results, we can transform the summed effects back to proportions with the function `itsadug::plot_smooth()`:

```r
plot_smooth(prop_model, view = "x1", main = "",
            transform = plogis, ylim = c(0, 1), print.summary = FALSE)
abline(h = 0.5, v = 400, col = 'red', lty = 2) # the trend crosses 0.5 at x1 = 400
```

<img src="workshop08-en_files/figure-html/fig.width==4-1.png" width="432" style="display: block; margin: auto;" />

As in the logit plot, the proportion of successes increases above 0.5 at x1 = 400.

---
class: inverse, center, middle

# 7. Quick intro to GAMM

---
# Dealing with non-independence

When observations are not independent, GAMs can be used to incorporate either:

- a serial correlation structure to model residual autocorrelation (autoregressive AR, moving average MA, or a combination of the two, ARMA),
- random effects that account for dependence among observations from the same site.

Both options are sketched on the next slide.
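---
# Dealing with non-independence

As a minimal sketch of the two approaches with `mgcv::gamm()`, on hypothetical simulated data (the variables `site`, `time`, and the data frame `df` are made up for illustration; the worked examples follow):

```r
library(mgcv) # gamm(); mgcv attaches nlme, which provides corAR1()
# Hypothetical data: 10 sites, each followed over 20 time steps
df <- data.frame(site = factor(rep(1:10, each = 20)), time = rep(1:20, 10))
df$y <- sin(df$time / 3) + rnorm(200, sd = 0.3)
# (a) serial correlation: AR(1) residuals within each site
m_ar1 <- gamm(y ~ s(time), correlation = corAR1(form = ~ time | site), data = df)
# (b) hierarchical structure: a random intercept for each site
m_re <- gamm(y ~ s(time), random = list(site = ~1), data = df)
```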
---
# Model with correlated errors

Let's have a look at a model with temporal autocorrelation in the residuals. We will revisit the Nottingham temperature model and test for correlated errors using the (partial) autocorrelation function.

```r
par(mfrow = c(1, 2))
acf(resid(year_gam), lag.max = 36, main = "ACF")
pacf(resid(year_gam), lag.max = 36, main = "pACF")
```

---
# Model with correlated errors

<img src="workshop08-en_files/figure-html/unnamed-chunk-46-1.png" width="648" style="display: block; margin: auto;" />

.comment[The ACF (and pACF) give the correlation (and partial correlation) of a time series with itself at different time lags, and are used to identify after how many time steps observations start to be independent.]

The ACF plot of our model residuals suggests a significant lag of 1, and perhaps a lag of 2. Therefore, a low-order AR model is likely needed.

---
# Model with correlated errors

We can test for autocorrelation by adding AR structures to the model: AR(1) (correlation at 1 time step) and AR(2) (correlation at 2 time steps).

```r
year_gam <- gamm(nottem ~ s(nottem_year) + s(nottem_month, bs = "cc"))
year_gam_AR1 <- gamm(nottem ~ s(nottem_year) + s(nottem_month, bs = "cc"),
                     correlation = corARMA(form = ~ 1 | nottem_year, p = 1),
                     data = data.frame(nottem, nottem_year, nottem_month))
year_gam_AR2 <- gamm(nottem ~ s(nottem_year) + s(nottem_month, bs = "cc"),
                     correlation = corARMA(form = ~ 1 | nottem_year, p = 2),
                     data = data.frame(nottem, nottem_year, nottem_month))
anova(year_gam$lme, year_gam_AR1$lme, year_gam_AR2$lme)

#                  Model df      AIC      BIC    logLik   Test  L.Ratio p-value
# year_gam$lme         1  5 1109.908 1127.311 -549.9538
# year_gam_AR1$lme     2  6 1101.218 1122.102 -544.6092 1 vs 2 10.68921  0.0011
# year_gam_AR2$lme     3  7 1101.598 1125.962 -543.7988 2 vs 3  1.62082  0.2030
```

.comment[AR(1) provides a significant increase in fit over the naive model (LRT = 10.69, p = 0.0011), but there is little improvement with AR(2) (LRT = 1.62, p = 0.203).]

---
# Mixed modelling

As we saw in the section on changing basis, `bs` specifies the type of underlying basis function. For random intercepts and linear random slopes we use `bs = "re"`, but for random smooths we use `bs = "fs"`.

--

**3 different types of random effects** in GAMMs (`fac`  factor coding for the random effect; `x0`  continuous fixed effect):

- **random intercepts** adjust the height of other model terms with a constant value: `s(fac, bs = "re")`
- **random slopes** adjust the slope of the trend of a numeric predictor: `s(fac, x0, bs = "re")`
- **random smooths** adjust the trend of a numeric predictor in a non-linear way: `s(x0, fac, bs = "fs", m = 1)`, where the argument `m = 1` sets a heavier penalty for the smooth moving away from 0, causing shrinkage towards the mean.

---
# GAMM with a random intercept

As before, we will use the `gamSim()` function to generate a dataset, here with a random effect, then run a model with a random intercept using `fac` as the random factor.

```r
gam_data2 <- gamSim(eg = 6) # 4 term additive + random effect

# Gu & Wahba 4 term additive model

str(gam_data2)

# 'data.frame': 400 obs. of 11 variables:
#  $ y  : num 8.72 10.75 15.3 16.79 11.55 ...
#  $ x0 : num 0.277 0.698 0.379 0.869 0.568 ...
#  $ x1 : num 0.407 0.3657 0.235 0.4764 0.0666 ...
#  $ x2 : num 0.9221 0.0491 0.0245 0.9052 0.6511 ...
#  $ x3 : num 0.0254 0.6572 0.2899 0.0727 0.8257 ...
#  $ f  : num 6.8 10.42 12.57 15.44 9.39 ...
#  $ f0 : num 1.528 1.625 1.857 0.799 1.955 ...
#  $ f1 : num 2.26 2.08 1.6 2.59 1.14 ...
#  $ f2 : num 0.0183 0.716 0.1143 0.0486 3.2902 ...
#  $ f3 : num 0 0 0 0 0 0 0 0 0 0 ...
#  $ fac: Factor w/ 4 levels "1","2","3","4": 1 2 3 4 1 2 3 4 1 2 ...
```

---
# GAMM with a random intercept

```r
gamm_intercept <- gam(y ~ s(x0) + s(fac, bs = "re"), data = gam_data2)
summary(gamm_intercept)$s.table

#             edf   Ref.df         F      p-value
# s(x0)  3.044725 3.776643  2.481658 3.774254e-02
# s(fac) 2.960269 3.000000 95.146996 2.696335e-54

plot(gamm_intercept, select = 2)
```

<img src="workshop08-en_files/figure-html/unnamed-chunk-49-1.png" width="396" style="display: block; margin: auto;" />

---
# GAMM with a random intercept

We can plot the summed effects for `x0` without the random effects, and then plot the predictions for all 4 levels of the random `fac` effect:

```r
par(mfrow = c(1, 2), cex = 1.1)
plot_smooth(gamm_intercept, view = "x0", rm.ranef = TRUE,
            main = "intercept + s(x0)")
plot_smooth(gamm_intercept, view = "x0", cond = list(fac = "1"),
            main = "... + s(fac)", col = 'orange', ylim = c(8, 21))
plot_smooth(gamm_intercept, view = "x0", cond = list(fac = "2"),
            add = TRUE, col = 'red')
plot_smooth(gamm_intercept, view = "x0", cond = list(fac = "3"),
            add = TRUE, col = 'purple')
plot_smooth(gamm_intercept, view = "x0", cond = list(fac = "4"),
            add = TRUE, col = 'turquoise')
```

---
# GAMM with a random intercept

<br>

<img src="workshop08-en_files/figure-html/unnamed-chunk-51-1.png" width="864" style="display: block; margin: auto;" />

.pull-right[
<font color="orange">fac1</font>
<font color="red">fac2</font>
<font color="purple">fac3</font>
<font color="turquoise">fac4</font>
]

---
# GAMM with a random slope

```r
gamm_slope <- gam(y ~ s(x0) + s(x0, fac, bs = "re"), data = gam_data2)
summary(gamm_slope)$s.table

#                edf   Ref.df         F      p-value
# s(x0)     2.961019 3.673378  1.444919 1.804908e-01
# s(x0,fac) 2.946695 3.000000 72.392941 7.346091e-42
```

---
# GAMM with a random slope

```r
par(mfrow = c(1, 2), cex = 1.1)
plot_smooth(gamm_slope, view = "x0", rm.ranef = TRUE,
            main = "intercept + s(x0)")
plot_smooth(gamm_slope, view = "x0", cond = list(fac = "1"),
            main = "... + s(x0, fac)", col = 'orange', ylim = c(7, 22))
plot_smooth(gamm_slope, view = "x0", cond = list(fac = "2"),
            add = TRUE, col = 'red')
plot_smooth(gamm_slope, view = "x0", cond = list(fac = "3"),
            add = TRUE, col = 'purple')
plot_smooth(gamm_slope, view = "x0", cond = list(fac = "4"),
            add = TRUE, col = 'turquoise')
```

---
# GAMM with a random slope

<br>

<img src="workshop08-en_files/figure-html/unnamed-chunk-54-1.png" width="864" style="display: block; margin: auto;" />

---
# GAMM with a random intercept and slope

```r
gamm_int_slope <- gam(y ~ s(x0) + s(fac, bs = "re") + s(fac, x0, bs = "re"),
                      data = gam_data2)
summary(gamm_int_slope)$s.table

#                 edf   Ref.df          F      p-value
# s(x0)     3.0121856 3.735986   2.399772 4.299127e-02
# s(fac)    2.8151182 3.000000 154.712380 7.093989e-31
# s(fac,x0) 0.7259132 3.000000   9.273901 7.964598e-02
```

---
# GAMM with a random intercept and slope

```r
par(mfrow = c(1, 2), cex = 1.1)
plot_smooth(gamm_int_slope, view = "x0", rm.ranef = TRUE,
            main = "intercept + s(x0)")
plot_smooth(gamm_int_slope, view = "x0", cond = list(fac = "1"),
+ s(fac) + s(fac, x0)", col = 'orange', ylim = c(7,22)) plot_smooth(gamm_int_slope, view = "x0", cond = list(fac = "2"), add = T, col='red') plot_smooth(gamm_int_slope, view = "x0", cond = list(fac = "3"), add = T, col = 'purple') plot_smooth(gamm_int_slope, view = "x0", cond = list(fac = "4"), add = T, col = 'turquoise') ``` --- # GAMM with a random intercept and slope <br> <img src="workshop08-en_files/figure-html/unnamed-chunk-57-1.png" width="864" style="display: block; margin: auto;" /> --- # GAMM with a random intercept and slope Note that the random slope is static in this case: ```r plot(gamm_int_slope, select = 3) ``` <img src="workshop08-en_files/figure-html/unnamed-chunk-58-1.png" width="432" style="display: block; margin: auto;" /> --- # GAMM with a random smooth ```r gamm_smooth <- gam(y ~ s(x0, fac, bs = "fs", m = 1), data = gam_data2) summary(gamm_smooth)$s.table # edf Ref.df F p-value # s(x0,fac) 4.814005 35 8.122226 4.077091e-57 ``` --- # GAMM with a random smooth Here, if the random slope varied along `x0`, we would see different curves for each level: ```r plot(gamm_smooth, select = 1) ``` <img src="workshop08-en_files/figure-html/unnamed-chunk-60-1.png" width="396" style="display: block; margin: auto;" /> --- # GAMM with a random smooth ```r par(mfrow = c(1,2), cex = 1.1) plot_smooth(gamm_smooth, view = "x0", rm.ranef = T, main = "intercept + s(x0)") plot_smooth(gamm_smooth, view = "x0", cond = list(fac = "1"), main="... + s(x0, fac)", col = 'orange', ylim = c(7,22)) plot_smooth(gamm_smooth, view = "x0", cond = list(fac = "2"), add = T, col='red') plot_smooth(gamm_smooth, view = "x0", cond = list(fac = "3"), add = T, col = 'purple') plot_smooth(gamm_smooth, view = "x0", cond = list(fac = "4"), add = T, col = 'turquoise') ``` --- # GAMM with a random smooth <br> <img src="workshop08-en_files/figure-html/unnamed-chunk-62-1.png" width="864" style="display: block; margin: auto;" /> .comment[Here, if the random slope varied along `x0`, we would see different curves for each level.] --- # GAMM All of the mixed models from this section can be compared using `anova()` to determine the best fit model ```r anova(gamm_intercept, gamm_slope, gamm_int_slope, gamm_smooth, test = "Chisq") # Analysis of Deviance Table # # Model 1: y ~ s(x0) + s(fac, bs = "re") # Model 2: y ~ s(x0) + s(x0, fac, bs = "re") # Model 3: y ~ s(x0) + s(fac, bs = "re") + s(fac, x0, bs = "re") # Model 4: y ~ s(x0, fac, bs = "fs", m = 1) # Resid. Df Resid. Dev Df Deviance Pr(>Chi) # 1 392.22 6554.0 # 2 392.33 7290.6 -0.10372 -736.60 6.687e-13 *** # 3 391.11 6532.7 1.21687 757.94 2.551e-11 *** # 4 392.64 6690.5 -1.52776 -157.89 0.004796 ** # --- # Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- class: inverse, center, middle # 8. GAM behind the scene --- # A closer look at GAM We will now take a few minutes to look at what GAMs are doing behind the scenes. Lets first consider a model containing one smooth function of one covariate, `\(x_i\)`: `$$y_i = f(x_i) + ε_i$$` To estimate the smooth function `\(f\)`, we need to represented the above equation in such a way that it becomes a linear model. This can be done by choosing a basis, `\(b_i(x)\)`, defining the space of functions of which `\(f\)` is an element: `$$f(x) = \sum_{i=1}^q b_i(x) \times β_i$$` --- # Example: a polynomial basis Suppose that `\(f\)` is believed to be a 4th order polynomial, so that the space of polynomials of order 4 and below contains `\(f\)`. 
---
# Example: a polynomial basis

Suppose that `\(f\)` is believed to be a 4th order polynomial, so that the space of polynomials of order 4 and below contains `\(f\)`. A basis for this space would then be:

`$$b_1(x)=1, b_2(x)=x, b_3(x)=x^2, b_4(x)=x^3, b_5(x)=x^4$$`

so that `\(f(x)\)` becomes:

`$$f(x) = β_1 + x_iβ_2 + x^2_iβ_3 + x^3_iβ_4 + x^4_iβ_5$$`

and the full model now becomes:

`$$y_i = β_1 + x_iβ_2 + x^2_iβ_3 + x^3_iβ_4 + x^4_iβ_5 + ε_i$$`

---
# Example: a polynomial basis

The basis functions are each multiplied by a real valued parameter, `\(β_i\)`, and are then summed to give the <font color="orange">final curve `\(f(x)\)`</font>.

.center[  ]

By varying the `\(β_i\)` we can vary the form of `\(f(x)\)` to produce any polynomial function of order 4 or lower.

---
# Example: a cubic spline basis

A cubic spline is a curve constructed from sections of cubic polynomials joined together so that they are continuous in value. Each section of cubic has different coefficients.

.center[  ]

---
# Example: a cubic spline basis

Here's a representation of a smooth function using a rank 5 cubic spline basis with knot locations at increments of 0.2:

.center[  ]

Here, the knots are evenly spaced through the range of observed x values. However, the choice of the degree of model smoothness is controlled by the number of knots, which was arbitrary.

.comment[Is there a better way to select the knot locations?]

---
# Controlling the degree of smoothing with penalized regression splines

Instead of controlling smoothness by altering the number of knots, we keep that fixed at a size a little larger than reasonably necessary, and control the model's smoothness by adding a "wiggliness" penalty.

So, rather than fitting the model by minimizing (as with least squares regression):

`$$||y - XB||^{2}$$`

it can be fit by minimizing:

`$$||y - XB||^{2} + \lambda \int_0^1[f^{''}(x)]^2dx$$`

As `\(\lambda\)` goes to `\(\infty\)`, the model becomes linear.

---
# Controlling the degree of smoothing with penalized regression splines

If `\(\lambda\)` is too high, the data will be over-smoothed; if it is too low, the data will be under-smoothed.

Ideally, it would be good to choose `\(\lambda\)` so that the predicted `\(\hat{f}\)` is as close as possible to `\(f\)`. A suitable criterion might be to choose `\(\lambda\)` to minimize:

`$$M = \frac{1}{n} \sum_{i=1}^n (\hat{f_i} - f_i)^2$$`

Since `\(f\)` is unknown, `\(M\)` is estimated using a generalized cross-validation technique that leaves out each datum from the data in turn and considers the average ability of models fitted to the remaining data to predict the left-out datum.

---
# Principle behind cross validation

.center[  ]

1. fits many of the data poorly and does no better with the missing point.

--

2. fits the underlying signal quite well, smoothing through the noise and the missing datum is reasonably well predicted.

--

3. fits the noise as well as the signal and the extra variability induced causes it to predict the missing datum rather poorly.

---
# Principle behind cross validation

.center[  ]
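---
# Controlling the degree of smoothing with penalized regression splines

As a minimal sketch of the role of `\(\lambda\)` (reusing `x` and `y_obs` from section 2), the `sp` argument of `s()` fixes the smoothing parameter instead of letting cross-validation estimate it:

```r
gam_low  <- gam(y_obs ~ s(x, sp = 0))   # almost no penalty: a wiggly fit
gam_high <- gam(y_obs ~ s(x, sp = 1e5)) # heavy penalty: a nearly linear fit
sum(gam_low$edf); sum(gam_high$edf)     # the edf shrink as lambda grows
```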
---
# Brief note on estimated degrees of freedom (edf)

How many degrees of freedom does a fitted GAM have?

Instead of providing the output of the cross-validation in terms of `\(\lambda\)` (model complexity), the `gam()` function in the `mgcv` package uses a term called the estimated degrees of freedom (edf).

Because the number of free parameters in GAMs is difficult to define, the edf are related to `\(\lambda\)`, such that the greater the penalty, the smaller the edf.

For example, if the arbitrarily large number of knots is k = 10, then k-1 sets the upper limit on the edf associated with a smooth term. This number then decreases as the penalty `\(\lambda\)` increases, until the best-fit penalty is found by cross-validation.

---
# Resources

There's a great deal more out there on GAMs… this workshop covered just the surface.

Simon Wood, the author of the `mgcv` package, has a very useful [website](http://people.bath.ac.uk/sw283/mgcv/) with introductory talks and notes on how to use GAMs. He has also written a book, *Generalized Additive Models: An Introduction with R*, which we used as a reference for this workshop.

Material for this workshop was also drawn from the following fantastic blogs and tutorials:

- [From the bottom of the heap](http://www.fromthebottomoftheheap.net/blog/)
- [Overview GAMM analysis of time series data](http://www.sfs.uni-tuebingen.de/~jvanrij/Tutorial/GAMM.html)
- [Advanced Analysis of Time series data](http://www.sfs.uni-tuebingen.de/~jvanrij/LSA2015/AnswersLab2.html)

Finally, the help pages, available through `?gam` in R, are an excellent resource.

---
class: inverse, center, bottom

# Thank you for attending this workshop!