# Chapter 6 Redundancy analysis

Redundancy Analysis (RDA) is a direct extension of multiple regression, as it models the effect of an explanatory matrix \(X\) (n x p) on a response matrix \(Y\) (n x m). The difference here is that we can model effect of an explanatory matrix on a **response matrix**, rather than a single response variable. For example, RDA allows us to model the effect of environmental variables on the entire community, rather than on species richness. This is done by performing an ordination of \(Y\) to obtain ordination axes that are linear combinations of the variables in \(X\).

Note that the explanatory variables in \(X\) can be **quantitative, qualitative or binary variables**. If they are quantitative, explanatory variables in \(X\) must be **centered**, **standardized** (if explanatory variables are in different units), **transformed** (to limit the skew of explanatory variables) or **normalized** (to linearize relationships) following the same principles as in PCA. Collinearity between the \(X\) variables should also be reduced before RDA. We began this process while exploring the data: our community data is Hellinger-transformed, and our environmental variables are centered and standardized. However, we still have some collinearity issues that have not been solved.

Sometimes, we have more explanatory variables than we need to understand the drivers of our response variable. The best way to build a model is always to use **ecological reasoning** when determining which variables should be included or excluded. However, if there are still too many variables included in the model, or if some of them are highly collinear, explanatory variables can be selected by forward, backward or stepwise **selection** that remove non-significant explanatory variables. However, this approach should always be taken after the *ecological* selection of variables, according to your knowledge of the system. We will see more about this later!

## 6.1 How the RDA works

Redundancy analysis as a two-step process (P. Legendre and Legendre 2012). The first step is a multiple regression, where each object in \(Y\) is regressed on the explanatory variables in \(X\), which results in a matrix of fitted values \(Y_{fit}\). This step is calculated through the following linear equation:

\[Y_{fit} = X[X'X]^{-1}X'Y\]

In the second step, we apply a principal components analysis (PCA) on the fitted matrix \(Y_{fit}\) to reduce dimensionality, i.e. to obtain the eigenvalues and eigenvectors. We then obtain a matrix \(Z\) which contains the canonical axes, which correspond to *linear* combinations of the explanatory variables in the space of \(X\). The *linearity* of the combinations of the \(X\) variables is a fundamental property of RDA. In the analysis of community composition, these canonical axes are interpreted as complex environmental gradients.

Once the RDA is computed, several statistics can be computed to interpret the explanatory power of the included variables and whether the observed relationships are significant. These include:

- \(R^2\), which measures the strength of the canonical relationship between \(Y\) and \(X\) by calculating the proportion of the variation of \(Y\) explained by the variables in \(X\),

- Adjusted \(R^2\), which also measures the strength of the relationship between \(Y\) and \(X\), but applies a
**correction**of the \(R^2\) to take into account the number of explanatory variables.*This is the statistic that should be reported.* - The F-statistic corresponds to an overall test of significance of an RDA by comparing the computed model to a null model. This test is based on the null hypothesis that the strength of the linear relationship calculated by the \(R^2\) is not larger than the value that would be obtained for unrelated \(Y\) and \(X\) matrices of the same size. Note that F-statistics can also be used to sequentially test the significance of each canonical axis.

## 6.2 Running an RDA in R

An RDA can be computed using the function `rda()`

from the `vegan`

package, as follows:

**Step 1**: Standardize and/or transform the data.

We already applied a Hellinger transformation to our community matrix, and standardized our explanatory variables in the section: 4. However, we noticed that the variable `das`

was collinear with several other variables. We will begin by removing this variable:

```
# We'll use our standardized environmental data, but we
# will remove 'das', which was correlated with many other
# variables:
<- subset(env.z, select = -das) env.z
```

**Step 2**: Run the RDA.

```
# Model the effect of all environmental variables on fish
# community composition
<- rda(spe.hel ~ ., data = env.z) spe.rda
```

**Step 3**: Extract key results of the RDA.

`summary(spe.rda)`

The first section of the summary contains the pieces we need to verify the performance of our RDA. Let us break it down:

```
...
## Partitioning of variance:
## Inertia Proportion
## Total 0.5025 1.0000
## Constrained 0.3689 0.7341
## Unconstrained 0.1336 0.2659
...
```

**Constrained Proportion**: variance of \(Y\) explained by \(X\) (73.41%)**Unconstrained Proportion**: unexplained variance in \(Y\) (26.59%)

How would you report these results? You could say: “The included environmental variables explain

73.41%of the variation in fish community composition across sites.”

The rest of the RDA summary is not printed here, because it is long. Aside from the section printed above, the summary contains:

**Eigenvalues, and their contribution to the variance**

**Accumulated constrained eigenvalues**, including the cumulative proportion of explained variance by each axis in the final RDA ordination. These axes represent the rescaled environmental variables. If you need to select a subset of axes for other analyses, you can use this cumulative proportion to select the first few axes until you reach a threshold of your choice.

**Scores**for species, sites, and the explanatory variables, which are the coordinates of each of these objects in the RDA space. The default scaling is of type 2 (we will come back to this).

### 6.2.1 Selecting variables

If we want to simplify this model, we can perform a forward selection (or backwards or stepwise). These types of selections help us select variables that are **statistically important**. *However, it is important to note that selecting variables ecologically is much more important than performing selection in this way. If a variable of ecological interest is not selected, this does not mean it has to be removed from the RDA.*

Here, we will be performing forward selection on our 11 environmental variables. To do this, we can use the `ordiR2step()`

function (or using the `forward.sel`

function of package `packfor`

):

```
# Forward selection of variables:
<- ordiR2step(rda(spe.hel ~ 1, data = env.z), # lower model limit (simple!)
fwd.sel scope = formula(spe.rda), # upper model limit (the "full" model)
direction = "forward",
R2scope = TRUE, # can't surpass the "full" model's R2
pstep = 1000,
trace = FALSE) # change to TRUE to see the selection process!
```

Here, we are essentially adding one variable at a time, and retaining it if it significantly increases the model’s adjusted \(R^2\).

**Which variables are retained by the forward selection?**

```
# Check the new model with forward-selected variables
$call fwd.sel
```

`## rda(formula = spe.hel ~ alt + oxy + dbo, data = env.z)`

**What is the adjusted R2 of the RDA with the selected variables?**

```
# Write our new model
<- rda(spe.hel ~ alt + oxy + dbo, data = env.z)
spe.rda.signif # check the adjusted R2 (corrected for the number of
# explanatory variables)
RsquareAdj(spe.rda.signif)
```

```
## $r.squared
## [1] 0.5894243
##
## $adj.r.squared
## [1] 0.5401552
```

The explanatory variables (altitude, oxygen and biological oxygen demand) now explain *59%* of the variance in \(Y\) (species abundances across sites, or community composition). When we correct for the number of variables in \(X\), the adjusted \(R^2\) tells us that three selected variables explain *54%* of the variance in species abundances.

Because the adjusted \(R^2\) is corrected for the number of explanatory variables, it is comparable across models and datasets. For this reason, you should report the adjusted \(R^2\) when writing up the result of an RDA for an article, or in a study which compares the explanatory power of different models.

### 6.2.2 Significance testing

The significance of your RDA can be tested using the function `anova.cca()`

.

`anova.cca(spe.rda.signif, step = 1000)`

```
...
## Df Variance F Pr(>F)
## Model 3 0.29619 11.963 0.001 ***
## Residual 25 0.20632
## ---
...
```

You can also test the significance of each variable with `by = "term"`

.

`anova.cca(spe.rda.signif, step = 1000, by = "term")`

```
...
## Model: rda(formula = spe.hel ~ alt + oxy + dbo, data = env.z)
## Df Variance F Pr(>F)
## alt 1 0.164856 19.9759 0.001 ***
## oxy 1 0.082426 9.9877 0.001 ***
## dbo 1 0.048909 5.9264 0.002 **
## Residual 25 0.206319
...
```

You can also test the significance of each canonical axis with `by = "axis"`

. Recall that these axes represent the variation in explanatory variables in fewer dimensions.

`anova.cca(spe.rda.signif, step = 1000, by = "axis")`

```
...
## Model: rda(formula = spe.hel ~ alt + oxy + dbo, data = env.z)
## Df Variance F Pr(>F)
## RDA1 1 0.218022 26.4181 0.001 ***
## RDA2 1 0.050879 6.1651 0.001 ***
## RDA3 1 0.027291 3.3069 0.001 ***
## Residual 25 0.206319
...
```

Our full model is statistically significant (p = 0.001), and every variable included in this model is significant as well (p = 0.001). Every canonical axis resulting from the RDA is also statistically significant (p = 0.001).

### 6.2.3 RDA plot

One of the most powerful aspects of RDA is the **simultaneous visualization** of your response and explanatory variables (*i.e.* species and environmental variables).

As with the PCA in (Workshop 9)[https://github.com/QCBSRworkshops/workshop09], there are two types of scaling:

Type 1 | Type 2 |
---|---|

Distances among objects reflect their similarities | Angles between variables reflect their correlation |

```
# Type 1 scaling
ordiplot(spe.rda.signif, scaling = 1, type = "text")
# Type 2 scaling
ordiplot(spe.rda.signif, scaling = 2, type = "text")
```

** Scaling 1** shows similarities between objects in the

**response matrix**.

- Sites (numbers) that are closer together have
*more similar*communities.

- Species that are closer together occupy more sites
*in common*.

** Scaling 2** shows the effects of

**explanatory variables**.

- Longer arrows mean this variable
*strongly*drives the variation in the community matrix.

- Arrows pointing in opposite directions have a
*negative*relationship.

- Arrows pointing in the same direction have a
*positive*relationship.

#### 6.2.3.1 Customizing RDA plots

Both `plot()`

and `ordiplot()`

make quick and simple ordination plots, but you can customize your plots by extracting scores with `scores()`

and manually setting the aesthetics of `points()`

, `text()`

, and `arrows()`

. Here is an example of a custom triplot. Feel free to play around with the colours and other parameters to make it your own!

```
# Custom triplot code!
## extract % explained by the first 2 axes
<- round(100*(summary(spe.rda.signif)$cont$importance[2, 1:2]), 2)
perc
## extract scores - these are coordinates in the RDA space
<- scores(spe.rda.signif, display="sites", choices=c(1,2), scaling=1)
sc_si <- scores(spe.rda.signif, display="species", choices=c(1,2), scaling=1)
sc_sp <- scores(spe.rda.signif, display="bp", choices=c(1, 2), scaling=1)
sc_bp
## Custom triplot, step by step
# Set up a blank plot with scaling, axes, and labels
plot(spe.rda.signif,
scaling = 1, # set scaling type
type = "none", # this excludes the plotting of any points from the results
frame = FALSE,
# set axis limits
xlim = c(-1,1),
ylim = c(-1,1),
# label the plot (title, and axes)
main = "Triplot RDA - scaling 1",
xlab = paste0("RDA1 (", perc[1], "%)"),
ylab = paste0("RDA2 (", perc[2], "%)")
)# add points for site scores
points(sc_si,
pch = 21, # set shape (here, circle with a fill colour)
col = "black", # outline colour
bg = "steelblue", # fill colour
cex = 1.2) # size
# add points for species scores
points(sc_sp,
pch = 22, # set shape (here, square with a fill colour)
col = "black",
bg = "#f2bd33",
cex = 1.2)
# add text labels for species abbreviations
text(sc_sp + c(0.03, 0.09), # adjust text coordinates to avoid overlap with points
labels = rownames(sc_sp),
col = "grey40",
font = 2, # bold
cex = 0.6)
# add arrows for effects of the expanatory variables
arrows(0,0, # start them from (0,0)
1], sc_bp[,2], # end them at the score value
sc_bp[,col = "red",
lwd = 3)
# add text labels for arrows
text(x = sc_bp[,1] -0.1, # adjust text coordinate to avoid overlap with arrow tip
y = sc_bp[,2] - 0.03,
labels = rownames(sc_bp),
col = "red",
cex = 1,
font = 2)
```

## 6.3 Challenge 1

Run an RDA to model the effects of environmental variables on mite species abundances.

The

`mite`

dataset is part of the`vegan`

package, so you do not need to have it stored as a .csv in your repository.

To get started, load the `mite`

data:

```
# Load mite species abundance data
data("mite")
# Load environmental data
data("mite.env")
```

Recall some useful functions:

```
decostand()
rda()
ordiR2step()
anova.cca()
ordiplot()
```

### 6.3.1 Challenge 1: Solution

**Step 1:** Transform and standardize the data.

```
# Hellinger transform the community data
<- decostand(mite, method = "hellinger")
mite.spe.hel
# Standardize quantitative environmental data
$SubsDens <- decostand(mite.env$SubsDens, method = "standardize")
mite.env$WatrCont <- decostand(mite.env$WatrCont, method = "standardize") mite.env
```

**Step 2:** Select environmental variables.

```
# Initial RDA with ALL of the environmental data
<- rda(mite.spe.hel ~ ., data = mite.env)
mite.spe.rda
# Forward selection of environmental variables
<- ordiR2step(rda(mite.spe.hel ~ 1, data = mite.env),
fwd.sel scope = formula(mite.spe.rda), direction = "forward", R2scope = TRUE,
pstep = 1000, trace = FALSE)
$call fwd.sel
```

```
## rda(formula = mite.spe.hel ~ WatrCont + Shrub + Substrate + Topo +
## SubsDens, data = mite.env)
```

**Step 3:** Run the RDA and check its explanatory power.

```
# Re-run the RDA with the significant variables
<- rda(mite.spe.hel ~ WatrCont + Shrub +
mite.spe.rda.signif + Topo + SubsDens, data = mite.env)
Substrate
# Find the adjusted R2 of the model with the retained env
# variables
RsquareAdj(mite.spe.rda.signif)$adj.r.squared
```

`## [1] 0.4367038`

**Step 4:** Test model significance.

`anova.cca(mite.spe.rda.signif, step = 1000)`

```
## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
##
## Model: rda(formula = mite.spe.hel ~ WatrCont + Shrub + Substrate + Topo + SubsDens, data = mite.env)
## Df Variance F Pr(>F)
## Model 11 0.20759 5.863 0.001 ***
## Residual 58 0.18669
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

We find that four explanatory variables are retained after forward selection: WatrCont, Shrub, Substrate, and Topo. The selected environmental variables significantly explain **43.7% (p = 0.001)** of the variation in mite species abundances.

**Step 5:** Plot the RDA results!

```
# Scaling 1
ordiplot(mite.spe.rda.signif, scaling = 1, main = "Mite RDA - Scaling 1")
# Scaling 2
ordiplot(mite.spe.rda.signif, scaling = 2, main = "Mite RDA - Scaling 2")
```

** Scaling 1** shows similarities between objects in the

**response matrix**.

- The sites (black circles) are overall similar

- A few species (red +) stand out from the cluster near the middle, meaning they do not occupy many sites in common with other species. These species might therefore be rare or unique in some ecological way.

** Scaling 2** shows the effects of

**explanatory variables**.

- Substrate density and water content have long arrows, and therefore strong effects
- Shrubs have an opposite effect to WatrCont and SubsDens, because Shrub arrows are in opposite directions to these variables.
- Sites vary a lot in terms of SubsDens, and less in terms of WatrCont.

### References

*Numerical Ecology*. Elsevier.