Chapter 6 Redundancy analysis

Redundancy Analysis (RDA) is a direct extension of multiple regression, as it models the effect of an explanatory matrix \(X\) (n x p) on a response matrix \(Y\) (n x m). The difference here is that we can model effect of an explanatory matrix on a response matrix, rather than a single response variable. For example, RDA allows us to model the effect of environmental variables on the entire community, rather than on species richness. This is done by performing an ordination of \(Y\) to obtain ordination axes that are linear combinations of the variables in \(X\).

The basic structure of a redundancy analysis (RDA).

Figure 6.1: The basic structure of a redundancy analysis (RDA).

Note that the explanatory variables in \(X\) can be quantitative, qualitative or binary variables. If they are quantitative, explanatory variables in \(X\) must be centered, standardized (if explanatory variables are in different units), transformed (to limit the skew of explanatory variables) or normalized (to linearize relationships) following the same principles as in PCA. Collinearity between the \(X\) variables should also be reduced before RDA. We began this process while exploring the data: our community data is Hellinger-transformed, and our environmental variables are centered and standardized. However, we still have some collinearity issues that have not been solved.

Sometimes, we have more explanatory variables than we need to understand the drivers of our response variable. The best way to build a model is always to use ecological reasoning when determining which variables should be included or excluded. However, if there are still too many variables included in the model, or if some of them are highly collinear, explanatory variables can be selected by forward, backward or stepwise selection that remove non-significant explanatory variables. However, this approach should always be taken after the ecological selection of variables, according to your knowledge of the system. We will see more about this later!

6.1 How the RDA works

Redundancy analysis as a two-step process (P. Legendre and Legendre 2012). The first step is a multiple regression, where each object in \(Y\) is regressed on the explanatory variables in \(X\), which results in a matrix of fitted values \(Y_{fit}\). This step is calculated through the following linear equation:

\[Y_{fit} = X[X'X]^{-1}X'Y\]

In the second step, we apply a principal components analysis (PCA) on the fitted matrix \(Y_{fit}\) to reduce dimensionality, i.e. to obtain the eigenvalues and eigenvectors. We then obtain a matrix \(Z\) which contains the canonical axes, which correspond to linear combinations of the explanatory variables in the space of \(X\). The linearity of the combinations of the \(X\) variables is a fundamental property of RDA. In the analysis of community composition, these canonical axes are interpreted as complex environmental gradients.

The computation process of an RDA, from Legendre & Legendre (2012).

Figure 6.2: The computation process of an RDA, from Legendre & Legendre (2012).

Once the RDA is computed, several statistics can be computed to interpret the explanatory power of the included variables and whether the observed relationships are significant. These include:

  • \(R^2\), which measures the strength of the canonical relationship between \(Y\) and \(X\) by calculating the proportion of the variation of \(Y\) explained by the variables in \(X\),
  • Adjusted \(R^2\), which also measures the strength of the relationship between \(Y\) and \(X\), but applies a correction of the \(R^2\) to take into account the number of explanatory variables. This is the statistic that should be reported.
  • The F-statistic corresponds to an overall test of significance of an RDA by comparing the computed model to a null model. This test is based on the null hypothesis that the strength of the linear relationship calculated by the \(R^2\) is not larger than the value that would be obtained for unrelated \(Y\) and \(X\) matrices of the same size. Note that F-statistics can also be used to sequentially test the significance of each canonical axis.

6.2 Running an RDA in R

An RDA can be computed using the function rda() from the vegan package, as follows:

Step 1: Standardize and/or transform the data.

We already applied a Hellinger transformation to our community matrix, and standardized our explanatory variables in the section: 4. However, we noticed that the variable das was collinear with several other variables. We will begin by removing this variable:

# We'll use our standardized environmental data, but we
# will remove 'das', which was correlated with many other
# variables:
env.z <- subset(env.z, select = -das)

Step 2: Run the RDA.

# Model the effect of all environmental variables on fish
# community composition
spe.rda <- rda(spe.hel ~ ., data = env.z)

Step 3: Extract key results of the RDA.

summary(spe.rda)

The first section of the summary contains the pieces we need to verify the performance of our RDA. Let us break it down:

...
## Partitioning of variance:
##               Inertia Proportion
## Total          0.5025     1.0000
## Constrained    0.3689     0.7341
## Unconstrained  0.1336     0.2659
...
  • Constrained Proportion: variance of \(Y\) explained by \(X\) (73.41%)
  • Unconstrained Proportion: unexplained variance in \(Y\) (26.59%)

How would you report these results? You could say: “The included environmental variables explain 73.41% of the variation in fish community composition across sites.”

The rest of the RDA summary is not printed here, because it is long. Aside from the section printed above, the summary contains:

  • Eigenvalues, and their contribution to the variance
  • Accumulated constrained eigenvalues, including the cumulative proportion of explained variance by each axis in the final RDA ordination. These axes represent the rescaled environmental variables. If you need to select a subset of axes for other analyses, you can use this cumulative proportion to select the first few axes until you reach a threshold of your choice.
  • Scores for species, sites, and the explanatory variables, which are the coordinates of each of these objects in the RDA space. The default scaling is of type 2 (we will come back to this).

6.2.1 Selecting variables

If we want to simplify this model, we can perform a forward selection (or backwards or stepwise). These types of selections help us select variables that are statistically important. However, it is important to note that selecting variables ecologically is much more important than performing selection in this way. If a variable of ecological interest is not selected, this does not mean it has to be removed from the RDA.

Here, we will be performing forward selection on our 11 environmental variables. To do this, we can use the ordiR2step() function (or using the forward.sel function of package packfor):

# Forward selection of variables:
fwd.sel <- ordiR2step(rda(spe.hel ~ 1, data = env.z), # lower model limit (simple!)
               scope = formula(spe.rda), # upper model limit (the "full" model)
               direction = "forward",
               R2scope = TRUE, # can't surpass the "full" model's R2
               pstep = 1000,
               trace = FALSE) # change to TRUE to see the selection process!

Here, we are essentially adding one variable at a time, and retaining it if it significantly increases the model’s adjusted \(R^2\).

Which variables are retained by the forward selection?

# Check the new model with forward-selected variables
fwd.sel$call
## rda(formula = spe.hel ~ alt + oxy + dbo, data = env.z)

What is the adjusted R2 of the RDA with the selected variables?

# Write our new model
spe.rda.signif <- rda(spe.hel ~ alt + oxy + dbo, data = env.z)
# check the adjusted R2 (corrected for the number of
# explanatory variables)
RsquareAdj(spe.rda.signif)
## $r.squared
## [1] 0.5894243
## 
## $adj.r.squared
## [1] 0.5401552

The explanatory variables (altitude, oxygen and biological oxygen demand) now explain 59% of the variance in \(Y\) (species abundances across sites, or community composition). When we correct for the number of variables in \(X\), the adjusted \(R^2\) tells us that three selected variables explain 54% of the variance in species abundances.

Because the adjusted \(R^2\) is corrected for the number of explanatory variables, it is comparable across models and datasets. For this reason, you should report the adjusted \(R^2\) when writing up the result of an RDA for an article, or in a study which compares the explanatory power of different models.

6.2.2 Significance testing

The significance of your RDA can be tested using the function anova.cca().

anova.cca(spe.rda.signif, step = 1000)
...
##          Df Variance      F Pr(>F)    
## Model     3  0.29619 11.963  0.001 ***
## Residual 25  0.20632                  
## ---
...

You can also test the significance of each variable with by = "term".

anova.cca(spe.rda.signif, step = 1000, by = "term")
...
## Model: rda(formula = spe.hel ~ alt + oxy + dbo, data = env.z)
##          Df Variance       F Pr(>F)    
## alt       1 0.164856 19.9759  0.001 ***
## oxy       1 0.082426  9.9877  0.001 ***
## dbo       1 0.048909  5.9264  0.002 ** 
## Residual 25 0.206319                   
...

You can also test the significance of each canonical axis with by = "axis". Recall that these axes represent the variation in explanatory variables in fewer dimensions.

anova.cca(spe.rda.signif, step = 1000, by = "axis")
...
## Model: rda(formula = spe.hel ~ alt + oxy + dbo, data = env.z)
##          Df Variance       F Pr(>F)    
## RDA1      1 0.218022 26.4181  0.001 ***
## RDA2      1 0.050879  6.1651  0.001 ***
## RDA3      1 0.027291  3.3069  0.001 ***
## Residual 25 0.206319                   
...

Our full model is statistically significant (p = 0.001), and every variable included in this model is significant as well (p = 0.001). Every canonical axis resulting from the RDA is also statistically significant (p = 0.001).

6.2.3 RDA plot

One of the most powerful aspects of RDA is the simultaneous visualization of your response and explanatory variables (i.e. species and environmental variables).

As with the PCA in (Workshop 9)[https://github.com/QCBSRworkshops/workshop09], there are two types of scaling:

Type 1 Type 2
Distances among objects reflect their similarities Angles between variables reflect their correlation
# Type 1 scaling
ordiplot(spe.rda.signif, scaling = 1, type = "text")
# Type 2 scaling
ordiplot(spe.rda.signif, scaling = 2, type = "text")

Scaling 1 shows similarities between objects in the response matrix.

  • Sites (numbers) that are closer together have more similar communities.
  • Species that are closer together occupy more sites in common.

Scaling 2 shows the effects of explanatory variables.

  • Longer arrows mean this variable strongly drives the variation in the community matrix.
  • Arrows pointing in opposite directions have a negative relationship.
  • Arrows pointing in the same direction have a positive relationship.

6.2.3.1 Customizing RDA plots

Both plot() and ordiplot() make quick and simple ordination plots, but you can customize your plots by extracting scores with scores() and manually setting the aesthetics of points(), text(), and arrows(). Here is an example of a custom triplot. Feel free to play around with the colours and other parameters to make it your own!

# Custom triplot code!

## extract % explained by the first 2 axes
perc <- round(100*(summary(spe.rda.signif)$cont$importance[2, 1:2]), 2)

## extract scores - these are coordinates in the RDA space
sc_si <- scores(spe.rda.signif, display="sites", choices=c(1,2), scaling=1)
sc_sp <- scores(spe.rda.signif, display="species", choices=c(1,2), scaling=1)
sc_bp <- scores(spe.rda.signif, display="bp", choices=c(1, 2), scaling=1)

## Custom triplot, step by step

# Set up a blank plot with scaling, axes, and labels
plot(spe.rda.signif,
     scaling = 1, # set scaling type 
     type = "none", # this excludes the plotting of any points from the results
     frame = FALSE,
     # set axis limits
     xlim = c(-1,1), 
     ylim = c(-1,1),
     # label the plot (title, and axes)
     main = "Triplot RDA - scaling 1",
     xlab = paste0("RDA1 (", perc[1], "%)"), 
     ylab = paste0("RDA2 (", perc[2], "%)") 
)
# add points for site scores
points(sc_si, 
       pch = 21, # set shape (here, circle with a fill colour)
       col = "black", # outline colour
       bg = "steelblue", # fill colour
       cex = 1.2) # size
# add points for species scores
points(sc_sp, 
       pch = 22, # set shape (here, square with a fill colour)
       col = "black",
       bg = "#f2bd33", 
       cex = 1.2)
# add text labels for species abbreviations
text(sc_sp + c(0.03, 0.09), # adjust text coordinates to avoid overlap with points 
     labels = rownames(sc_sp), 
     col = "grey40", 
     font = 2, # bold
     cex = 0.6)
# add arrows for effects of the expanatory variables
arrows(0,0, # start them from (0,0)
       sc_bp[,1], sc_bp[,2], # end them at the score value
       col = "red", 
       lwd = 3)
# add text labels for arrows
text(x = sc_bp[,1] -0.1, # adjust text coordinate to avoid overlap with arrow tip
     y = sc_bp[,2] - 0.03, 
     labels = rownames(sc_bp), 
     col = "red", 
     cex = 1, 
     font = 2)

6.3 Challenge 1

Run an RDA to model the effects of environmental variables on mite species abundances.

The mite dataset is part of the vegan package, so you do not need to have it stored as a .csv in your repository.

To get started, load the mite data:

# Load mite species abundance data
data("mite")

# Load environmental data
data("mite.env")

Recall some useful functions:

decostand()
rda()
ordiR2step()
anova.cca()
ordiplot()

6.3.1 Challenge 1: Solution

Step 1: Transform and standardize the data.

# Hellinger transform the community data
mite.spe.hel <- decostand(mite, method = "hellinger")

# Standardize quantitative environmental data
mite.env$SubsDens <- decostand(mite.env$SubsDens, method = "standardize")
mite.env$WatrCont <- decostand(mite.env$WatrCont, method = "standardize")

Step 2: Select environmental variables.

# Initial RDA with ALL of the environmental data
mite.spe.rda <- rda(mite.spe.hel ~ ., data = mite.env)

# Forward selection of environmental variables
fwd.sel <- ordiR2step(rda(mite.spe.hel ~ 1, data = mite.env),
    scope = formula(mite.spe.rda), direction = "forward", R2scope = TRUE,
    pstep = 1000, trace = FALSE)
fwd.sel$call
## rda(formula = mite.spe.hel ~ WatrCont + Shrub + Substrate + Topo + 
##     SubsDens, data = mite.env)

Step 3: Run the RDA and check its explanatory power.

# Re-run the RDA with the significant variables
mite.spe.rda.signif <- rda(mite.spe.hel ~ WatrCont + Shrub +
    Substrate + Topo + SubsDens, data = mite.env)

# Find the adjusted R2 of the model with the retained env
# variables
RsquareAdj(mite.spe.rda.signif)$adj.r.squared
## [1] 0.4367038

Step 4: Test model significance.

anova.cca(mite.spe.rda.signif, step = 1000)
## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = mite.spe.hel ~ WatrCont + Shrub + Substrate + Topo + SubsDens, data = mite.env)
##          Df Variance     F Pr(>F)    
## Model    11  0.20759 5.863  0.001 ***
## Residual 58  0.18669                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We find that four explanatory variables are retained after forward selection: WatrCont, Shrub, Substrate, and Topo. The selected environmental variables significantly explain 43.7% (p = 0.001) of the variation in mite species abundances.

Step 5: Plot the RDA results!

# Scaling 1
ordiplot(mite.spe.rda.signif, scaling = 1, main = "Mite RDA - Scaling 1")
# Scaling 2
ordiplot(mite.spe.rda.signif, scaling = 2, main = "Mite RDA - Scaling 2")

Scaling 1 shows similarities between objects in the response matrix.

  • The sites (black circles) are overall similar
  • A few species (red +) stand out from the cluster near the middle, meaning they do not occupy many sites in common with other species. These species might therefore be rare or unique in some ecological way.

Scaling 2 shows the effects of explanatory variables.

  • Substrate density and water content have long arrows, and therefore strong effects
  • Shrubs have an opposite effect to WatrCont and SubsDens, because Shrub arrows are in opposite directions to these variables.
  • Sites vary a lot in terms of SubsDens, and less in terms of WatrCont.

References

Legendre, P., and Louis Legendre. 2012. Numerical Ecology. Elsevier.