Workshop 9: Multivariate analyses

class: center, middle, inverse, title-slide

# Workshop 9: Multivariate analyses
## QCBS R Workshop Series
### Québec Centre for Biodiversity Science

---

class: inverse, center, middle

# 1. Introduction
## What is ordination?

---
# One Dimension

What if we are interested in this response (density) for each of the algal species involved in the bloom?

.center[![:scale 70%](images/algalBloom.png)]

---
# Two Dimensions

.center[![:scale 70%](images/2dim.png)]

---
# Three Dimensions

.center[
![:scale 70%](images/3dim.png)]

---
# 4, 5, 6, or more Dimensions

.center[![:scale 70%](images/4dim.png)]

---
# Ordination in reduced space

.center[![:scale 70%](images/Ord1.png)]

---
# Ordination in reduced space

.center[![:scale 70%](images/Ord2.png)]

- Matrix algebra is complex and hard to understand

- A global understanding is enough in order to use ordination methods adequately

---
# Methods for scientific research

--
- **Questions / Hypotheses**
--

- **Experimental design**

--
- **Data Collection**
--

- **Transformation / Distance**
--

- **Analysis**
--

- **Writing**
--

- **Communication**

---
class: inverse, center, middle
# 2. Exploring data

---
# Doubs River Fish Dataset

.pull-left[

Verneaux (1973) dataset:
- characterization of fish communities
- 27 different species
- 30 different sites
- 11 environmental variables

]

.pull-right[
![:scale 50%](images/DoubsRiver.png)
]

---
# Doubs River Fish Dataset

Load the Doubs River species data (`doubsspe.csv`)

```r
spe <- read.csv("data/doubsspe.csv", row.names = 1)
spe <- spe[-8, ] # remove site 8 (row 8), which has no data
```

Load the Doubs River environmental data (`doubsenv.csv`)

```r
env <- read.csv("data/doubsenv.csv", row.names = 1)
env <- env[-8, ] # remove site 8 (row 8), which has no data
```

.alert[Proceed with caution, only execute once]

---
# Explore Doubs Dataset

Explore the content of the fish community dataset

```r
names(spe) # Names of objects
dim(spe) # dimensions
str(spe) # structure of objects
summary(spe) # summary statistics
head(spe) # first 6 rows
```

```
#   CHA TRU VAI LOC OMB BLA HOT VAN CHE BAR SPI GOU BRO PER BOU PSO ROT CAR
# 1   0   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0   0
# 2   0   5   4   3   0   0   0   0   0   0   0   0   0   0   0   0   0   0
# 3   0   5   5   5   0   0   0   0   0   0   0   0   1   0   0   0   0   0
# 4   0   4   5   5   0   0   0   0   1   0   0   1   2   2   0   0   0   0
# 5   0   2   3   2   0   0   0   5   2   0   0   2   4   4   0   0   2   0
# 6   0   3   4   5   0   0   0   1   2   0   0   1   1   1   0   0   0   0
#   TAN BCO PCH GRE GAR BBO ABL ANG
# 1   0   0   0   0   0   0   0   0
# 2   0   0   0   0   0   0   0   0
# 3   0   0   0   0   0   0   0   0
# 4   1   0   0   0   0   0   0   0
# 5   3   0   0   0   5   0   0   0
# 6   2   0   0   0   1   0   0   0
```

---
# Species Frequencies

Take a look at the distribution of species frequencies

```r
ab <- table(unlist(spe))
barplot(ab, las = 1, col = grey(5:0/5),
 xlab = "Abundance class", ylab = "Frequency")
```

.alert[Note the proportion of 0s]

---
# Species Frequencies

How many zeros?

```r
sum(spe == 0)
# [1] 416
```

What proportion of zeros?

```r
sum(spe == 0)/(nrow(spe)*ncol(spe))
# [1] 0.5333333
```

---
# Total Species Richness

Visualize how many species are present at each site:

```r
site.pre <- rowSums(spe > 0)
barplot(site.pre, main = "Species richness",
 xlab = "Sites", ylab = "Number of species",
 col = "grey", las = 1)
```

---
# Understand your data!

.center[...to choose the appropriate transformation and distance]

- Are there many zeros?

- What do they mean?

.alert[A measured 0 (e.g. 0 mg/L, 0°C) is not the same as a 0 representing an absence of observation]

---
# Before transforming your community data...

.alert[Important considerations:]

--
- relative abundances/counts/presence-absence?

--
- asymmetrical distributions?

--
- many rare species?

--
- overabundance of dominant species?

--
- double-zero problem?

---
# Transforming community data

.center[
![](images/trans1.png)]

---
# Transforming your community data

## Examples

Transforming counts into presence-absence

```r
library(vegan) # provides decostand(), vegdist() and rda()
spec.pa <- decostand(spe, method = "pa")
```

Reducing the weight of rare species

```r
spec.hel <- decostand(spe,method = "hellinger")
spec.chi <- decostand(spe,method = "chi.square")
```

Reducing the weight of very abundant species

```r
spec.log <- decostand(spe, method = "log")
```
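
---
# Transforming your community data

## What the transformations do

To make the transformations above less of a black box, here is a minimal sketch in base R that reproduces the presence-absence and Hellinger transformations by hand. The matrix `toy` and its counts are hypothetical and are not part of the Doubs data.

```r
# Toy community matrix (hypothetical counts): 3 sites x 4 species
toy <- matrix(c(10, 0, 5, 5,
                 0, 0, 2, 8,
                 1, 1, 1, 1),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("site", 1:3), paste0("sp", 1:4)))

# Presence-absence: any count above zero becomes 1
toy.pa <- (toy > 0) * 1

# Hellinger: square root of the relative abundance within each site
toy.hel <- sqrt(toy / rowSums(toy))

# After the Hellinger transformation, every site vector has unit length
rowSums(toy.hel^2)
```

Because each site is rescaled by its own total, a very abundant species can no longer dominate the distances between sites.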

---
# Doubs Environmental Data

```r
names(env) # Names of objects
dim(env) # dimensions
str(env) # structure of objects
summary(env) # summary statistics
head(env) # first 6 rows
```

```r
head(env) # first 6 rows
#    das alt  pen  deb  pH dur  pho  amm  oxy dbo
# 1  0.3 934 48.0 0.84 7.9  45 0.01 0.00 12.2 2.7
# 2  2.2 932  3.0 1.00 8.0  40 0.02 0.10 10.3 1.9
# 3 10.2 914  3.7 1.80 8.3  52 0.05 0.05 10.5 3.5
# 4 18.5 854  3.2 2.53 8.0  72 0.10 0.00 11.0 1.3
# 5 21.5 849  2.3 2.64 8.1  84 0.38 0.20  8.0 6.2
# 6 32.4 846  3.2 2.86 7.9  60 0.20 0.00 10.2 5.3
```

Explore collinearity by visualizing correlations between variables

```r
pairs(env, main = "Bivariate Plots of the Environmental Data")
```

---
# Doubs Environmental Data

![](images/EnvDat1.png)

---
# Standardization

Standardizing environmental variables is crucial as you cannot compare the effects of variables with different units

```r
## ?decostand
env.z <- decostand(env, method = "standardize")
```

This centers each variable on a mean of 0 and scales it to a standard deviation of 1, so that variables measured in different units become comparable

```r
apply(env.z, 2, mean)
#           das           alt           pen           deb            pH 
#  1.000429e-16  1.814232e-18 -1.659010e-17  1.233099e-17 -4.096709e-15 
#           dur           pho           amm           oxy           dbo 
#  3.348595e-16  1.327063e-17 -4.289646e-17 -2.886092e-16  7.656545e-17
apply(env.z, 2, sd)
# das alt pen deb  pH dur pho amm oxy dbo 
#   1   1   1   1   1   1   1   1   1   1
```
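
---
# Standardization

The same z-score standardization can be sketched in base R, which may help to see what happens to each column. The table `toy.env` is an illustration only: it reuses three columns from the first five rows printed by `head(env)`.

```r
# Three variables with very different units (values from the first rows of env)
toy.env <- data.frame(alt = c(934, 932, 914, 854, 849),
                      pH  = c(7.9, 8.0, 8.3, 8.0, 8.1),
                      oxy = c(12.2, 10.3, 10.5, 11.0, 8.0))

# z-score: subtract the column mean, divide by the column standard deviation
toy.z <- sapply(toy.env, function(x) (x - mean(x)) / sd(x))

# scale() in base R gives the same result
all.equal(toy.z, scale(toy.env), check.attributes = FALSE)

round(colMeans(toy.z), 10)  # means are 0, up to rounding error
apply(toy.z, 2, sd)         # standard deviations are 1
```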

---
class: inverse, center, middle
# 3. Similarity / Dissimilarity

---
# Association measures

Matrix algebra is at the heart of all ordinations

.center[![](images/MatrixAlgebra.png)]

- Exploring various measures of distance between objects provides some understanding of the engine under the hood

---
# Breaking out of 1D

.pull-left[

- As you have seen, ecological datasets can sometimes be very large matrices

- Ordinations compute the relationships between species or between sites

- We can simplify these relationships using methods of dissimilarity
]

.pull-right[

![:scale 40%](images/PCAMatrix.png)

![:scale 40%](images/distMes.png)

![:scale 40%](images/distMat.png)
]

---
# Similarity / Dissimilarity

- Useful to understand your dataset
- Appropriate measure required by some types of ordinations

.center[
Similarity: S = 1 - D

Distance: D = 1 - S]

![](images/similarity.png)

---
# Community distance measures

.pull-left[
- Euclidean
- Manhattan
- Chord
]

.pull-right[
- Hellinger
- Chi-square
- Bray-Curtis
]

--

.alert[Each of these will be useful in different situations]

---
# Comparing Doubs Sites

The `vegdist()` function in the vegan package computes many common dissimilarity measures (see the `method` argument in its help page)

```r
?vegdist
```

How different is the community composition across the 30 sites of the Doubs River?

```r
spe.db.pa <- vegdist(spe, method = "bray")
```
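
---
# Comparing Doubs Sites

As a rough illustration of what a dissimilarity measures, the Bray-Curtis index between two sites can be computed by hand in base R. The vectors `site1` and `site2` are hypothetical abundances, not Doubs data.

```r
# Two sites, hypothetical abundances of 4 species
site1 <- c(0, 3, 0, 5)
site2 <- c(0, 5, 4, 1)

# Bray-Curtis: sum of absolute differences divided by the total abundance
bray <- sum(abs(site1 - site2)) / sum(site1 + site2)
bray

# Euclidean distance between the same two sites, for comparison
dist(rbind(site1, site2), method = "euclidean")
```

The first species is absent from both sites (a double zero) and adds nothing to the numerator or the denominator of the Bray-Curtis index.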

---
# Comparing Doubs Sites

.center[![](images/Doubs1.png)]

---
# Comparing Doubs Sites

.center[![](images/Doubs2.png)]

---
# Visualization of distance matrices

.center[![](images/MatrxViz.png)]

---
# Challenge #1 ![:cube]()

Discuss with your neighbor:

.center[**How can we tell how similar objects are when we have multivariate data?**]

- Make a list of all your suggestions

---
# And what about ordination?

With ordination methods, we order our objects (sites) according to their similarity

- The more similar the sites are, the closer they are in the ordination space (smaller distances)

- In ecology, we usually calculate the similarity between sites according to their species composition or their environmental conditions.

---
# Schematic of multivariate analysis

.center[
![:scale 70%](images/Schema1.png)]

---
# Clustering

- To highlight structures in the data by partitioning either objects or the descriptors

- Results are represented as dendrograms (trees)

- Not a statistical method: it does not test a hypothesis

.center[
![:scale 80%](images/cluster1.png)]

---
# Overview of 3 hierarchical methods

- Single linkage agglomerative clustering

- Complete linkage agglomerative clustering

- Ward's minimum variance clustering

- Elements of lower-ranking clusters are nested in higher-ranking clusters
   - (e.g. species, genus, family, order)

---
# Hierarchical methods

A distance matrix is first sorted in increasing distance order

![](images/Hierachic1.png)

---
# Single linkage clustering

.pull-left[

![:scale 50%](images/singleClust1.png)
--

]

.pull-right[

- The two closest objects merge

- The next two closest objects/clusters merge

- and so on

![](images/singleClust2.png)

]

---
# Complete linkage clustering

.pull-left[

![:scale 50%](images/compleClust1.png)

]

.pull-right[

- The two closest objects merge

- An object or cluster then joins a group only at the distance to the furthest member of that group

![](images/compleClust2.png)
]

---
# Comparison

Create a distance matrix from the Hellinger-transformed Doubs River data and compute the single linkage clustering

```r
spe.dhe1 <- vegdist(spec.hel, method = "euclidean")
spe.dhe1.single <- hclust(spe.dhe1, method = "single")
plot(spe.dhe1.single)
```
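
---
# Comparison

To see how the two linkage rules differ, the sketch below runs both on a small toy table in base R. The seven sites and their abundances are hypothetical, and `dist()` stands in for `vegdist()` so that the block runs without vegan.

```r
# Seven sites, two species (hypothetical abundances)
toy <- data.frame(sp1 = c(7, 4, 12, 23, 13, 15, 18),
                  sp2 = c(3, 3, 10, 11, 13, 16, 14),
                  row.names = LETTERS[1:7])
toy.d <- dist(toy)  # Euclidean distances

toy.single   <- hclust(toy.d, method = "single")
toy.complete <- hclust(toy.d, method = "complete")

# Fusion levels: both methods start at the smallest distance,
# but complete linkage ends at the largest distance in the matrix
round(toy.single$height, 2)
round(toy.complete$height, 2)
range(toy.d)
```

With the objects of this workshop, the complete linkage counterpart of the previous slide would presumably be `hclust(spe.dhe1, method = "complete")`, and `plot()` draws either dendrogram.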

---
# Comparison

![](images/comparison.png)

.pull-left[

**Single linkage:**

Chains of objects occur (e.g. 19, 29, 30, 26)
]

.pull-right[

**Complete linkage:**
Contrasted groups of objects are formed
]

---
# Ward's minimum variance method

- Uses the criterion of least squares to cluster objects into groups
  - At each step, the pair of clusters merging is the one leading to the minimum increase in total within-group sum of squares

---
# Ward's method

Compute the Ward's minimum variance clustering and plot the dendrogram by using the square root of the distances:

```r
spe.dhel.ward <- hclust(spe.dhe1, method = "ward.D2")
spe.dhel.ward$height <- sqrt(spe.dhel.ward$height)
plot(spe.dhel.ward, hang = -1) # hang = -1 aligns objects at the same level
```

---
# Ward's method

Clusters generated using this method tend to be more spherical and to contain similar numbers of objects

---
# How to choose the right method?

- Depends on the objective
  - highlight gradients? contrasts?
- If more than one method seems appropriate, compare the dendrograms
- Again: this is **not** a statistical method
- However, it is possible to:
  - determine the optimal number of interpretable clusters
  - compute clustering statistics
  - combine clustering with ordination to distinguish groups of sites
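
---
# From dendrogram to groups

One possible way to turn a dendrogram into groups of sites is `cutree()` from base R, sketched here on hypothetical data. Choosing `k = 2` is an arbitrary assumption for the example, not a recommendation.

```r
# Seven sites, two species (hypothetical abundances)
toy <- data.frame(sp1 = c(7, 4, 12, 23, 13, 15, 18),
                  sp2 = c(3, 3, 10, 11, 13, 16, 14),
                  row.names = LETTERS[1:7])

toy.ward <- hclust(dist(toy), method = "ward.D2")

# Cut the tree into k groups and look at the membership
grp <- cutree(toy.ward, k = 2)
table(grp)              # number of sites per group
split(names(grp), grp)  # which sites fall in which group
```

The vector `grp` can then be used to colour the sites in an ordination plot.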

---
class: inverse, center, middle
# 4. Unconstrained ordination

---
# Definitions

- **Variance:** measure of the dispersion of a variable **y** *j* around its mean
--

- **Covariance:** measure of the co-dispersion of two variables **y** *j* and **y** *k* around their means

--
- **Correlation:** measure of the strength of the link between 2 variables: r*jk* = s*jk* / (s*j* · s*k*), i.e. their covariance divided by the product of their standard deviations

--
- **Eigenvalue:** amount of variance (dispersion) represented by one ordination axis; divided by the total variance, it gives the proportion of variance represented by that axis.

--
- **Orthogonality:** right angle between 2 axes or 2 arrows, which means that they are independent (uncorrelated).

--
- **Score:** position of a point on an axis. All the scores of a point give its coordinates in the multidimensional space. Scores can be used as new variables in other analyses (each axis is a linear combination of the measured variables).

--
- **Dispersion** (inertia): measure of the total variability of the scatter plot (descriptors) in the multidimensional space with regard to its center of gravity.
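
---
# Definitions in practice

The definitions above can be checked numerically on a tiny example. This is only a sketch in base R: `y1` and `y2` are hypothetical abundances of two species at seven sites.

```r
# Abundances of two species at seven sites (hypothetical values)
y1 <- c(7, 4, 12, 23, 13, 15, 18)
y2 <- c(3, 3, 10, 11, 13, 16, 14)

var(y1)                          # variance of species 1
cov(y1, y2)                      # covariance between the two species
cov(y1, y2) / (sd(y1) * sd(y2))  # correlation, same value as cor(y1, y2)

# Eigenvalues of the covariance matrix: variance carried by each new axis
ev <- eigen(cov(cbind(y1, y2)))$values
ev / sum(ev)  # proportion of the total variance per axis
sum(ev)       # total dispersion, equal to var(y1) + var(y2)
```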

---
# Unconstrained ordination

- Assess relationships **within** a set of variables (species or environmental variables), not **between** sets (i.e. constrained analysis)

- Find key components of variation between samples, sites, species, etc.

- Reduce the number of dimensions in multivariate data without substantial loss of information

- Create new variables for use in subsequent analysis (such as regression)

---
# 4.1. Principal Component Analysis (PCA)

.center[
![:scale 90%](images/Ord1.png)]

- Preserves, in 2D, the maximum amount of variation in the data
- The resulting, synthetic variables are orthogonal (and therefore uncorrelated)

---
# PCA - What you need

- A set of variables that are response variables (e.g. community composition) OR explanatory variables (e.g. environmental variables)

**NOT BOTH!**

.pull-left[
- Samples that are measured for the same set of variables
- Generally a dataset that is longer than it is wide is preferred
]

.pull-right[
![:scale 80%](images/PCAMatrix.png)
]

---
# PCA - Walkthrough

|Site|Species 1| Species 2|
|---|------|------|
|A|7|3|
|B|4|3|
|C|12|10|
|D|23|11|
|E|13|13|
|F|15|16|
|G|18|14|

.alert[ A simplified example ]

---
# PCA - Walkthrough

.center[
![:scale 70%](images/DispPlot.png)]

.small[
.alert[In 2D, we would plot the sites like this... Notice the dispersion in the scatterplot]]

---
# PCA - Walkthrough

.center[
![:scale 60%](images/PCA1.png)]

.small[
.alert[Our first component is essentially drawn through the maximum amount of observed variation... or the best fit line through the points]]

---
# PCA - Walkthrough

.center[
![:scale 70%](images/PCA2.png)]
.small[
.alert[A second principal component is then added perpendicular (90 degrees in 2D) to the first axis]]

---
# PCA - Walkthrough

.center[
![:scale 70%](images/PCA3.png)]

.small[The final plot shows the data rotated onto the two PC axes: the axes are now principal components rather than species]
---
# PCA - Multidimensional case

- **PC1** --> axis that maximizes the variance of the points that are projected perpendicularly onto the axis.
- **PC2** --> must be perpendicular to PC1, but the direction is again the one in which variance is maximized when points are perpendicularly projected
- **PC3** --> and so on: perpendicular to the first two axes

.alert[When there are more than two dimensions, PCA produces a new space in which all PCA axes are orthogonal (i.e. uncorrelated) and ordered according to the percentage of variance of the original data that they explain]
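
---
# PCA - Checking the properties

The properties listed above can be verified on the seven-site walkthrough table. The sketch uses `prcomp()` from base R rather than the vegan function introduced next, and the object names `toy` and `toy.pca` are made up for the example.

```r
# The seven sites of the walkthrough table
toy <- data.frame(sp1 = c(7, 4, 12, 23, 13, 15, 18),
                  sp2 = c(3, 3, 10, 11, 13, 16, 14),
                  row.names = LETTERS[1:7])

toy.pca <- prcomp(toy)  # centers the data, no scaling by default

toy.pca$sdev^2             # variance along PC1 and PC2, in decreasing order
toy.pca$rotation           # direction of each axis in species space
round(cor(toy.pca$x), 10)  # site scores on PC1 and PC2 are uncorrelated

# The rotation keeps the total variance unchanged
sum(toy.pca$sdev^2)
sum(sapply(toy, var))
```

Calling `biplot(toy.pca)` would display the sites and species in the rotated space.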

---
# PCA - Let's try it on Fish Species!

- For both PCA and RDA, we will be using the `rda()` function in the vegan package

- Run a PCA on the Hellinger-transformed fish data and extract the results

```r
spe.h.pca <- rda(spec.hel)

summary(spe.h.pca)
# 
# Call:
# rda(X = spec.hel) 
# 
# Partitioning of variance:
#               Inertia Proportion
# Total          0.4978          1
# Unconstrained  0.4978          1
# 
# Eigenvalues, and their contribution to the variance 
# 
# Importance of components:
#                          PC1     PC2     PC3     PC4     PC5     PC6
# Eigenvalue            0.2491 0.06455 0.04615 0.03717 0.02148 0.01617
# Proportion Explained  0.5003 0.12968 0.09271 0.07468 0.04315 0.03249
# Cumulative Proportion 0.5003 0.63002 0.72272 0.79740 0.84055 0.87304
#                           PC7     PC8     PC9     PC10     PC11     PC12
# Eigenvalue            0.01382 0.01239 0.01002 0.006678 0.005053 0.004269
# Proportion Explained  0.02777 0.02488 0.02012 0.013416 0.010151 0.008577
# Cumulative Proportion 0.90081 0.92569 0.94581 0.959230 0.969381 0.977958
#                           PC13     PC14     PC15     PC16    PC17
# Eigenvalue            0.002831 0.002252 0.001411 0.001218 0.00105
# Proportion Explained  0.005686 0.004524 0.002834 0.002448 0.00211
# Cumulative Proportion 0.983645 0.988168 0.991002 0.993449 0.99556
#                            PC18      PC19      PC20      PC21      PC22
# Eigenvalue            0.0007434 0.0004759 0.0003888 0.0002236 0.0001748
# Proportion Explained  0.0014934 0.0009561 0.0007810 0.0004491 0.0003512
# Cumulative Proportion 0.9970528 0.9980089 0.9987898 0.9992390 0.9995902
#                            PC23      PC24      PC25      PC26
# Eigenvalue            0.0001057 5.548e-05 2.929e-05 1.349e-05
# Proportion Explained  0.0002124 1.115e-04 5.885e-05 2.711e-05
# Cumulative Proportion 0.9998026 9.999e-01 1.000e+00 1.000e+00
# 
# Scaling 2 for species and site scores
# * Species are scaled proportional to eigenvalues
# * Sites are unscaled: weighted dispersion equal on all dimensions
# * General scaling constant of scores:  1.949211 
# 
# 
# Species scores
# 
#          PC1       PC2      PC3       PC4       PC5        PC6
# CHA  0.17364 -0.082936  0.05158 -0.262138  0.027112 -0.0219580
# TRU  0.64261 -0.008315  0.23864  0.128658  0.064406  0.0432509
# VAI  0.51186 -0.207900 -0.15484 -0.015673 -0.110164  0.1274353
# LOC  0.38018 -0.238881 -0.22103  0.042724 -0.128345  0.0677662
# OMB  0.16736 -0.060334  0.08072 -0.255130 -0.020748  0.0011316
# BLA  0.08001 -0.145395  0.03293 -0.237964  0.106591 -0.0400894
# HOT -0.18605 -0.053178  0.04437 -0.026829 -0.072366  0.0502470
# VAN -0.11488 -0.196849 -0.11576  0.018305  0.199229  0.0465976
# CHE -0.10176  0.066882 -0.29113 -0.096710 -0.026554 -0.0230568
# BAR -0.19806 -0.212636  0.07778 -0.106483 -0.004984 -0.0262403
# SPI -0.17595 -0.160953  0.05178 -0.044373 -0.030874 -0.0153802
# GOU -0.23423 -0.138148 -0.05304 -0.004245  0.143140  0.1262663
# BRO -0.15369 -0.154576 -0.01095  0.113432  0.096356  0.0467527
# PER -0.15809 -0.200473 -0.01875  0.092353  0.042188 -0.0476916
# BOU -0.22980 -0.136252  0.08350  0.002992 -0.081337  0.0001911
# PSO -0.22809 -0.084724  0.07060 -0.027358 -0.060937  0.0263001
# ROT -0.19453 -0.042185  0.01717  0.067857  0.068756  0.0392512
# CAR -0.18673 -0.130430  0.07035 -0.007269 -0.037799 -0.0222172
# TAN -0.19401 -0.190001 -0.07720  0.085307 -0.009879 -0.1172125
# BCO -0.20362 -0.084493  0.08619  0.043058 -0.073993 -0.0029413
# PCH -0.14835 -0.053369  0.08168  0.035791 -0.049362 -0.0158830
# GRE -0.30428  0.006539  0.07341  0.002335 -0.030224  0.1232416
# GAR -0.35586  0.075284 -0.20595 -0.007679 -0.021996 -0.0803960
# BBO -0.24547 -0.037741  0.08761  0.026190 -0.096283  0.0396803
# ABL -0.42892  0.221521 -0.02675 -0.113606 -0.007041  0.1792154
# ANG -0.20741 -0.116253  0.08123  0.007865 -0.072704 -0.0050821
# 
# 
# Site scores (weighted sums of species scores)
# 
#          PC1      PC2      PC3       PC4      PC5      PC6
# 1   0.368774  0.55429  0.99759  0.542574  0.51593 -0.42032
# 2   0.504389  0.07423  0.18986  0.425221 -0.40271  0.32596
# 3   0.461810 -0.02363  0.09453  0.495853 -0.32363  0.40381
# 4   0.295783 -0.21679 -0.19198  0.545337 -0.01251  0.12671
# 5  -0.008008 -0.18805 -0.46779  0.543360  0.78960 -0.35568
# 6   0.208872 -0.20536 -0.49515  0.460652  0.13857 -0.10873
# 7   0.437641  0.01376 -0.16055  0.330082 -0.16760  0.29588
# 8   0.030738  0.57116  0.32010  0.089125  0.12309 -0.77069
# 9   0.035950  0.29368 -0.95291  0.006257 -0.60350 -0.95417
# 10  0.295900 -0.09270 -0.54608  0.152831  0.08189  0.51362
# 11  0.467883  0.11496  0.08367 -0.314428 -0.35013  0.09313
# 12  0.477279  0.06986  0.10452 -0.314976 -0.38090  0.05913
# 13  0.485431 -0.01028  0.40744 -0.586902 -0.05978 -0.04271
# 14  0.371535 -0.16370  0.19980 -0.665103  0.16808  0.04943
# 15  0.275775 -0.27593 -0.07737 -0.632601  0.48078 -0.14837
# 16  0.101719 -0.45672 -0.13330 -0.298335  0.62205 -0.32559
# 17 -0.039484 -0.41023 -0.01930 -0.408518  0.05691 -0.10018
# 18 -0.127456 -0.37740 -0.02422 -0.389726 -0.00412  0.02359
# 19 -0.268321 -0.32170 -0.14988 -0.056412 -0.19112  0.22028
# 20 -0.381755 -0.20805 -0.02078  0.044530 -0.16536  0.07815
# 21 -0.413089 -0.21936  0.13261  0.121856 -0.15856  0.07454
# 22 -0.447976 -0.15857  0.17990  0.118291 -0.09954 -0.07063
# 23 -0.249170  1.03336 -0.43922 -0.377955 -0.05535 -0.16315
# 24 -0.367283  0.77651 -0.05334 -0.302723 -0.21330  0.63323
# 25 -0.333926  0.53653 -0.24020 -0.022704  0.99813  0.79583
# 26 -0.452419 -0.06651  0.15589  0.090145 -0.20350  0.07359
# 27 -0.449205 -0.12000  0.19443  0.115537 -0.17655 -0.04474
# 28 -0.451189 -0.11625  0.22578  0.125755 -0.23545 -0.07325
# 29 -0.358289 -0.25816  0.32430 -0.020464 -0.11775  0.03123
# 30 -0.471910 -0.14896  0.36164  0.183443 -0.05369 -0.21992
```

---
# Function `rda()`

- RDA proceeds in 2 steps:
  - multiple regressions
  - PCA on the regressed (fitted) values

- If we give only one table to the function `rda()`, it performs a PCA directly, without any regression

.center[
.alert[ rda(Y~X) ![:faic](arrow-right) RDA

rda(Y) or rda(X) ![:faic](arrow-right) PCA ]]
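
---
# Function `rda()`

The two steps can be imitated in base R to convey the idea. This is a conceptual sketch under simplifying assumptions, not a description of how vegan computes an RDA internally; `Y` and `X` are hypothetical data.

```r
# Hypothetical data: 6 sites, 3 response variables (Y), 1 explanatory variable (X)
Y <- cbind(sp1 = c(1, 3, 4, 6, 8, 9),
           sp2 = c(9, 7, 6, 5, 2, 1),
           sp3 = c(2, 5, 1, 4, 3, 6))
X <- c(0.5, 1.0, 1.8, 2.5, 3.1, 4.0)

# Step 1: multiple regression of every response variable on X
Y.fit <- fitted(lm(Y ~ X))

# Step 2: PCA on the fitted values (the part of Y explained by X)
rda.sketch <- prcomp(Y.fit)
rda.sketch$sdev^2  # with one explanatory variable, only one axis carries variance

# With a single table there is nothing to regress on: this is a plain PCA
pca.sketch <- prcomp(Y)
pca.sketch$sdev^2
```

The variance of the fitted values can never exceed the total variance of `Y`, which is why the constrained part is reported as a fraction of the total inertia.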

---
# PCA - Interpretation of Output

.center[
![:scale 80%](images/PCAout1.png)]

- Total variance explained by the descriptors (here the fish species)
- In PCA, note that the "Total" and "Unconstrained" portions of the explained variance are identical
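
---
# PCA - Interpretation of Output

The "Proportion Explained" row is simply each eigenvalue divided by the total inertia. The short check below uses numbers copied from the `summary(spe.h.pca)` output shown earlier; since those printed values are rounded, the results agree with the printed proportions only up to rounding.

```r
# Values copied from the summary() output
total.inertia <- 0.4978
eig <- c(PC1 = 0.2491, PC2 = 0.06455, PC3 = 0.04615)

prop <- eig / total.inertia  # proportion explained by each axis
round(prop, 3)
round(cumsum(prop), 3)       # cumulative proportion
```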