QCBS members who contributed to this workshop
by modifying and improving its content as part of the
Learning and Development Award
2022 - 2021 - 2020
Pedro Henrique P. Braga
Katherine Hébert
Mi Lin
Linley Sherin
2019 - 2018 - 2017
Gabriel Muñoz
Marie Hélène-Brice
Pedro Henrique P. Braga
2016 - 2015 - 2014
Bérenger Bourgeois
Xavier Giroux-Bougard
Amanda Winegardner
Emmanuelle Chrétien
Monica Granados
If you would like to contribute too, visit r.qcbs.ca/contributing
and don't hesitate to get in touch with us!
This workshop requires recent versions of R and RStudio.
To install the required packages from CRAN, run:
install.packages(c("ape", "ade4", "codep", "gclus", "vegan", "GGally", "PlaneGeometry", "remotes"))
Throughout this workshop, there will be a series of challenges that you can recognize by this Rubik's cube.
During these challenges, do not hesitate to collaborate!
Learn the basics of multivariate analysis to reveal patterns in community composition data
Use R
to perform an unconstrained ordination
Learn about similarity and dissimilarity coefficients and transformations to perform multivariate analysis
Use R
to create dendrograms
Learn the following methods:
We have learned a multitude of analyses that allowed us to interpret ecological data while depicting the effects of one or several variables on a single response variable.
We can recall:
General Linear Models: lm(), anova(), t.test(), lmer();
Generalized Linear Models: glm() and glmer(), with several family() link functions;
Generalized Additive Models: gam().
These models allowed us to ask questions such as:
However, one may be interested in making inferences from ecological data containing more than one outcome or dependent variable.
This interest may be driven by hypothesis testing and modelling, but also be entirely exploratory.
For instance, our research question might be:
How does the bacterial composition on maple leaves change along the elevational gradient?
What is the composition dissimilarity of bat communities?
How closely-related are local spider communities in terms of their composition?
In all these cases, the outcome is composed of several variables, typically a sample-by-species or a sample-by-environment matrix.
We will now dive into multivariate statistics, a tool set that will allow us to address questions requiring the simultaneous observation or analysis of more than one outcome variable.
Association (or dis-similarity) measures and matrices
Classification (or cluster) analysis
Unconstrained ordination
Constrained (or canonical) ordination
But, before that, we must recall the basics of matrix algebra.
Matrix algebra is well-suited for ecology, because most (if not all) data sets we work with are in a matrix format.
Ecological data tables are obtained as object-observations or sampling units, and are often recorded as this:
Objects | y1 | y2 | … | yn |
---|---|---|---|---|
x1 | y1,1 | y1,2 | … | y1,n |
x2 | y2,1 | y2,2 | … | y2,n |
⋮ | ⋮ | ⋮ | ⋱ | ⋮ |
xm | ym,1 | ym,2 | … | ym,n |
where x_m is the sampling unit m, and y_n is an ecological descriptor that can be, for example, a species present in a sampling unit, a locality, or a chemical variable.
The same ecological data table can be represented in matrix notation like this:
Y = [y_{m,n}] =
\begin{bmatrix}
y_{1,1} & y_{1,2} & \cdots & y_{1,n} \\
y_{2,1} & y_{2,2} & \cdots & y_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
y_{m,1} & y_{m,2} & \cdots & y_{m,n}
\end{bmatrix}
where lowercase letters indicate elements, and the subscript letters indicate the position of these elements in the matrix (and in the table!).
Moreover, any subset of a matrix can be recognized!
a row matrix \begin{bmatrix} y_{1,1} & y_{1,2} & \cdots & y_{1,n} \end{bmatrix}
a column matrix \begin{bmatrix} y_{1,1} \\ y_{2,1} \\ \vdots \\ y_{m,1} \end{bmatrix}
Two important matrices can be derived from the ecological data matrix: the association matrix among objects and the association matrix among descriptors.
Using the data from our matrix Y,
one can examine the relationship between the first two objects:
and obtain a1,2.
We can populate the association matrix A_{m,m} with the relationships between all pairs of objects from Y:
A_{m,m} =
\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,m} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,m} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m,1} & a_{m,2} & \cdots & a_{m,m}
\end{bmatrix}
Because A_{m,m} has the same number of rows and columns (one per object), it is denoted a square matrix.
Therefore, it has m^2 elements.
We can also obtain the relationship between the first two descriptors of Y, y_1 and y_2:
\begin{bmatrix} y_{1,1} \\ y_{2,1} \\ \vdots \\ y_{m,1} \end{bmatrix}
\begin{bmatrix} y_{1,2} \\ y_{2,2} \\ \vdots \\ y_{m,2} \end{bmatrix}
and store it in a_{1,2}.
We can populate the association matrix A_{n,n} with the relationships between all pairs of descriptors from Y:
A_{n,n} =
\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,n} \\
a_{2,1} & a_{2,2} & \cdots & a_{2,n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{n,1} & a_{n,2} & \cdots & a_{n,n}
\end{bmatrix}
This A_{n,n} is also a square matrix, and it has n^2 elements.
These matrices are the basis of Q-mode and R-mode analyses in ecology.
Y_{m,n} =
\begin{bmatrix}
y_{1,1} & \cdots & y_{1,n} \\
\vdots & \ddots & \vdots \\
y_{m,1} & \cdots & y_{m,n}
\end{bmatrix}
A_{n,n} =
\begin{bmatrix}
a_{1,1} & \cdots & a_{1,n} \\
\vdots & \ddots & \vdots \\
a_{n,1} & \cdots & a_{n,n}
\end{bmatrix}
← R-mode analysis, based on associations among descriptors or species
A_{m,m} =
\begin{bmatrix}
a_{1,1} & \cdots & a_{1,m} \\
\vdots & \ddots & \vdots \\
a_{m,1} & \cdots & a_{m,m}
\end{bmatrix}
← Q-mode analysis, based on associations among objects or sites
We will dive into R-mode and Q-mode analyses, and we will explore:
But, first, let us introduce a real data set.
It will be your time to get your hands dirty!
Verneaux (1973) proposed to use fish species to characterize ecological zones along European rivers and streams.
He collected data at 30 localities along the Doubs river, which runs near the France-Switzerland border, in the Jura Mountains.
He showed that fish communities were biological indicators of these water bodies.
His data are split into three matrices:
Verneaux, J. (1973) Cours d'eau de Franche-Comté (Massif du Jura). Recherches écologiques sur le réseau hydrographique du Doubs. Essai de biotypologie. Thèse d'état, Besançon. 1–257.
You can download these datasets from r.qcbs.ca/workshops/r-workshop-09.
spe <- read.csv("data/doubsspe.csv", row.names = 1)
env <- read.csv("data/doubsenv.csv", row.names = 1)
This data can also be retrieved from the ade4
package:
library(ade4)
data(doubs)
spe <- doubs$fish
env <- doubs$env
Alternatively, from the codep
package:
library(codep)
data(Doubs)
spe <- Doubs.fish
env <- Doubs.env
We can then explore the objects containing our newly loaded data.
Let us peek into the env
data:
str(env)
# 'data.frame': 30 obs. of 11 variables:
#  $ das: num  0.3 2.2 10.2 18.5 21.5 32.4 36.8 49.1 70.5 99 ...
#  $ alt: int  934 932 914 854 849 846 841 792 752 617 ...
#  $ pen: num  48 3 3.7 3.2 2.3 3.2 6.6 2.5 1.2 9.9 ...
#  $ deb: num  0.84 1 1.8 2.53 2.64 2.86 4 1.3 4.8 10 ...
#  $ pH : num  7.9 8 8.3 8 8.1 7.9 8.1 8.1 8 7.7 ...
#  $ dur: int  45 40 52 72 84 60 88 94 90 82 ...
#  $ pho: num  0.01 0.02 0.05 0.1 0.38 0.2 0.07 0.2 0.3 0.06 ...
#  $ nit: num  0.2 0.2 0.22 0.21 0.52 0.15 0.15 0.41 0.82 0.75 ...
#  $ amm: num  0 0.1 0.05 0 0.2 0 0 0.12 0.12 0.01 ...
#  $ oxy: num  12.2 10.3 10.5 11 8 10.2 11.1 7 7.2 10 ...
#  $ dbo: num  2.7 1.9 3.5 1.3 6.2 5.3 2.2 8.1 5.2 4.3 ...
Variable | Description |
---|---|
das | Distance from the source [km] |
alt | Altitude [m a.s.l.] |
pen | Slope [per thousand] |
deb | Mean minimum discharge [m³ s⁻¹] |
pH | pH of water |
dur | Calcium conc. (hardness) [mg L⁻¹] |
pho | Phosphate conc. [mg L⁻¹] |
nit | Nitrate conc. [mg L⁻¹] |
amm | Ammonium (NH₄⁺) conc. [mg L⁻¹] |
oxy | Dissolved oxygen [mg L⁻¹] |
dbo | Biological oxygen demand [mg L⁻¹] |
summary(env) # summary statistics
Let us peek into the spe
data:
head(spe)[, 1:8]
#   CHA TRU VAI LOC OMB BLA HOT TOX
# 1   0   3   0   0   0   0   0   0
# 2   0   5   4   3   0   0   0   0
# 3   0   5   5   5   0   0   0   0
# 4   0   4   5   5   0   0   0   0
# 5   0   2   3   2   0   0   0   0
# 6   0   3   4   5   0   0   0   0
str(spe)
# 'data.frame': 30 obs. of 27 variables:
#  $ CHA: int  0 0 0 0 0 0 0 0 0 0 ...
#  $ TRU: int  3 5 5 4 2 3 5 0 0 1 ...
#  $ VAI: int  0 4 5 5 3 4 4 0 1 4 ...
#  $ LOC: int  0 3 5 5 2 5 5 0 3 4 ...
#  $ OMB: int  0 0 0 0 0 0 0 0 0 0 ...
#  $ BLA: int  0 0 0 0 0 0 0 0 0 0 ...
#  $ HOT: int  0 0 0 0 0 0 0 0 0 0 ...
# ... (remaining species omitted)
# Try some of these!
names(spe)   # names of objects
dim(spe)     # dimensions
str(spe)     # structure of objects
summary(spe) # summary statistics
head(spe)    # first 6 rows
Community resemblance is almost always assessed on the basis of species composition data in the form of a site-by-species data table Ym,n.
We can obtain an association matrix Am,m in the form of pairwise distances or dissimilarities Dm,m (or similarities Sm,m) and then analyse those distances.
In R
, we can compute distance or dissimilarity matrices using stats::dist()
:
dist(spe)
#            1         2         3         4         5         6         7
# 2   5.385165
# 3   7.416198  2.449490
# 4   7.874008  4.123106  3.000000
# 5  10.816654 10.677078 10.862780  9.219544
# 6   7.348469  4.582576  4.123106  2.828427  8.185353
# 7   6.855655  2.449490  2.000000  3.605551 10.488088  3.605551
# 8   3.000000  7.071068  8.717798  8.774964 10.954451  7.937254  8.246211
# ... (output truncated here: the full print shows the lower triangle of all pairwise distances among the 30 sites)
For simplicity, we have done this without specifying any arguments.
Run dist(spe)
on your end, note what you observe, and see what the commands below show you:
class(dist(spe))
str(dist(spe))
as.matrix(dist(spe))
dim(as.matrix(dist(spe)))
If possible, do this part with the participants.
Show them how a dist class object is structured. Highlight that it is a lower-triangular matrix, where the distances between the same objects are hidden.
There are three groups of distance coefficients: metrics, semimetrics, and nonmetrics.
The first group consists of metrics, and its coefficients satisfy the following properties:
minimum 0: if two objects are identical, then D(a, a) = 0;
positiveness: if two objects differ, then D(a, b) > 0;
symmetry: D(a, b) = D(b, a);
triangle inequality: D(a, b) + D(b, c) ≥ D(a, c).
We can spot all these properties below:
as.matrix(dist(spe))[1:3, 1:3]
#          1        2        3
# 1 0.000000 5.385165 7.416198
# 2 5.385165 0.000000 2.449490
# 3 7.416198 2.449490 0.000000
Highlight the triangular nature of the dist
class, that it can be converted to a matrix, and the dimensions (noting that it has the same number of species).
The second group consists of semimetrics, and they violate the triangle inequality property:
The third group consists of nonmetrics, and they violate the positiveness property of metric distances:
The most common metric distance measure is the Euclidean distance.
It is computed using the Pythagorean formula:
D_1(x_1, x_2) = \sqrt{\sum_{j=1}^{p} (y_{1j} - y_{2j})^2}
Using stats::dist()
, we can compute it with:
spe.D.Euclid <- dist(x = spe, method = "euclidean")
And, we can test whether a distance matrix is Euclidean using ade4's is.euclid():
is.euclid(spe.D.Euclid)
# [1] TRUE
The figure on the side needs to be finished. It is supposed to represent the Euclidean distance between sites x1 and x2 in the 2D Cartesian space. The axes are the descriptors y1 and y2, respectively. The arrow on the bottom is the distance between the position of y21 and y11.
You can use it to recall the Pythagorean theorem that everyone learns in high school, where c denotes the length of the hypotenuse and a and b denote the lengths of the two legs of a right triangle.
method = "euclidean"
is the default parameter
Your turn! Using the dist()
function, compute the Euclidean distance matrix D.hmm for the site-by-species abundance matrix Y.hmm below:
Y.hmm <- data.frame(
  y1 = c(0, 0, 1),
  y2 = c(4, 1, 0),
  y3 = c(8, 1, 0)
)
Sites | y1 | y2 | y3 |
---|---|---|---|
s1 | 0 | 4 | 8 |
s2 | 0 | 1 | 1 |
s3 | 1 | 0 | 0 |
After this, look into the numbers, think critically about them, and be ready to chat! (5 to 10 minutes)
Solution:
Y.hmm.DistEu <- dist(x = Y.hmm, method = "euclidean")
as.matrix(Y.hmm.DistEu)
Output:
#          1        2        3
# 1 0.000000 7.615773 9.000000
# 2 7.615773 0.000000 1.732051
# 3 9.000000 1.732051 0.000000
Tip: Look into the composition and the distances between sites s2 and s3 and between s1 and s2.
Solution:
Y.hmm
#   y1 y2 y3
# 1  0  4  8
# 2  0  1  1
# 3  1  0  0
Output:
as.matrix(Y.hmm.DistEu)
#          1        2        3
# 1 0.000000 7.615773 9.000000
# 2 7.615773 0.000000 1.732051
# 3 9.000000 1.732051 0.000000
The Euclidean distance between sites s2 and s3, which have no species in common, is smaller than the distance between s1 and s2, which share species y2 and y3 (!).
From an ecological perspective, this is a problematic assessment of the relationship among sites.
This issue is known as the double-zero problem, i.e. double zeroes are treated in the same way as double presences, so that the double zeros shrink the distance between two sites.
Euclidean distances ( D1 ) should thus not be used to compare sites based on species abundances.
Orlóci (1967) proposed the Chord distance to analyse community composition.
It consists of:
1. Normalizing the data, i.e. scaling each site vector to length 1 by dividing the abundance of each species by the square root of the sum of the squared abundances of all species in that site:
y'_{ij} = \frac{y_{ij}}{\sqrt{\sum_{j=1}^{p} y_{ij}^2}}
2. Calculating the Euclidean distances on this normalized data:
D_3(x_1, x_2) = \sqrt{\sum_{j=1}^{p} (y'_{1j} - y'_{2j})^2}
We can use vegan::vegdist()
for this one:
spe.D.Ch <- vegdist(spe, method = "chord")
as.matrix(spe.D.Ch)[1:3, 1:3]
#           1         2         3
# 1 0.0000000 0.7653669 0.9235374
# 2 0.7653669 0.0000000 0.2309609
# 3 0.9235374 0.2309609 0.0000000
When two sites share the same species in the same proportions of the number of individuals, the value of D_3 is 0; when no species are shared, its value is \sqrt{2}.
method = "euclidean"
is the default parametre
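As a quick consistency check (a sketch, not part of the original exercise), the chord distance can also be obtained by normalizing each site vector to unit length with decostand() and then computing Euclidean distances; the object names spe.norm and spe.D.Ch2 are ours:
spe.norm  <- decostand(spe, method = "normalize")  # scale each site (row) vector to length 1
spe.D.Ch2 <- dist(spe.norm, method = "euclidean")  # Euclidean distance on the normalized data
all.equal(as.numeric(spe.D.Ch), as.numeric(spe.D.Ch2))  # should be TRUE, within numerical tolerance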
There are many other metric distances of interest, which we can use:
What happens if we compute Chord distances in the same site-by-species matrix Yhmm?
Y.hmm
#   y1 y2 y3
# 1  0  4  8
# 2  0  1  1
# 3  1  0  0
as.matrix(Y.hmm.DistEu)
#          1        2        3
# 1 0.000000 7.615773 9.000000
# 2 7.615773 0.000000 1.732051
# 3 9.000000 1.732051 0.000000
Y.hmm.DistCh <- vegdist(Y.hmm, method = "chord")
as.matrix(Y.hmm.DistCh)
#           1         2        3
# 1 0.0000000 0.3203645 1.414214
# 2 0.3203645 0.0000000 1.414214
# 3 1.4142136 1.4142136 0.000000
Adding any number of double zeroes to a pair of sites does not change the value of D3.
Hence, Chord distances can be used to compare sites described by species abundances!
method = "euclidean"
is the default parameter
Another popular association coefficient is the Jaccard similarity coefficient (1900).
It is only appropriate for binary (presence-absence) data: the similarity is the size of the intersection divided by the size of the union of the two sample sets, and the corresponding distance is its one-complement:
D_7(x_1, x_2) = 1 - \frac{|x_1 \cap x_2|}{|x_1 \cup x_2|} = 1 - \frac{|x_1 \cap x_2|}{|x_1| + |x_2| - |x_1 \cap x_2|} = 1 - \frac{a}{a + b + c}
where a is the number of species shared by the two sites (double presences), b is the number of species present only in the first site, and c is the number of species present only in the second site.
For example, for sites x1 and x2:
x1,x2 | y1 | y2 | y3 | y4 | y5 |
---|---|---|---|---|---|
x1 | 0 | 1 | 0 | 1 | 0 |
x2 | 0 | 1 | 1 | 1 | 1 |
So:
D_7(x_1, x_2) = 1 - \frac{2}{2 + 0 + 2} = 0.5
with a = 2 (shared species y2 and y4), b = 0, and c = 2 (species y3 and y5, present only in x_2).
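We can verify this worked example with vegdist(); the vectors x1 and x2 below are simply the two rows of the small table above (a sketch):
x1 <- c(0, 1, 0, 1, 0)
x2 <- c(0, 1, 1, 1, 1)
vegdist(rbind(x1, x2), method = "jaccard", binary = TRUE)
# expected: 1 - a/(a + b + c) = 1 - 2/4 = 0.5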
All parameters in Jaccard's similarity coefficient have equal weights.
D_7(x_1, x_2) = 1 - \frac{a}{a + b + c}
However, you may want to consider that a presence of a species is more informative than its absence.
The distance corresponding to Sørensen's similarity coefficient (1948) gives weight to double presences:
D_{13}(x_1, x_2) = 1 - \frac{2a}{2a + b + c} = \frac{b + c}{2a + b + c}
where the double presences (a) are weighted twice, and b and c are the numbers of species unique to the first and second site, respectively.
The Bray-Curtis dissimilarity coefficient is a modified version of Sørensen's index that accounts for species abundances:
D_{14}(x_1, x_2) = \frac{\sum_j |y_{1j} - y_{2j}|}{\sum_j (y_{1j} + y_{2j})} = 1 - \frac{2W}{A + B}
where W is the sum of the lowest abundances of each species found at both sites, and A and B are the total abundances of sites x_1 and x_2, respectively.
For example, for sites x1 and x2:
x1,x2 | y1 | y2 | y3 | y4 | y5 |
---|---|---|---|---|---|
x1 | 2 | 1 | 0 | 5 | 2 |
x2 | 5 | 1 | 3 | 1 | 1 |
So:
D_{14}(x_1, x_2) = 1 - \frac{2 \times 5}{10 + 11} = 1 - \frac{10}{21} \approx 0.52
with W = 2 + 1 + 0 + 1 + 1 = 5, A = 10, and B = 11.
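Again, the hand calculation can be checked with vegdist(); x1 and x2 are the two abundance rows of the table above (a sketch):
x1 <- c(2, 1, 0, 5, 2)
x2 <- c(5, 1, 3, 1, 1)
vegdist(rbind(x1, x2), method = "bray")
# expected: sum|y1j - y2j| / sum(y1j + y2j) = 11/21, i.e. about 0.52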
In R
, you can use the vegan::vegdist()
function to calculate the Jaccard's and Sørensen's indices:
spe.D.Jac <- vegdist(spe, method = "jaccard", binary = TRUE)
spe.D.Sor <- vegdist(spe, method = "bray", binary = TRUE)
Because both Jaccard's and Sørensen's are only appropriate for presence-absence data, you must binary-transform abundance data using
binary = TRUE
invegdist()
.
To calculate the Bray-Curtis dissimilarity coefficient, which can account for abundances, you need to set binary = FALSE
.
spe.db <- vegdist(spe, method = "bray", binary = FALSE)
spe.db.mat <- as.matrix(spe.db)
In many datasets, variables are correlated, and using a distance metric that ignores this can lead to misleading results.
The Mahalanobis distance takes into account the correlation between variables in the data.
D^2_{Mahal} = (x - \mu)^\top \Sigma^{-1} (x - \mu)
where D^2_{Mahal} is the squared Mahalanobis distance, x is the point you are measuring the distance from, \mu is the mean vector of the distribution you are measuring against, and \Sigma is the covariance matrix of that distribution.
It is also robust to changes in the distribution of the data (it does not assume normality).
In R
, we can compute it as follows:
# Set the seed for reproducibility
set.seed(123)
# Generate random data from three normal distributions
data <- data.frame(
  x = rnorm(100, mean = 10, sd = 2),
  y = rnorm(100, mean = 5, sd = 1),
  z = rnorm(100, mean = 20, sd = 4)
)
# Calculate the Mahalanobis distance of each observation to the data centroid
mahalanobis_dist <- mahalanobis(data, center = colMeans(data), cov = cov(data))
We can create graphical depictions of association matrices using the coldiss()
function:
coldiss(spe.D.Jac)
Communities sampled over homogeneous or short environmental conditions can have species compositions with few zeroes, so that Euclidean distances could be enough to characterize them.
Nevertheless, this is rarely the reality.
Species may be highly frequent when conditions are favourable, or may be absent from many sites. Sometimes, this skewness may introduce spurious problems to our analyses.
We may then have to transform our composition data to appropriately analyze it.
In R
, we can rely on vegan::decostand()
for many types of transformations.
Take a look into the help of this function to see the available options:
?decostand()
Let us see some of them.
We can change the argument method
to "pa"
to transform our abundance data into presence-absence data:
If y_{ij} \geq 1, then y'_{ij} = 1; otherwise, y'_{ij} = 0.
Let us recall our spe
data set:
spe[1:6, 1:6]
#   CHA TRU VAI LOC OMB BLA
# 1   0   3   0   0   0   0
# 2   0   5   4   3   0   0
# 3   0   5   5   5   0   0
# 4   0   4   5   5   0   0
# 5   0   2   3   2   0   0
# 6   0   3   4   5   0   0
Let us transform spe
abundances to presence-absences:
spe.pa <- decostand(spe, method = "pa")
spe.pa[1:6, 1:6]
#   CHA TRU VAI LOC OMB BLA
# 1   0   1   0   0   0   0
# 2   0   1   1   1   0   0
# 3   0   1   1   1   0   0
# 4   0   1   1   1   0   0
# 5   0   1   1   1   0   0
# 6   0   1   1   1   0   0
Sometimes, one wants to remove the effects of highly abundant units. We can transform the data into profiles of relative species abundances through the following equation:
y'_{ij} = \frac{y_{ij}}{y_{i+}}
Let us recall our spe
dataset:
spe[1:5, 1:6]
#   CHA TRU VAI LOC OMB BLA
# 1   0   3   0   0   0   0
# 2   0   5   4   3   0   0
# 3   0   5   5   5   0   0
# 4   0   4   5   5   0   0
# 5   0   2   3   2   0   0
In decostand()
:
spe.total <- decostand(spe, method = "total")
spe.total[1:5, 1:6]
#   CHA        TRU        VAI        LOC OMB BLA
# 1   0 1.00000000 0.00000000 0.00000000   0   0
# 2   0 0.41666667 0.33333333 0.25000000   0   0
# 3   0 0.31250000 0.31250000 0.31250000   0   0
# 4   0 0.19047619 0.23809524 0.23809524   0   0
# 5   0 0.05882353 0.08823529 0.05882353   0   0
y_{i+} denotes the total count of the ith sample, summed over all species j = 1, \dots, n.
We can take the square-root of the species profile transformation and obtain the Hellinger transformation, which has very good mathematical properties and allows us to reduce the effects of yij values that are extremely large.
y'_{ij} = \sqrt{\frac{y_{ij}}{y_{i+}}}
Let us recall our spe
dataset:
spe[1:5, 1:6]
#   CHA TRU VAI LOC OMB BLA
# 1   0   3   0   0   0   0
# 2   0   5   4   3   0   0
# 3   0   5   5   5   0   0
# 4   0   4   5   5   0   0
# 5   0   2   3   2   0   0
In decostand()
:
spe.hel <- decostand(spe, method = "hellinger")
spe.hel[1:5, 1:6]
#   CHA       TRU       VAI       LOC OMB BLA
# 1   0 1.0000000 0.0000000 0.0000000   0   0
# 2   0 0.6454972 0.5773503 0.5000000   0   0
# 3   0 0.5590170 0.5590170 0.5590170   0   0
# 4   0 0.4364358 0.4879500 0.4879500   0   0
# 5   0 0.2425356 0.2970443 0.2425356   0   0
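As a small consistency check (a sketch), the Hellinger transformation is simply the square root of the species-profile ("total") transformation:
all.equal(
  decostand(spe, method = "hellinger"),
  sqrt(decostand(spe, method = "total")),
  check.attributes = FALSE
)
# should be TRUE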
Standardizing environmental variables is crucial as you cannot compare the effects of variables with different units:
## ?decostand
env.z <- decostand(env, method = "standardize")
This centres and scales the variables to make your downstream analysis more appropriate:
apply(env.z, 2, mean)
#           das           alt           pen           deb            pH
#  1.000429e-16  1.814232e-18 -1.659010e-17  1.233099e-17 -4.096709e-15
#           dur           pho           nit           amm           oxy
#  3.348595e-16  1.327063e-17 -8.925898e-17 -4.289646e-17 -2.886092e-16
#           dbo
#  7.656545e-17
apply(env.z, 2, sd)
# das alt pen deb  pH dur pho nit amm oxy dbo
#   1   1   1   1   1   1   1   1   1   1   1
One application of association matrices is clustering.
Clustering highlights structures in the data by partitioning either the objects or the descriptors.
One goal of ecologists could be to divide a set of sites into groups with respect to their environmental conditions or their community composition.
Its results can be represented as dendrograms (tree-like diagrams), which describe how closely related the observations are.
Clustering algorithms generally build on this series of steps:
From the many existing hierarchical clustering algorithms, we will explore:
In R
, we will use the functions hclust()
and agnes()
to build our dendrograms.
In single linkage agglomerative clustering (also called nearest neighbour sorting), the objects at the closest distances agglomerate, which often generates long, thin clusters or chains of objects. Conversely, in complete linkage agglomerative clustering, an object agglomerates to a group only when it is linked to the furthest element of the group, which in turn links it to all members of that group. This forms many small, separate groups, and is more appropriate when looking for contrasts or discontinuities in the data. Ward's minimum variance clustering differs from these two methods in that it clusters objects into groups using the criterion of least squares (similar to linear models). Its dendrogram shows squared distances by default; to compare with other methods, calculate the square root of the distances first.
The objects divided into small groups (1-2, 3-4, 5)
Connect small groups using the largest distance between their elements
Create a distance matrix from Hellinger transformed Doubs river data and compute the single linkage clustering:
spe.dhel <- vegdist(spe.hel, method = "euclidean")
spe.dhel.single <- hclust(spe.dhel, method = "single")
plot(spe.dhel.single)
spe.dhel.complete <- hclust(spe.dhel, method = "complete")
plot(spe.dhel.single, main = "Single linkage clustering", hang = -1)
plot(spe.dhel.complete, main = "Complete linkage clustering", hang = -1)
Single linkage:
Chains of objects occur (e.g. 19,29,30,26)
Complete linkage: contrasted groups of objects are formed
Uses the criterion of least squares to cluster objects into groups
At each step, the pair of clusters merging is the one leading to the minimum increase in total within-group sum of squares
Clusters generated using this method tend to be more spherical and to contain a similar number of objects
Compute the Ward's minimum variance clustering and plot the dendrogram by using the square root of the distances:
spe.dhel.ward <- hclust(spe.dhel, method = "ward.D2")
spe.dhel.ward$height <- sqrt(spe.dhel.ward$height)
plot(spe.dhel.ward, hang = -1) # hang = -1 aligns objects at 0
Ward's clustering algorithm should only be applied to Euclidean distances. Bray-Curtis distances are not Euclidean, so they must be square-rooted before being used with Ward's algorithm (see the sketch below).
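A minimal sketch of that note, assuming the Doubs species data loaded above (the object name spe.bray is ours, not part of the original script):
spe.bray <- vegdist(spe, method = "bray")   # Bray-Curtis dissimilarities
is.euclid(spe.bray)                         # from ade4; typically FALSE for Bray-Curtis
spe.bray.sqrt <- sqrt(spe.bray)             # square-rooting makes the matrix Euclidean
is.euclid(spe.bray.sqrt)                    # should be TRUE
spe.bray.ward <- hclust(spe.bray.sqrt, method = "ward.D2")
plot(spe.bray.ward, hang = -1)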
However, clustering allows us to:
While cluster analysis looks for discontinuities in a dataset, ordination extracts the main trends in the form of continuous axes.
From now, we will look into four types of unconstrained ordination methods...
...wait, what do we mean by unconstrained ordinations? Anyone?
If no one speaks out, choose a "volunteer", presenter!
Unconstrained ordinations assess relationships within a single set of variables. No attempt is made to define the relationship between a set of independent variables and one or more dependent variables.
In other words, the interpretation of potential effects of other factors that generated observed patterns can only be made indirectly, because those factors are not explicitly included in the analyses.
Here, we will explore:
Principal Component Analysis
Principal Coordinate Analysis
Correspondence Analysis
Non-Metric MultiDimensional Scaling
We already understand the meaning of variance and mean, and how to calculate them:
\sigma^2_x = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}
\mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i
These are very useful to understand the centre and the dispersion of a given variable or dimension.
Nevertheless, we are often interested in more than one dimension, and we want to measure how much each dimension varies from its mean with respect to the others.
Covariance is such a measure, which can depict how two dimensions co-vary:
\text{var}_x = \sigma^2_x = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}
\text{cov}_{x,y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}
Intuitively, we can measure the covariance between more than two variables. Let us say, between the variables x, y, and z:
\text{cov}_{x,y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}
\text{cov}_{x,z} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(z_i - \bar{z})}{N - 1}
\text{cov}_{z,y} = \frac{\sum_{i=1}^{N} (z_i - \bar{z})(y_i - \bar{y})}{N - 1}
We can represent these calculations in a covariance matrix:
C_{(x,y,z)} =
\begin{bmatrix}
\text{cov}_{x,x} & \text{cov}_{x,y} & \text{cov}_{x,z} \\
\text{cov}_{y,x} & \text{cov}_{y,y} & \text{cov}_{y,z} \\
\text{cov}_{z,x} & \text{cov}_{z,y} & \text{cov}_{z,z}
\end{bmatrix}
QUIZ TIME
What are the diagonals? And, what happens if variables are independent?
What are the diagonals?
If \text{cov}_{x,y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}, then:
\text{cov}_{x,x} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})}{N - 1} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1} = \text{var}_x
So that:
C_{(x,y,z)} =
\begin{bmatrix}
\text{cov}_{x,x} & \text{cov}_{x,y} & \text{cov}_{x,z} \\
\text{cov}_{y,x} & \text{cov}_{y,y} & \text{cov}_{y,z} \\
\text{cov}_{z,x} & \text{cov}_{z,y} & \text{cov}_{z,z}
\end{bmatrix}
=
\begin{bmatrix}
\text{var}_{x} & \text{cov}_{x,y} & \text{cov}_{x,z} \\
\text{cov}_{y,x} & \text{var}_{y} & \text{cov}_{y,z} \\
\text{cov}_{z,x} & \text{cov}_{z,y} & \text{var}_{z}
\end{bmatrix}
The covariance of a variable with itself is its variance!
What happens if the variables are independent?
x <- rnorm(5000, mean = 0, sd = 1)
y <- rnorm(5000, mean = 0, sd = 1)
z <- rnorm(5000, mean = 0, sd = 1)
xyz <- data.frame(x, y, z)
GGally::ggpairs(xyz)
cov(xyz)
#          x        y       z
# x  0.98469 -0.00787 0.00880
# y -0.00787  1.00997 0.00286
# z  0.00880  0.00286 1.01408
If variables are perfectly independent (or uncorrelated), the covariance matrix C(x,y,z) is:
C_{(x,y,z)} =
\begin{bmatrix}
\text{var}_x & 0 & 0 \\
0 & \text{var}_y & 0 \\
0 & 0 & \text{var}_z
\end{bmatrix}
i.e. the off-diagonal covariances are (close to) zero, whereas large covariances would indicate that the variables co-vary (are collinear).
And, here, \text{var}_x = \text{var}_y = \text{var}_z = 1.
We are often interested in observing variables in different ways.
In this process, we create a new variable, let us say x_{new}, by multiplying the values of the original variable x by a constant and/or adding a constant to them. For instance:
We can transform a variable of distances measured in kilometres, d_{km}, into miles: d_{mi} = 0.62 \times d_{km}
We can also transform Fahrenheit degrees to Celsius degrees as:
T_{C} = 0.5556 \times T_{Fahr} - 17.778
These are examples of linear transformations: the transformed variables are linearly related to the original variables and the shapes of the distribution are not changed.
Two types of transformations are very important for us:
Centering, which subtracts the values of a predictor by the mean:
x' = x_i - \bar{x}
Scaling, which divides the (centred) predictor variables by their standard deviation:
x'' = \frac{x'}{\sigma_x}
Centering is basically a technique where the mean of independent variables is subtracted from all the values. It means all independent variables have zero mean. Scaling is similar to centering. Predictor variables are divided by their standard deviation.
The presenter should mention here that centering brings the mean to zero and scaling brings the standard deviation to one-unit. They also should mention that variables become comparable when scaling, as their unit is lost.
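A small illustration (a sketch, using base R's scale(); the toy vector x is ours):
x <- c(2, 4, 6, 8, 10)
scale(x, center = TRUE, scale = FALSE)      # centred: the values now have mean 0
x.z <- scale(x, center = TRUE, scale = TRUE) # centred and scaled (z-scores)
mean(x.z)             # approximately 0
sd(as.numeric(x.z))   # 1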
Square matrices, such as the covariance matrix, can be decomposed into Eigenvalues and Eigenvectors.
For a square matrix, A_{n \times n}, a vector v is an Eigenvector of A, if there is a scalar, \lambda, for which:
A_{n \times n} v_{n \times 1} = \lambda v_{n \times 1}, or
\left(\begin{matrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{matrix}\right)
\left(\begin{matrix} v_1 \\ \vdots \\ v_n \end{matrix}\right)
= \lambda \left(\begin{matrix} v_1 \\ \vdots \\ v_n \end{matrix}\right)
with the value of \lambda being the corresponding Eigenvalue.
In other words, the matrix A effectively stretches the Eigenvector v by the amount specified by the Eigenvalue (scalar) \lambda.
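A minimal numeric check of this definition (a sketch; the small symmetric matrix A below is ours, chosen only for illustration):
A <- matrix(c(2, 1,
              1, 2), nrow = 2, byrow = TRUE)
e <- eigen(A)
e$values                      # Eigenvalues: 3 and 1
e$vectors                     # Eigenvectors (one per column)
A %*% e$vectors[, 1]          # equals...
e$values[1] * e$vectors[, 1]  # ...the first Eigenvector stretched by its Eigenvalue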
An Eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it.
Wait! What do we mean by unchanged direction?
Let us represent this with this simple example.
We can transform a square into a parallelogram using a single-axis shear transformation.
Let S be the square with vertices (0,0),\,(1,0),\,(1,1),\,(0,1) that will be shear-transformed into the parallelogram P with vertices (0,0),\,(1,0.57),\,(1,1.57),\,(0,1).
We can see that after the linear transformation, the purple arrow has not changed direction, i.e. it is an Eigenvector of the shear transformation.
On the other hand, the red arrow changed direction, and thus is not an Eigenvector of the shear transformation.
Keep up with the algebra torture!
A fabulous and simple property of symmetric matrices that we can explain here!
Let us assume that x is an eigenvector of A corresponding to the eigenvalue λ_1 and y an eigenvector of A corresponding to the eigenvalue λ_2, with λ_1≠λ_2.
Ax=\lambda_1x \\ Ay=\lambda_2y
Let us multiply each one by the other transposed Eigenvector. y^{\intercal}Ax=\lambda_1y^{\intercal}x \\ x^{\intercal}A^{\intercal}y=\lambda_2x^{\intercal}y
Now subtract the second equation from the first one and use the commutativity of the scalar product:
y^{\intercal}Ax-x^{\intercal}A^{\intercal}y=\lambda_1y^{\intercal}x - \lambda_2x^{\intercal}y \\ 0 = (\lambda_1 - \lambda_2)y^{\intercal}x
Because we know that \lambda_1-\lambda_2\neq 0, then y^{\intercal}x = 0, i.e., \mathbf{x}\perp\mathbf{y}, i.e. are orthogonal!
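We can check this property numerically (a sketch, reusing the covariance matrix of the simulated xyz data from above):
V <- eigen(cov(xyz))$vectors  # Eigenvectors of a symmetric (covariance) matrix
round(t(V) %*% V, 10)         # identity matrix: the pairwise dot products are 0, so the Eigenvectors are orthogonal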
So, what does the Eigendecomposition of a variance-covariance matrix tell us?
Hi, Presenter. The explanation of this part is very useful and quite simple, so everyone can understand what orthogonality is. It is a matter of simple equation operations and subtractions.
Keep up with the algebra torture!
If v_i' v_i = 1, then Av_i=\lambda_iv_i can be written as:
v_i' A v_i = \lambda_i
In fact, v' A v is the variance of a linear combination with weights in v, i.e. \text{Var}(v_i'\,A)=v_i'\,\text{Var}(A)\,v_i.
Hence, we can connect the dots!
Remember that the Eigenvalues in our variance-covariance matrix A are directly related to the variance!
Thus, to find a vector v that maximizes the variance, v' A v, all we must do is to choose the Eigenvector corresponding to the largest Eigenvalue \lambda_i!
So that the maximum variance is \lambda_1!
The explained variance of each Eigenvector obeys the order: \lambda_1 > \lambda_2 > \dots > \lambda_k.
This allows us to condense a larger number of original variables into a smaller set of selected vectors with minimal loss of information (i.e. dimensionality reduction).
This is a good starting point to set us in the direction of the unconstrained ordination methods we will study today!
They allow us to:
Assess relationships within a set of variables (species or environmental variables);
Find key components of variation among samples, sites, species;
Reduce the number of dimensions in multivariate data while limiting substantial loss of information;
Create new variables for use in subsequent analyses.
Here, we will learn:
Principal Component Analysis;
Principal Coordinate Analysis;
Non-Metric Multidimensional Scaling;
The Principal Component Analysis (PCA) is a linear dimensionality-reduction technique, i.e. it reduces the number of dimensions in strongly correlated data.
In a nutshell, PCA linearly transforms the features from the original space into a new feature space whose axes, the principal components, explain most of the variance in the dataset, i.e. maximize the spread of the data along each successive axis.
The principal component space can be written as:
Z_p = \sum_{j=1}^{p} \phi_j X_j
where Z_p is a principal component, the \phi_j are its loadings (weights), and the X_j are the original variables.
That is, the data can be reconstructed with a simple linear combination of the components.
Future versions of this should include a figure (such as a heatmap) with the original standardized data on a left side, and on the right side, a heatmap of the loadings times a heatmap of the components (which can be thought as equal to the sum of rank-one matrices) being equal to the heatmap of the reconstructed data.
PCA can be computed in at least four different ways.
For the sake of simplicity, we will focus here on how to obtain principal components from a correlation matrix.
We will learn how to do it from "scratch" and then how to use R
packages to compute the principal components.
In R
, from scratch!
data(varechem)
str(varechem)
# 'data.frame': 24 obs. of 14 variables:
#  $ N       : num  19.8 13.4 20.2 20.6 23.8 22.8 26.6 24.2 29.8 28.1 ...
#  $ P       : num  42.1 39.1 67.7 60.8 54.5 40.9 36.7 31 73.5 40.5 ...
#  $ K       : num  140 167 207 234 181 ...
#  $ Ca      : num  519 357 973 834 777 ...
#  $ Mg      : num  90 70.7 209.1 127.2 125.8 ...
#  $ S       : num  32.3 35.2 58.1 40.7 39.5 40.8 33.8 27.1 42.5 60.2 ...
#  $ Al      : num  39 88.1 138 15.4 24.2 ...
#  $ Fe      : num  40.9 39 35.4 4.4 3 ...
#  $ Mn      : num  58.1 52.4 32.1 132 50.1 ...
#  $ Zn      : num  4.5 5.4 16.8 10.7 6.6 9.1 7.4 5.2 9.3 9.1 ...
#  $ Mo      : num  0.3 0.3 0.8 0.2 0.3 0.4 0.3 0.3 0.3 0.5 ...
#  $ Baresoil: num  43.9 23.6 21.2 18.7 46 40.5 23 29.8 17.6 29.9 ...
#  $ Humdepth: num  2.2 2.2 2 2.9 3 3.8 2.8 2 3 2.2 ...
#  $ pH      : num  2.7 2.8 3 2.8 2.7 2.7 2.8 2.8 2.8 2.8 ...
data(varechem)
# Step 1
Y <- varechem[, 1:2]
head(Y)
#       N    P
# 18 19.8 42.1
# 15 13.4 39.1
# 24 20.2 67.7
# 27 20.6 60.8
# 23 23.8 54.5
# 19 22.8 40.9
Starting point: a matrix Y of n observations and p normally distributed continuous variables;
Standardizing observations, as in Y_{std} = \frac{y_i - \bar{y}}{\sigma_y}; which is the same as centring, as in y_c = [y_i - \bar{y}], and then scaling, as in y_s = \frac{y_i}{\sigma_y};
data(varechem)
# Step 1
Y <- varechem[, 1:2]
# Step 2
Y_std <- as.matrix(scale(Y))
head(Y_std)
#              N          P
# 18 -0.46731082 -0.1993234
# 15 -1.62503567 -0.4000407
# 24 -0.39495301  1.5134640
# 27 -0.32259521  1.0518143
# 23  0.25626722  0.6303080
# 19  0.07537271 -0.2796103
round(apply(Y_std, 2, mean))
# N P
# 0 0
round(apply(Y_std, 2, sd))
# N P
# 1 1
The presenter may recall here or even ask the participants what the standardization is doing.
Compute the variance-covariance matrix R = cov(Y_{std});
data(varechem)
# Step 1
Y <- varechem[, 1:2]
# Step 2
Y_std <- as.matrix(scale(Y))
# Step 3
(Y_R <- cov(Y_std))
#            N          P
# N  1.0000000 -0.2511603
# P -0.2511603  1.0000000
Perform the Eigendecomposition of the covariance matrix to obtain the matrix U of Eigenvectors, containing the Principal Components;
data(varechem)
# Step 1
Y <- varechem[, 1:2]
# Step 2
Y_std <- as.matrix(scale(Y))
# Step 3
Y_R <- cov(Y_std)
# Step 4
(Eigenvalues <- eigen(Y_R)$values)
# [1] 1.2511603 0.7488397
(Eigenvectors <- eigen(Y_R)$vectors)
#            [,1]       [,2]
# [1,] -0.7071068 -0.7071068
# [2,]  0.7071068 -0.7071068
The Eigenvectors here are the Principal Components, and as we have seen, each Eigenvector has its corresponding Eigenvalue.
We can represent the distances from the observations to the first Eigenvector (PC1
, in red).
The first principal component is drawn so that the variation of the values along its line is maximal.
The arrows on the principal components are obtained by multiplying their Eigenvalues by the Eigenvectors.
In this first representation, we can observe that a first direction (or the first linear component) is drawn attempting to maximize the variance of the data.
Participants may ask you how different this is from a linear regression. One key difference is in how the squared errors are minimized: here they are minimized perpendicularly to the straight line (at 90 degrees, making the fit orthogonal), while in a linear regression the squared errors are minimized in the y-direction.
We can then represent the distances from the observations to the second Eigenvector (PC2
, in orange).
The second principal component is also drawn maximizing the variance of the data.
Note how the principal components are orthogonal!
We represented the Eigenvectors, i.e. the principal components!
But, what is the use of the Eigenvalues?
Here, a second direction (or the second linear component) is drawn such that the variance of the data is maximized with respect to this second component.
We have seen that the Eigenvalues represent the magnitude (the variance) in the principal components.
In fact, the sum of all Eigenvalues is equal to the sum of variances, which are represented on the diagonal of the variance-covariance matrix.
sum(diag(cov(Y_std)))
# [1] 2
sum(eigen(cov(Y_std))$values)
# [1] 2
Intuitively, one can obtain the relative influence of each Eigenvector v_k (or \text{PC}_k) by dividing its Eigenvalue by the sum of all Eigenvalues:
\text{Explained variance of}~v_{k} = \frac{\lambda_{k}}{\sum^p_{i=1}{\lambda_{i}}}
By doing this, we can say that \text{PC}1 explains 63% of the variance in the data, while \text{PC}2 explains 37% of the variance.
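In R, with the Eigenvalues computed above, this is a one-liner (a sketch):
Eigenvalues / sum(Eigenvalues)
# approximately 0.63 and 0.37: the proportions of variance explained by PC1 and PC2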
Finally, we can proceed to the last step of our computation of principal components!
Starting point: a matrix Y of n observations and p normally distributed continuous variables;
Standardizing observations, as in Y_{std} = \frac{y_i - \bar{y}}{\sigma_y}; which is the same as centring, as in y_c = [y_i - \bar{y}], and then scaling, as in y_s = \frac{y_i}{\sigma_y};
Compute the variance-covariance matrix R = cov(Y_{std});
Perform the Eigendecomposition of the covariance matrix to obtain the matrix U of Eigenvectors, containing the Principal Components;
Obtain the feature space by multiplying U with the standardized matrix Y_{std}, i.e. the score matrix F.
In R, from scratch!
# Step 1
Y <- varechem[, 1:2]

# Step 2
Y_std <- as.matrix(scale(Y))

# Step 3
Y_R <- cov(Y_std)

# Step 4
Eigenvalues <- eigen(Y_R)$values
Eigenvectors <- eigen(Y_R)$vectors

# Step 5
F_PrComps <- Y_std %*% Eigenvectors

head(F_PrComps)
#          [,1]       [,2]
# 18  0.1894957  0.4713816
# 15  0.8662023  1.4319452
# 24  1.3494546 -0.7909067
# 27  0.9718543 -0.5156358
# 23  0.2644868 -0.6269034
# 19 -0.2510109  0.1444178
The score matrix F (object F_PrComps) allows one to rotate the new data space, so that it is represented in relation to the principal components.
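As a quick check (a sketch reusing the objects above): the covariance matrix of the scores should be (nearly) diagonal, with the Eigenvalues on the diagonal, i.e. the principal components are uncorrelated and each carries the variance given by its Eigenvalue.

# Covariances between principal components are (numerically) zero;
# the diagonal holds the Eigenvalues
zapsmall(cov(F_PrComps))
# expected: approximately diag(1.251, 0.749)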
[Two plots: the standardized observations on the original axes (N vs. P), and the same observations rotated onto the principal components (PC1 vs. PC2).]
The presenter here should emphasize the rotation and explain what the scores are. You can hover over the points and show participants the position of each point in the "de-rotated" plot and in the "rotated" one, highlighting that the "new axes" are now PC1 and PC2.
This understanding will be useful when the participants use the PCA functions implemented in R.
PCA can also be computed using the stats::prcomp(), stats::princomp(), vegan::rda(), and ade4::dudi.pca() functions.
How does our PCA from scratch compare to stats::prcomp()?

Our PCA from scratch:

data(varechem)
Y <- varechem[, 1:2]
Y_std <- as.matrix(scale(Y))
Y_R <- cov(Y_std)
Eigenvalues <- eigen(Y_R)$values
Eigenvectors <- eigen(Y_R)$vectors
F_PrComps <- Y_std %*% Eigenvectors

head(F_PrComps)
#          [,1]       [,2]
# 18  0.1894957  0.4713816
# 15  0.8662023  1.4319452
# 24  1.3494546 -0.7909067
# 27  0.9718543 -0.5156358
# 23  0.2644868 -0.6269034
# 19 -0.2510109  0.1444178

With stats::prcomp():

PCA_prcomp <- prcomp(Y, center = TRUE, scale = TRUE)
# or: PCA_prcomp <- prcomp(Y_std)

head(PCA_prcomp$x)
#           PC1        PC2
# 18 -0.1894957 -0.4713816
# 15 -0.8662023 -1.4319452
# 24 -1.3494546  0.7909067
# 27 -0.9718543  0.5156358
# 23 -0.2644868  0.6269034
# 19  0.2510109 -0.1444178
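The scores match, except that the signs of the axes are flipped: the sign of an Eigenvector is arbitrary, so different implementations may return mirrored axes. A quick check reusing the objects above:

# Absolute values of the scores should agree (up to rounding),
# even if the signs of the axes differ
max(abs(abs(F_PrComps) - abs(PCA_prcomp$x)))
# expected to be essentially 0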
And stats::princomp()?
PCA_princomp <- princomp(Y_std)

head(PCA_princomp$scores)
#        Comp.1     Comp.2
# 18 -0.1894957 -0.4713816
# 15 -0.8662023 -1.4319452
# 24 -1.3494546  0.7909067
# 27 -0.9718543  0.5156358
# 23 -0.2644868  0.6269034
# 19  0.2510109 -0.1444178
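A small caveat, in case participants compare standard deviations rather than scores: princomp() estimates variances with the divisor n rather than n - 1, so its component variances are a slightly rescaled version of our Eigenvalues, while the scores (which depend only on the Eigenvectors) are unaffected, up to sign. A sketch using the objects above:

n <- nrow(Y_std)
PCA_princomp$sdev^2        # component variances, computed with divisor n
Eigenvalues * (n - 1) / n  # should be approximately equal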
And vegan::rda()?
PCA_vegan_rda <- rda(Y_std)

scores(PCA_vegan_rda,
       display = "sites",
       scaling = 1,
       choices = seq_len(PCA_vegan_rda$CA$rank),
       const = sqrt(PCA_vegan_rda$tot.chi * (nrow(PCA_vegan_rda$CA$u) - 1)))[1:5, ]
#           PC1        PC2
# 18 -0.1894957 -0.4713816
# 15 -0.8662023 -1.4319452
# 24 -1.3494546  0.7909067
# 27 -0.9718543  0.5156358
# 23 -0.2644868  0.6269034
vegan::rda() is a bit special: it uses alternative scalings. We will not cover them here, but you can study vignette("decision-vegan").
Tell participants that the name rda refers to a different, constrained ordination technique (redundancy analysis), but that running rda() with a single, unconstrained data matrix performs a PCA.
We have implemented PCA on a two-variable dataset, for simplicity.
Let us now apply it to our fish species dataset.
For this, we will use the vegan::rda() function on the Hellinger-transformed fish data and summarise the results:
spe.h.pca <- rda(spe.hel)
# summary(spe.h.pca)
The first lines of summary.rda() tell us about the Total variance and Unconstrained variance in our model.
# Partitioning of variance:
#                 Inertia Proportion
# Total            0.5023          1
# Unconstrained    0.5023          1
# Importance of components:
#                          PC1     PC2     PC3     PC4     PC5     PC6     PC7
# Eigenvalue            0.2491 0.06592 0.04615 0.03723 0.02125 0.01662 0.01477
# Proportion Explained  0.4959 0.13122 0.09188 0.07412 0.04230 0.03309 0.02940
# Cumulative Proportion 0.4959 0.62715 0.71903 0.79315 0.83544 0.86853 0.89794
# ...
#                           PC14     PC15     PC16     PC17      PC18      PC19
# Eigenvalue            0.002612 0.001505 0.001387 0.001037 0.0007815 0.0004749
# Proportion Explained  0.005200 0.002996 0.002761 0.002065 0.0015557 0.0009454
# Cumulative Proportion 0.987229 0.990225 0.992986 0.995051 0.9966069 0.9975523
This is followed by the Eigenvalues, and their contribution to the variance.
In fact, if we sum all our Eigenvalues, we will obtain the amount of unconstrained variance explained by the analysis!
sum(spe.h.pca$CA$eig)
# [1] 0.5023429
Since we have not constrained our ordination, the proportional unconstrained variance is equal to the total variance.
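As a quick consistency check (a sketch, assuming spe.hel is still in the workspace): for rda() run on data without standardization (the default), the total inertia equals the sum of the column variances.

sum(diag(cov(spe.hel)))
# expected to match sum(spe.h.pca$CA$eig), i.e. about 0.502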
Take a moment to explain the proportion explained, and show that the cumulative proportion equals 1 at the 27th PC.
The next part of the output relates to the scaling, the species scores, and the site scores.
# Eigenvalue            0.0004263 0.0002812 0.0002188 0.0001382 0.0000876
# Proportion Explained  0.0008487 0.0005598 0.0004356 0.0002752 0.0001744
# Cumulative Proportion 0.9984010 0.9989608 0.9993965 0.9996717 0.9998460
#                            PC25      PC26      PC27
# Proportion Explained  1.062e-04 2.938e-05 1.835e-05
# Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00
#
# Scaling 2 for species and site scores
# * Species are scaled proportional to eigenvalues
# * Sites are unscaled: weighted dispersion equal on all dimensions
# * General scaling constant of scores:  1.953663
#
# Species scores
# BCO -0.20055  0.08332 -0.074787 -0.0504875 0.073890  0.0249842
# PCH -0.14626  0.05268 -0.072012 -0.0432572 0.050318  0.0178776
# GAR -0.35085 -0.09353  0.198664  0.0178669 0.023796 -0.0971362
# BBO -0.24167  0.03598 -0.079528 -0.0339049 0.096690  0.0620979
# ABL -0.42269 -0.22879  0.007158  0.1128353 0.006759  0.1248913
# ANG -0.20521  0.11557 -0.072060 -0.0159902 0.072030 -0.0003801
#
# Site scores (weighted sums of species scores)
Species refer to your descriptors (i.e. the columns in your dataset), which here are the fish species.
Scores refer to the position of every species along the principal components.
Sites represent the rows in your dataset, which here are the different sites along the Doubs river.
This information can be obtained with the scores() function that we used before:

scores(spe.h.pca, display = "species")
scores(spe.h.pca, display = "sites")
Here, we have 27 principal components. However, we can apply algorithms to select the lowest number of principal components that still account for a large variance in the data.
We can select the principal components that capture more variance than the average of all principal components (the Kaiser-Guttman criterion). We do this by:
Extracting the Eigenvalues associated to the principal components;
Subsetting the Eigenvalues above the mean Eigenvalue:
ev <- spe.h.pca$CA$eig
# ev[ev > mean(ev)]
n <- length(ev)
barplot(ev, main = "", col = "grey", las = 2)
abline(h = mean(ev), col = "red3", lwd = 2)
legend("topright", "Average eigenvalue", lwd = 2, col = "red3", bty = "n")
The broken-stick model retains components that explain more variance than would be expected by randomly dividing the variance into p parts.
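For reference, a usual formulation of the broken-stick expectation for the k-th of p components is:

b_k = \frac{1}{p} \sum_{i=k}^{p} \frac{1}{i}

vegan::bstick() returns these expectations scaled by the total inertia, so they can be compared directly with the Eigenvalues: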
head(bstick(spe.h.pca))
#        PC1        PC2        PC3        PC4        PC5        PC6 
# 0.07240169 0.05379640 0.04449375 0.03829199 0.03364067 0.02991961
screeplot(spe.h.pca, bstick = TRUE, type = "lines")
All that is left is to discuss scaling and to visualize our results.
Let us practice and compute a PCA on the standardized environmental variables for the same dataset.
env.pca <- rda(env.z)
# summary(env.pca, scaling = 2)
Determine our subset of Eigenvalues and their corresponding Eigenvectors:
ev <- env.pca$CA$eig
ev[ev > mean(ev)]
#      PC1      PC2      PC3 
# 5.968749 2.163818 1.065164
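A screeplot with the broken-stick reference (the same call we used above for the fish data) offers a quick visual check of this subset:

screeplot(env.pca, bstick = TRUE, type = "lines")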
plot()
The information computed by the PCA can be represented with biplots.
We can produce a quick and dirty biplot of the PCA using the function plot() in base R.
plot(spe.h.pca)
biplot()
biplot() from base R allows for a better interpretation.
biplot(spe.h.pca)
The arrows are plotted to show the directionality and angle of the descriptors in the ordination.
Descriptors at 180 degrees of each other are negatively correlated;
Descriptors at 90 degrees of each other have zero correlation;
Descriptors at 0 degrees of each other are positively correlated.
Type 2 scaling (default): distances among objects are not approximations of Euclidean distances; angles between descriptor (species) vectors reflect their correlations.
Type 1 scaling: attempts to preserve the Euclidean distance (in multidimensional space) among objects (sites); the angles among descriptor (species) vectors are not meaningful.
biplot(spe.h.pca, scaling = 1)
biplot(spe.h.pca, scaling = 2)
2: Best for interpreting relationships among descriptors (species)!
1: Best for interpreting relationships among objects (sites)!
Using everything you have learned, compute a PCA on the mite species abundance data
data(mite)
Be ready to discuss and answer:
Compute PCA on the Hellinger-transformed species data
mite.spe.hel <- decostand(mite, method = "hellinger")
mite.spe.h.pca <- rda(mite.spe.hel)
Apply the Kaiser-Guttman criterion
ev <- mite.spe.h.pca$CA$eig
ev[ev > mean(ev)]
n <- length(ev)
barplot(ev, main = "Eigenvalues", col = "grey", las = 2)
abline(h = mean(ev), col = "red3", lwd = 2)
legend("topright", "Average eigenvalue", lwd = 2, col = "red3", bty = "n")
biplot(mite.spe.h.pca, col = c("red3", "grey15"))
The PCoA is similar in spirit to PCA, but it takes dissimilarities as input data!
It was previously known as Classical Multidimensional Scaling (MDS), and it aims to faithfully represent the distances in the lowest possible number of dimensions.
It begins with the (i) computation of a distance matrix for the p elements, then (ii) the centering of the matrix by rows and columns, and finally, the (iii) Eigendecomposition of the centered distance matrix.
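A minimal from-scratch sketch of these three steps, assuming the Hellinger-transformed fish data spe.hel from earlier (object names here are illustrative):

D <- as.matrix(dist(spe.hel))      # (i) distance matrix among sites
A <- -0.5 * D^2
n <- nrow(A)
C <- diag(n) - matrix(1/n, n, n)   # centering matrix
G <- C %*% A %*% C                 # (ii) centering by rows and columns
eig <- eigen(G)                    # (iii) Eigendecomposition
# Principal coordinates: Eigenvectors scaled by the square root of their
# (positive) Eigenvalues; the signs of the axes are arbitrary
coords <- eig$vectors[, 1:2] %*% diag(sqrt(eig$values[1:2]))
head(coords)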
To compute a PCoA, we can use the cmdscale() or pcoa() functions from the stats and ape packages:
Run a PCoA on the Hellinger distance-transformed fish dataset:
library(ape)

spe.h.pcoa <- pcoa(dist(spe.hel))

summary(spe.h.pcoa)
#            Length Class      Mode     
# correction   2    -none-     character
# note         1    -none-     character
# values       5    data.frame list     
# vectors    810    -none-     numeric  
# trace        1    -none-     numeric
head(spe.h.pcoa$values)
#   Eigenvalues Relative_eig Broken_stick Cumul_eig Cumul_br_stick
# 1   7.2247720   0.49593627   0.14412803 0.4959363      0.1441280
# 2   1.9115366   0.13121526   0.10709099 0.6271515      0.2512190
# 3   1.3384761   0.09187817   0.08857247 0.7190297      0.3397915
# 4   1.0797364   0.07411728   0.07622679 0.7931470      0.4160183
# 5   0.6161946   0.04229798   0.06696753 0.8354450      0.4829858
# 6   0.4820459   0.03308949   0.05956013 0.8685345      0.5425459
We can also see the Eigenvectors associated with each Eigenvalue, which contain the coordinates of each site in Euclidean space.
head(spe.h.pcoa$vectors)[, 1:5]
#         Axis.1      Axis.2      Axis.3     Axis.4       Axis.5
# 1 -0.510156945  0.34756659  0.60308559 -0.3105433 -0.197553521
# 2 -0.697567847  0.04025083  0.10420566 -0.2256121  0.163849922
# 3 -0.639279415 -0.02078733  0.03988954 -0.2636284  0.130319412
# 4 -0.411968300 -0.13472113 -0.14290276 -0.2900746  0.003953021
# 5  0.005052046 -0.09541567 -0.30540211 -0.2852990 -0.319905818
# 6 -0.292971968 -0.11423456 -0.31865809 -0.2359718 -0.055670665
biplot.pcoa()
We can display the distances between sites using the biplot.pcoa() function, as well as represent the species associated with each site.
biplot.pcoa(spe.h.pcoa, spe.hel)
PCoA can also be used to capture information contained in non-metric distances, such as the popular Bray-Curtis distance. Let us give it a try:
spe.bray.pcoa <- pcoa(spe.db.pa)
spe.bray.pcoa$values$Eigenvalues
#  [1]  3.695332e+00  1.197352e+00  8.465325e-01  5.253896e-01  4.147227e-01
#  [6]  2.977969e-01  1.913094e-01  1.554757e-01  1.307617e-01  1.272105e-01
# [11]  8.550928e-02  4.357746e-02  3.862321e-02  2.727686e-02  1.303318e-02
# [16]  6.704839e-03  3.817779e-03  1.300463e-03  0.000000e+00 -3.597737e-05
# [21] -4.167639e-03 -8.970687e-03 -1.473688e-02 -1.601025e-02 -2.165231e-02
# [26] -3.127263e-02 -3.443356e-02 -3.763160e-02 -6.064415e-02 -6.882758e-02
Note the negative eigenvalues!
This is because non-metric distances cannot be represented in Euclidean space without corrections (see Legendre & Legendre 2012 for more details on this):
spe.bray.pcoa <- pcoa(spe.db.pa, correction = "cailliez")
The corrected Eigenvalues are now in a new column!
spe.bray.pcoa$values$Corr_eig
#  [1] 5.208616362 1.735210656 1.249426268 0.813115534 0.690430331 0.513559025
#  [7] 0.387520962 0.363702867 0.296206500 0.273082611 0.215178142 0.154525233
# [13] 0.150803988 0.118203435 0.088726577 0.073233820 0.069460754 0.057344134
# [19] 0.056008651 0.054525339 0.049309077 0.040482831 0.037906147 0.034709860
# [25] 0.029748166 0.023624321 0.019072278 0.003071851 0.000000000 0.000000000
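A quick sanity check after the correction (a sketch; exact values may vary slightly, but none should be negative):

all(spe.bray.pcoa$values$Corr_eig >= 0)
# expected: TRUE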
Use a biplot without the species to represent it!
biplot.pcoa(spe.bray.pcoa)
Compute a PCoA on the Hellinger-transformed mite species abundance data
Be ready to answer:
mite.spe <- mite
mite.spe.hel <- decostand(mite.spe, method = "hellinger")
mite.spe.h.pcoa <- pcoa(dist(mite.spe.hel))
biplot.pcoa(mite.spe.h.pcoa, mite.spe.hel)
In PCA and PCoA, objects are ordinated in a reduced number of dimensions, but the full solution generally has more than 2 axes;
2D biplots may not represent all of the variation within the dataset;
Sometimes, we aim at representing the data in a specified, smaller number of dimensions;
How can we plot the ordination space so that it represents as much of the variation in the data as possible?
We can attempt using non-metric multidimensional scaling!
NMDS applies an iterative procedure that tries to position the objects in the requested number of dimensions in such a way as to minimize a stress function (scaled from 0 to 1), which measures the goodness-of-fit of the distance adjustment in the reduced-space configuration.
Consequently, the lower the stress value, the better the representation of the objects in the ordination space.
NMDS is implemented in vegan as metaMDS(), where:

distance specifies the distance metric to use;
k specifies the number of dimensions.

# Dune meadow vegetation data: dune has cover class values of 30 species on 20 sites.
data(dune)
spe.nmds <- metaMDS(dune, distance = 'bray', k = 2)
The Shepard diagram and stress values can be obtained from stressplot():
spe.nmds$stress
# [1] 0.1183186

stressplot(spe.nmds, main = "Shepard plot")
The Shepard plot shows a strong correlation between observed dissimilarity and ordination distance (R² > 0.95), highlighting the high goodness of fit of the NMDS.
biplot()
Construct the biplot
plot(spe.nmds, type = "none",
     main = paste("NMDS/Bray - Stress =", round(spe.nmds$stress, 3)),
     xlab = c("NMDS1"), ylab = "NMDS2")
points(scores(spe.nmds, display = "sites", choices = c(1, 2)),
       pch = 21, col = "black", bg = "steelblue", cex = 1.2)
text(scores(spe.nmds, display = "species", choices = c(1)),
     scores(spe.nmds, display = "species", choices = c(2)),
     labels = rownames(scores(spe.nmds, display = "species")),
     col = "red", cex = 0.8)
biplot()
The biplot of the NMDS shows a group of sites that cluster closely together, characterized by the species BLA, TRU, VAI, LOC, CHA and OMB, while the other species form a cluster of sites in the upper right part of the plot.
Four sites in the lower part of the graph are strongly different from the others.
Run the NMDS of the mite.spe species abundances in 2 dimensions based on a Bray-Curtis distance.
Assess the goodness-of-fit of the ordination and interpret the biplot.
Recall these helpful functions:
?metaMDS
?stressplot
Run the NMDS of the mite species abundances in 2 dimensions based on a Bray-Curtis distance:
mite.nmds <- metaMDS(mite.spe, distance = 'bray', k = 2)
plot(mite.nmds, type = "none",
     main = paste("NMDS/Bray - Stress =", round(mite.nmds$stress, 3)),
     xlab = c("NMDS1"), ylab = "NMDS2")
points(scores(mite.nmds, display = "sites", choices = c(1, 2)),
       pch = 21, col = "black", bg = "steelblue", cex = 1.2)
text(scores(mite.nmds, display = "species", choices = c(1)),
     scores(mite.nmds, display = "species", choices = c(2)),
     labels = rownames(scores(mite.nmds, display = "species")),
     col = "red", cex = 0.8)
There is not really any obvious clustering of sites in the NMDS biplot. This tells us that the species occurred in most of the sites. Only a few sites shelter specific communities.
stressplot(mite.nmds, main = "Shepard plot")
The observed dissimilarity is highly correlated with ordination distance, and the NMDS stress value is relatively low, which tells us that the NMDS ordination is relatively accurate.
Many ordination techniques exist, but their particularities should guide your choice of method.
|      | Distance preserved | Variables | Maximum number of axes |
|------|--------------------|-----------|------------------------|
| PCA  | Euclidean | Quantitative data, linear relationships | p |
| CA   | Chi2 | Non-negative, quantitative homogeneous data, binary data | p-1 |
| PCoA | User defined | Quantitative, semi-quantitative, mixed data | p-1 |
| NMDS | User defined | Quantitative, semi-quantitative, mixed data | User defined |
What does PCA stand for?
Principal Component Analysis

Which one is the best way to visualize the distances between the community composition of many sites?
Principal Coordinate Analysis (PCoA)

What does an eigenvalue represent in PCA?
The proportion of variance explained by a principal component
What is wrong with this plot?