Chapter 10 Transformations
Communities sampled over homogeneous or short environmental gradients can have species compositions with few zeroes, so that Euclidean distances may be enough to characterize them.
Nevertheless, this is rarely the case.
Species may be highly frequent when conditions are favourable, or may be absent from many sites. This skewness can distort our analyses.
We may then have to transform our composition data before analyzing them.
In R, we can rely on vegan::decostand() for many types of transformations.
Take a look into the help of this function to see the available options:
?decostand()
10.1 Presence-absence transformation
We can set the argument method to "pa" in decostand() to transform our abundance data into presence-absence data:
Let us recall our spe data set:
spe[1:6, 1:6]
## CHA TRU VAI LOC OMB BLA
## 1 0 3 0 0 0 0
## 2 0 5 4 3 0 0
## 3 0 5 5 5 0 0
## 4 0 4 5 5 0 0
## 5 0 2 3 2 0 0
## 6 0 3 4 5 0 0
Let us transform spe abundances to presence-absences:
spe.pa <- decostand(spe, method = "pa")
spe.pa[1:6, 1:6]
## CHA TRU VAI LOC OMB BLA
## 1 0 1 0 0 0 0
## 2 0 1 1 1 0 0
## 3 0 1 1 1 0 0
## 4 0 1 1 1 0 0
## 5 0 1 1 1 0 0
## 6 0 1 1 1 0 0
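As a quick check, the same transformation can be sketched in base R without vegan. The matrix toy below is only a hand-copied excerpt (the first six sites and species of spe as printed above), and the name toy.pa is an illustrative assumption, not part of the workshop data.

```r
# Hand-copied excerpt of spe (first six sites and species, as printed above)
toy <- matrix(
  c(0, 3, 0, 0, 0, 0,
    0, 5, 4, 3, 0, 0,
    0, 5, 5, 5, 0, 0,
    0, 4, 5, 5, 0, 0,
    0, 2, 3, 2, 0, 0,
    0, 3, 4, 5, 0, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(1:6, c("CHA", "TRU", "VAI", "LOC", "OMB", "BLA"))
)

# Any abundance greater than zero becomes 1; zeroes stay 0
toy.pa <- (toy > 0) * 1
toy.pa
```

The logical comparison does the work here; multiplying by 1 simply turns TRUE/FALSE back into numbers.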
10.2 Species profiles transformation
Sometimes, one wants to remove the effects of highly abundant units. We can transform the data into profiles of relative species abundances through the following equation:
\[y'_{ij} = \frac{y_{ij}}{y_{i+}}\]
where \(y_{i+}\) indicates the sample total count over all \(j = 1, \dots, m\) species, for the \(i\)-th sample.
In decostand(), we can use the method "total":
spe.total <- decostand(spe, method = "total")
spe.total[1:5, 1:6]
## CHA TRU VAI LOC OMB BLA
## 1 0 1.00000000 0.00000000 0.00000000 0 0
## 2 0 0.41666667 0.33333333 0.25000000 0 0
## 3 0 0.31250000 0.31250000 0.31250000 0 0
## 4 0 0.19047619 0.23809524 0.23809524 0 0
## 5 0 0.05882353 0.08823529 0.05882353 0 0
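The equation above can also be applied directly in base R, which may help to see what the method does. The abundances in toy are hypothetical values chosen for illustration, not taken from spe.

```r
# Hypothetical abundances for three sites (rows) and three species (columns)
toy <- matrix(c( 0, 5, 4,
                 2, 2, 4,
                10, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("site1", "site2", "site3"),
                              c("sp1", "sp2", "sp3")))

# Divide every abundance by its site (row) total: y_ij / y_i+
toy.total <- toy / rowSums(toy)
toy.total

# Each site profile now sums to 1
rowSums(toy.total)
```

Dividing a matrix by a vector of row totals works because R recycles the vector down the columns, so element [i, j] is divided by the total of row i.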
10.3 Hellinger transformation
We can take the square-root of the species profile transformation and obtain the Hellinger transformation, which has very good mathematical properties and allows us to reduce the effects of \(y_{ij}\) values that are extremely large.
\[y'_{ij} = \sqrt{\frac{y_{ij}}{y_{i+}}}\]
In decostand(), we can use the method "hellinger":
spe.hel <- decostand(spe, method = "hellinger")
spe.hel[1:5, 1:6]
## CHA TRU VAI LOC OMB BLA
## 1 0 1.0000000 0.0000000 0.0000000 0 0
## 2 0 0.6454972 0.5773503 0.5000000 0 0
## 3 0 0.5590170 0.5590170 0.5590170 0 0
## 4 0 0.4364358 0.4879500 0.4879500 0 0
## 5 0 0.2425356 0.2970443 0.2425356 0 0
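A minimal base R sketch of the same formula, using hypothetical abundances, shows two useful consequences: the squared transformed values of each site sum to 1, and two sites that share no species end up at a Euclidean distance of \(\sqrt{2}\).

```r
# Hypothetical abundances: site1 and site3 share no species
toy <- matrix(c( 0, 5, 4,
                 2, 2, 4,
                10, 0, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("site1", "site2", "site3"),
                              c("sp1", "sp2", "sp3")))

# Hellinger transformation: square root of the species profiles
toy.hel <- sqrt(toy / rowSums(toy))
toy.hel

# Squared values sum to 1 within each site
rowSums(toy.hel^2)

# Euclidean distances computed on the transformed data
toy.dhel <- dist(toy.hel)
toy.dhel
```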
10.4 Z-score standardization
Z-score standardization, also known as standard score normalization, rescales a variable so that it has a mean of 0 and a standard deviation of 1. It involves subtracting the mean of the variable from each value and dividing by the standard deviation. This changes the location and scale of the data, not the shape of their distribution.
Standardizing environmental variables is crucial as you cannot compare the effects of variables with different units:
# ?decostand
env.z <- decostand(env, method = "standardize")
This centres and scales the variables to make your downstream analysis more appropriate:
apply(env.z, 2, mean)
## das alt pen deb pH
## 1.000429e-16 1.814232e-18 -1.659010e-17 1.233099e-17 -4.096709e-15
## dur pho nit amm oxy
## 3.348595e-16 1.327063e-17 -8.925898e-17 -4.289646e-17 -2.886092e-16
## dbo
## 7.656545e-17
apply(env.z, 2, sd)
## das alt pen deb pH dur pho nit amm oxy dbo
## 1 1 1 1 1 1 1 1 1 1 1
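The arithmetic behind this standardization can be sketched by hand. The data frame toy.env below holds hypothetical values for two variables in very different units; the comparison with base R scale() is included only as a cross-check.

```r
# Hypothetical environmental table: altitude (m) and pH
toy.env <- data.frame(alt = c(934, 932, 914, 854, 849),
                      pH  = c(7.9, 8.0, 8.3, 8.0, 8.1))

# Manual z-scores: subtract the column mean, divide by the column sd
toy.z <- sapply(toy.env, function(x) (x - mean(x)) / sd(x))
toy.z

# Base R scale() performs the same centring and scaling
all.equal(toy.z, scale(toy.env), check.attributes = FALSE)

# Column means are (numerically) 0 and standard deviations are 1
colMeans(toy.z)
apply(toy.z, 2, sd)
```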
We will see more details about this transformation in the next sections!
10.4.0.1 Little review
Association - “general term to describe any measure or coefficient to quantify the resemblance or difference between objects or descriptors. In an analysis between descriptors, zero means no association.” (Legendre and Legendre 2012).
Similarity - a measure that is “maximum (S=1) when two objects are identical and minimum when two objects are completely different.” (Legendre and Legendre 2012).
Distance (also called dissimilarity) - a measure that is “maximum (D=1) when two objects are completely different” (Legendre and Legendre 2012). Distance or dissimilarity (D) = 1-S.
Choosing an association measure depends on your data, but also on what you know, ecologically, about your data.
Here are some commonly used dissimilarity (distance) measures (recreated from Gotelli and Ellison 2004):
Measure name | Property | Description |
---|---|---|
Euclidean | Metric | Distance between two points in 2D space. |
Manhattan | Metric | Distance between two points, where the distance is the sum of differences of their Cartesian coordinates, i.e. if you were to make a right angle between the points. |
Chord | Metric | This distance is generally used to assess differences due to genetic drift. |
Mahalanobis | Metric | Distance between a point and a set distribution, where the distance is the number of standard deviations of the point from the mean of the distribution. |
Chi-square | Metric | Similar to Euclidean. |
Bray-Curtis | Semi-metric | Dissimilarity between two samples (or sites) where the sum of lower values for species present in both samples are divided by the sum of the species counted in each sample. |
Jaccard | Metric | Description |
Sorensen’s | Semi-metric | Bray-Curtis is 1 - Sorensen |
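To make the relationship D = 1 - S concrete, here is a small sketch using the Jaccard coefficient on presence-absence data. The two site vectors are hypothetical, and the Jaccard similarity is taken here as the number of shared species divided by the number of species found in at least one of the two sites.

```r
# Hypothetical presence-absence records for two sites over five species
site.a <- c(1, 1, 0, 1, 0)
site.b <- c(1, 0, 0, 1, 1)

# Species present in both sites, and species present in at least one site
shared <- sum(site.a == 1 & site.b == 1)
either <- sum(site.a == 1 | site.b == 1)

# Jaccard similarity and the corresponding dissimilarity
S.jac <- shared / either
D.jac <- 1 - S.jac
c(similarity = S.jac, dissimilarity = D.jac)

# A site compared with itself has S = 1 and therefore D = 0
1 - sum(site.a == 1 & site.a == 1) / sum(site.a == 1 | site.a == 1)
```

Species absent from both sites (the third species here) do not enter the calculation, so shared absences are not counted as resemblance.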
10.4.1 Other association metrics
Quantitative environmental data
Let us look at associations between sites based on their environmental variables (also known as Q mode analysis):
?dist
# Euclidean distance matrix of the standardized
# environmental variables
env.de <- dist(env.z, method = "euclidean")
windows() # Creates a separate graphical window (Windows only; dev.new() works on any platform)
coldiss(env.de, diag = TRUE)
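The Euclidean distance between two sites can also be computed by hand, which makes explicit what dist() returns. The values in toy.env are hypothetical, and base R scale() is used here as a stand-in for the standardization step.

```r
# Hypothetical environmental measurements for three sites
toy.env <- data.frame(alt = c(934, 932, 914),
                      pH  = c(7.9, 8.0, 8.3),
                      oxy = c(122, 103, 105))

# Standardize so that each variable contributes on the same scale
toy.z <- scale(toy.env)

# Euclidean distance between sites 1 and 2, computed by hand
d12 <- sqrt(sum((toy.z[1, ] - toy.z[2, ])^2))
d12

# The same value appears in the distance matrix returned by dist()
toy.de <- dist(toy.z, method = "euclidean")
toy.de
all.equal(d12, as.matrix(toy.de)[1, 2])
```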
We can then look at the dependence between environmental variables (also known as R mode analysis):
(env.pearson <- cor(env)) # Computing Pearson's r among variables
round(env.pearson, 2) # Rounds the coefficients to 2 decimal places
(env.ken <- cor(env, method = "kendall")) # Kendall's tau rank correlation
round(env.ken, 2)
The Pearson correlation measures the linear correlation between two variables. The Kendall tau is a rank correlation which means that it quantifies the relationship between two descriptors or variables when the data are ordered within each variable.
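The difference between the two coefficients shows up clearly with a relationship that is monotonic but not linear. The vectors below are made-up values used only for illustration.

```r
# Hypothetical data: y always increases with x, but not along a straight line
x <- 1:10
y <- exp(x)

# Pearson's r is below 1 because the relationship is not linear
r.pearson <- cor(x, y, method = "pearson")
r.pearson

# Kendall's tau equals 1 because the ranks of x and y agree perfectly
r.kendall <- cor(x, y, method = "kendall")
r.kendall
```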
In some cases, there may be mixed types of environmental variables. Q mode analysis can still be used to find associations between objects described by such variables. We’ll do this by first creating an example dataframe:
var.g1 <- rnorm(30, 0, 1)
var.g2 <- runif(30, 0, 5)
var.g3 <- gl(3, 10)
var.g4 <- gl(2, 5, 30)
(dat2 <- data.frame(var.g1, var.g2, var.g3, var.g4))
str(dat2)
summary(dat2)
A dissimilarity matrix can be generated for these mixed variables using the Gower dissimilarity matrix:
library(cluster) # Provides daisy()
?daisy # This function can handle NAs in the data
(dat2.dg <- daisy(dat2, metric = "gower"))
coldiss(dat2.dg)
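To see how a Gower dissimilarity combines variables of different types, here is a hand-rolled sketch in base R. It assumes the usual convention: numeric variables contribute the absolute difference divided by the range of the variable, factors contribute 0 when the levels match and 1 otherwise, and all variables are weighted equally. The data frame toy.mix and the helper gower.pair() are hypothetical names introduced for this example, not functions from cluster.

```r
# Hypothetical mixed data: one numeric variable and one factor
toy.mix <- data.frame(num = c(1, 3, 5),
                      fac = factor(c("a", "a", "b")))

# Gower dissimilarity between rows i and j of dat (illustrative helper)
gower.pair <- function(i, j, dat) {
  # Numeric part: absolute difference scaled by the range of the variable
  d.num <- abs(dat$num[i] - dat$num[j]) / diff(range(dat$num))
  # Factor part: 0 if the two rows share a level, 1 otherwise
  d.fac <- as.numeric(dat$fac[i] != dat$fac[j])
  # Average over the two variables
  mean(c(d.num, d.fac))
}

g12 <- gower.pair(1, 2, toy.mix)  # (0.5 + 0) / 2
g13 <- gower.pair(1, 3, toy.mix)  # (1 + 1) / 2
g23 <- gower.pair(2, 3, toy.mix)  # (0.5 + 1) / 2
c(g12, g13, g23)
```

If the cluster package is installed, comparing these values with daisy(toy.mix, metric = "gower") is a reasonable way to confirm that the convention assumed here matches the function's behaviour.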
Challenge 1 - Advanced
Calculate the Bray-Curtis and the Gower dissimilarity of species abundance CHA, TRU and VAI for sites 1, 2 and 3 (using the “spe” and “env” dataframes) without using the vegdist() function.
Challenge 1 - Advanced Solution
Subset the species data so that only sites 1, 2 and 3 are included and only the species CHA, TRU and VAI.
# "[1:3," refers to rows 1 to 3 while ",1:3]" refers to the first 3 species
# columns (in this case the three variables of interest)
spe.challenge <- spe[1:3, 1:3]
Determine total species abundance for each site of interest (the sum of the 3 species in each row). These totals form the denominator of the Bray-Curtis equation.
# () around code will cause output to print right away in
# console
(Abund.s1 <- sum(spe.challenge[1, ]))
(Abund.s2 <- sum(spe.challenge[2, ]))
(Abund.s3 <- sum(spe.challenge[3, ]))
Now calculate the difference in species abundances for each pair of sites. For example, what is the difference in the abundance of CHA between site 1 and site 2? You need to calculate the following differences, and sum their absolute values for each pair of sites:
CHA, TRU and VAI between site 1 and site 2
CHA, TRU and VAI between site 1 and site 3
CHA, TRU and VAI between site 2 and site 3
Spec.s1s2 <- 0
Spec.s1s3 <- 0
Spec.s2s3 <- 0
# Loop over the three species and accumulate the absolute differences
for (i in 1:3) {
    Spec.s1s2 <- Spec.s1s2 + abs(spe.challenge[1, i] - spe.challenge[2, i])
    Spec.s1s3 <- Spec.s1s3 + abs(spe.challenge[1, i] - spe.challenge[3, i])
    Spec.s2s3 <- Spec.s2s3 + abs(spe.challenge[2, i] - spe.challenge[3, i])
}
Now take the differences you have calculated as the numerator in the equation for Bray-Curtis dissimilarity and the total species abundance that you already calculated as the denominator.
(db.s1s2 <- Spec.s1s2/(Abund.s1 + Abund.s2)) # Site 1 compared to site 2
(db.s1s3 <- Spec.s1s3/(Abund.s1 + Abund.s3)) # Site 1 compared to site 3
(db.s2s3 <- Spec.s2s3/(Abund.s2 + Abund.s3)) # Site 2 compared to site 3
You should find values of 0.5 for site 1 to site 2, 0.538 for site 1 to site 3 and 0.053 for site 2 to 3.
Check your manual results with what you would find using the function vegdist() with the Bray-Curtis method:
(spe.db.challenge <- vegdist(spe.challenge, method = "bray"))
A matrix looking like this is produced, which should be the same as your manual calculations:
| | Site 1 | Site 2 |
---|---|---|
Site 2 | 0.5 | -- |
Site 3 | 0.538 | 0.0526 |
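The manual steps for Bray-Curtis can also be condensed into a small helper function. This is only a sketch: the matrix toy is a hand-copied excerpt (CHA, TRU and VAI at sites 1 to 3, as printed earlier in this chapter), and bray() is a hypothetical helper name, not a vegan function.

```r
# Hand-copied excerpt: abundances of CHA, TRU and VAI at sites 1 to 3
toy <- matrix(c(0, 3, 0,
                0, 5, 4,
                0, 5, 5),
              nrow = 3, byrow = TRUE,
              dimnames = list(1:3, c("CHA", "TRU", "VAI")))

# Bray-Curtis dissimilarity between two abundance vectors:
# sum of absolute differences over the sum of both site totals
bray <- function(a, b) sum(abs(a - b)) / (sum(a) + sum(b))

db <- c(s1s2 = bray(toy[1, ], toy[2, ]),
        s1s3 = bray(toy[1, ], toy[3, ]),
        s2s3 = bray(toy[2, ], toy[3, ]))
round(db, 4)
```

Working on whole rows avoids the explicit loop over species and keeps the numerator and denominator in one place.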
For the Gower dissimilarity, proceed in the same way but use the appropriate equation:
# Calculate the number of columns in your dataset
M <- ncol(spe.challenge)

# Calculate the species abundance differences between pairs
# of sites for each species
Spe1.s1s2 <- abs(spe.challenge[1, 1] - spe.challenge[2, 1])
Spe2.s1s2 <- abs(spe.challenge[1, 2] - spe.challenge[2, 2])
Spe3.s1s2 <- abs(spe.challenge[1, 3] - spe.challenge[2, 3])
Spe1.s1s3 <- abs(spe.challenge[1, 1] - spe.challenge[3, 1])
Spe2.s1s3 <- abs(spe.challenge[1, 2] - spe.challenge[3, 2])
Spe3.s1s3 <- abs(spe.challenge[1, 3] - spe.challenge[3, 3])
Spe1.s2s3 <- abs(spe.challenge[2, 1] - spe.challenge[3, 1])
Spe2.s2s3 <- abs(spe.challenge[2, 2] - spe.challenge[3, 2])
Spe3.s2s3 <- abs(spe.challenge[2, 3] - spe.challenge[3, 3])

# Calculate the range of each species abundance between
# sites
Range.spe1 <- max(spe.challenge[, 1]) - min(spe.challenge[, 1])
Range.spe2 <- max(spe.challenge[, 2]) - min(spe.challenge[, 2])
Range.spe3 <- max(spe.challenge[, 3]) - min(spe.challenge[, 3])

# Calculate the Gower dissimilarity
# (species 1, CHA, is left out of the sums: it is absent from all three
# sites, so both its differences and its range are 0)
(dg.s1s2 <- (1/M) * ((Spe2.s1s2/Range.spe2) + (Spe3.s1s2/Range.spe3)))
(dg.s1s3 <- (1/M) * ((Spe2.s1s3/Range.spe2) + (Spe3.s1s3/Range.spe3)))
(dg.s2s3 <- (1/M) * ((Spe2.s2s3/Range.spe2) + (Spe3.s2s3/Range.spe3)))

# Compare your results
(spe.dg.challenge <- vegdist(spe.challenge, method = "gower"))
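The Gower steps can be condensed in the same spirit. The sketch below follows the manual calculation: differences are divided by each species' range, species with a zero range are skipped, and the sum is divided by the total number of species columns. The matrix toy is a hand-copied excerpt and gower() is a hypothetical helper written for this example.

```r
# Hand-copied excerpt: abundances of CHA, TRU and VAI at sites 1 to 3
toy <- matrix(c(0, 3, 0,
                0, 5, 4,
                0, 5, 5),
              nrow = 3, byrow = TRUE,
              dimnames = list(1:3, c("CHA", "TRU", "VAI")))

# Gower dissimilarity between rows a and b of the abundance matrix x
gower <- function(a, b, x) {
  # Range of each species across all sites in x
  rng <- apply(x, 2, max) - apply(x, 2, min)
  # Species with a zero range (here CHA) are skipped to avoid 0/0
  keep <- rng > 0
  # Range-scaled differences, averaged over all M = ncol(x) species
  sum(abs(x[a, keep] - x[b, keep]) / rng[keep]) / ncol(x)
}

dg <- c(s1s2 = gower(1, 2, toy),
        s1s3 = gower(1, 3, toy),
        s2s3 = gower(2, 3, toy))
round(dg, 4)
```

Dividing by the full number of columns, rather than by the number of species kept, matches the use of M in the manual calculation.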