Chapter 10 Transformations

Communities sampled along homogeneous or short environmental gradients can have species compositions with few zeroes, so that Euclidean distances may be enough to characterize them.

Nevertheless, this is rarely the reality.

Species may be highly abundant where conditions are favourable and absent from many other sites. The resulting skewed, zero-rich data can distort our analyses.

We may then have to transform our composition data before we can analyze them appropriately.

In R, we can rely on vegan::decostand() for many types of transformations.

Take a look at the help page of this function to see the available options:

?decostand

10.1 Presence-absence transformation

We can set the argument method to "pa" in decostand() to transform our abundance data into presence-absence data:

If \(y_{ij} \geq 1\), then \(y'_{ij} = 1\); otherwise, \(y'_{ij} = 0\).

Let us recall our spe data set:

spe[1:6, 1:6]
##   CHA TRU VAI LOC OMB BLA
## 1   0   3   0   0   0   0
## 2   0   5   4   3   0   0
## 3   0   5   5   5   0   0
## 4   0   4   5   5   0   0
## 5   0   2   3   2   0   0
## 6   0   3   4   5   0   0

Let us transform spe abundances to presence-absences:

spe.pa <- decostand(spe, method = "pa")
spe.pa[1:6, 1:6]
##   CHA TRU VAI LOC OMB BLA
## 1   0   1   0   0   0   0
## 2   0   1   1   1   0   0
## 3   0   1   1   1   0   0
## 4   0   1   1   1   0   0
## 5   0   1   1   1   0   0
## 6   0   1   1   1   0   0
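
The same rule can be applied by hand in base R. The sketch below uses a small hypothetical abundance matrix (toy, which is not part of the Doubs data) and assumes that any value greater than zero counts as a presence:

```r
# Hypothetical toy abundances: 3 sites (rows) x 4 species (columns)
toy <- matrix(c(0, 3, 0, 1,
                2, 0, 0, 7,
                0, 0, 5, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("site", 1:3),
                              c("sp1", "sp2", "sp3", "sp4")))

# Any abundance greater than zero becomes 1; zeros stay 0
toy.pa <- (toy > 0) * 1
toy.pa

# Row sums of a presence-absence table give the species richness of each site
richness <- rowSums(toy.pa)
richness
```

Summing the rows of a presence-absence table is a quick way to obtain species richness per site.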

10.2 Species profiles transformation

Sometimes, one wants to remove the effects of highly abundant units. We can transform the data into profiles of relative species abundances through the following equation:

\[y'_{ij} = \frac{y_{ij}}{y_{i+}}\]

where \(y_{i+}\) indicates the sample total count over all \(j = 1, \ldots, m\) species, for the \(i\)-th sample.

In decostand(), we can set method = "total":

spe.total <- decostand(spe, method = "total")
spe.total[1:5, 1:6]
##   CHA        TRU        VAI        LOC OMB BLA
## 1   0 1.00000000 0.00000000 0.00000000   0   0
## 2   0 0.41666667 0.33333333 0.25000000   0   0
## 3   0 0.31250000 0.31250000 0.31250000   0   0
## 4   0 0.19047619 0.23809524 0.23809524   0   0
## 5   0 0.05882353 0.08823529 0.05882353   0   0
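
As a rough illustration of the equation above, the profiles can be computed directly in base R. The matrix toy is hypothetical, and the sketch assumes that no site has a total of zero (such a site would produce NaN values):

```r
# Hypothetical toy abundances: 3 sites (rows) x 4 species (columns)
toy <- matrix(c(0, 3, 0, 1,
                2, 0, 0, 7,
                0, 0, 5, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("site", 1:3),
                              c("sp1", "sp2", "sp3", "sp4")))

# y_{i+}: total abundance of each site (row)
site.totals <- rowSums(toy)

# Divide every row by its own total to obtain relative abundances
toy.total <- sweep(toy, 1, site.totals, "/")
toy.total

# Each profile sums to 1
rowSums(toy.total)
```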

10.3 Hellinger transformation

We can take the square root of the species profile transformation to obtain the Hellinger transformation. It has very good mathematical properties (Euclidean distances computed on the transformed data are Hellinger distances) and it reduces the effect of extremely large \(y_{ij}\) values.

\[y'_{ij} = \sqrt{\frac{y_{ij}}{y_{i+}}}\]

In decostand(), we can set method = "hellinger":

spe.hel <- decostand(spe, method = "hellinger")
spe.hel[1:5, 1:6]
##   CHA       TRU       VAI       LOC OMB BLA
## 1   0 1.0000000 0.0000000 0.0000000   0   0
## 2   0 0.6454972 0.5773503 0.5000000   0   0
## 3   0 0.5590170 0.5590170 0.5590170   0   0
## 4   0 0.4364358 0.4879500 0.4879500   0   0
## 5   0 0.2425356 0.2970443 0.2425356   0   0
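
A minimal base R sketch of the same calculation, on a hypothetical matrix toy, may help to see what the transformation does. It assumes that no site total is zero:

```r
# Hypothetical toy abundances: 3 sites (rows) x 4 species (columns)
toy <- matrix(c(0, 3, 0, 1,
                2, 0, 0, 7,
                0, 0, 5, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(paste0("site", 1:3),
                              c("sp1", "sp2", "sp3", "sp4")))

# Hellinger transformation: square root of the relative abundances
toy.hel <- sqrt(sweep(toy, 1, rowSums(toy), "/"))

# Every transformed row has a sum of squares equal to 1
rowSums(toy.hel^2)

# Euclidean distances on the transformed data; site1 and site3 share no
# species, so their distance reaches the maximum, sqrt(2)
toy.dist <- as.matrix(dist(toy.hel))
toy.dist
```

Because every transformed row has unit length, the distances are bounded by \(\sqrt{2}\), whatever the raw abundances were.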

10.4 Z-score standardization

Z-score standardization, also known as standard score normalization, rescales a variable so that it has a mean of 0 and a standard deviation of 1. It involves subtracting the mean of the data and dividing by the standard deviation. The shape of the distribution is unchanged, so a skewed variable remains skewed after standardization.

Standardizing environmental variables is crucial because the effects of variables measured in different units cannot be compared directly:

# ?decostand
env.z <- decostand(env, method = "standardize")

This centres and scales the variables to make your downstream analysis more appropriate:

apply(env.z, 2, mean)
##           das           alt           pen           deb            pH 
##  1.000429e-16  1.814232e-18 -1.659010e-17  1.233099e-17 -4.096709e-15 
##           dur           pho           nit           amm           oxy 
##  3.348595e-16  1.327063e-17 -8.925898e-17 -4.289646e-17 -2.886092e-16 
##           dbo 
##  7.656545e-17
apply(env.z, 2, sd)
## das alt pen deb  pH dur pho nit amm oxy dbo 
##   1   1   1   1   1   1   1   1   1   1   1
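
To make the arithmetic explicit, here is a small sketch on a hypothetical table (toy.env, with made-up values for two variables in very different units). It compares a manual z-score with base R scale(), which applies the same centring and scaling:

```r
# Hypothetical environmental table: two variables in very different units
toy.env <- data.frame(alt = c(100, 250, 400, 900),
                      pH  = c(7.9, 8.1, 8.0, 7.7))

# Manual z-scores: subtract the column mean, divide by the column
# standard deviation
manual.z <- sapply(toy.env, function(v) (v - mean(v)) / sd(v))

# scale() centres and scales each column in the same way
scaled.z <- scale(toy.env)

# Both approaches agree
all.equal(as.vector(manual.z), as.vector(scaled.z))

# Means are (numerically) zero and standard deviations are one
round(colMeans(manual.z), 10)
apply(manual.z, 2, sd)
```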

We will see more details about this transformation in the next sections!

10.4.0.1 Little review

Association - “general term to describe any measure or coefficient to quantify the resemblance or difference between objects or descriptors. In an analysis between descriptors, zero means no association.” (Legendre and Legendre 2012).

Similarity - a measure that is “maximum (S=1) when two objects are identical and minimum when two objects are completely different.” (Legendre and Legendre 2012).

Distance (also called dissimilarity) - a measure that is “maximum (D=1) when two objects are completely different” (Legendre and Legendre 2012). Distance or dissimilarity (D) = 1-S.

Choosing an association measure depends on your data, but also on what you know, ecologically, about your data.

Here are some commonly used dissimilarity (distance) measures (recreated from Gotelli and Ellison 2004):

| Measure name | Property | Description |
|---|---|---|
| Euclidean | Metric | Straight-line distance between two points in multidimensional space. |
| Manhattan | Metric | Distance between two points computed as the sum of the absolute differences of their Cartesian coordinates, i.e. the path along the two sides of a right angle drawn between the points. |
| Chord | Metric | This distance is generally used to assess differences due to genetic drift. |
| Mahalanobis | Metric | Distance between a point and a set distribution, where the distance is the number of standard deviations of the point from the mean of the distribution. |
| Chi-square | Metric | Similar to Euclidean. |
| Bray-Curtis | Semi-metric | Dissimilarity between two samples (or sites), derived from the sum of the lower abundances of species present in both samples divided by the sum of the abundances counted in each sample. |
| Jaccard | Metric | Dissimilarity based on the proportion of species shared between two samples (presence-absence data). |
| Sorensen’s | Semi-metric | Bray-Curtis computed on presence-absence data equals 1 - Sorensen similarity. |
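
As a small illustration of the first two measures, the sketch below computes the Euclidean and Manhattan distances by hand and with dist() from base R. The values are those of sites 1 and 2 for CHA, TRU and VAI in the spe excerpt shown earlier; the object name two.sites is only used for this example:

```r
# Sites 1 and 2, species CHA, TRU and VAI (copied from the spe excerpt)
two.sites <- rbind(site1 = c(CHA = 0, TRU = 3, VAI = 0),
                   site2 = c(CHA = 0, TRU = 5, VAI = 4))

# Euclidean: square root of the summed squared differences
d.euc <- sqrt(sum((two.sites["site1", ] - two.sites["site2", ])^2))

# Manhattan: sum of the absolute differences
d.man <- sum(abs(two.sites["site1", ] - two.sites["site2", ]))

# dist() gives the same values
c(manual = d.euc, dist = as.numeric(dist(two.sites, method = "euclidean")))
c(manual = d.man, dist = as.numeric(dist(two.sites, method = "manhattan")))
```

Neither measure is bounded by 1, so they need to be rescaled before they can be read as D = 1 - S.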

10.4.1 Other association metrics

Quantitative environmental data

Let us look at associations between sites based on their environmental variables (also known as Q mode analysis):

?dist

# euclidean distance matrix of the standardized
# environmental variables
env.de <- dist(env.z, method = "euclidean")

dev.new()  # Opens a separate graphics window (windows() only works on Windows)
coldiss(env.de, diag = TRUE)

We can then look at the dependence between environmental variables (also known as R mode analysis):

(env.pearson <- cor(env))  # Computing Pearson's r among variables
round(env.pearson, 2)  # Rounds the coefficients to 2 decimal points 
(env.ken <- cor(env, method = "kendall"))  # Kendall's tau rank correlation
round(env.ken, 2)

The Pearson correlation measures the strength of the linear relationship between two variables. The Kendall tau is a rank correlation, which means that it quantifies the relationship between two descriptors or variables when the data are ordered within each variable.
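
The difference between the two coefficients shows up clearly when a relationship is monotonic but not linear. The following sketch uses made-up vectors x and y (they are not taken from env):

```r
# A strictly increasing but strongly non-linear relationship
x <- 1:10
y <- exp(x)

# Pearson's r is pulled below 1 because the relationship is not linear
r.pearson <- cor(x, y, method = "pearson")

# Kendall's tau only uses the ordering of the values, which is identical
# in x and y, so it reaches its maximum
tau.kendall <- cor(x, y, method = "kendall")

c(pearson = r.pearson, kendall = tau.kendall)
```

A rank correlation is therefore often a safer choice when variables are related monotonically but not linearly.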

In some cases, environmental variables may be of mixed types. Q mode analysis can still be used to find associations between objects described by such variables. We’ll do this by first creating an example data frame:

var.g1 <- rnorm(30, 0, 1)
var.g2 <- runif(30, 0, 5)
var.g3 <- gl(3, 10)
var.g4 <- gl(2, 5, 30)

(dat2 <- data.frame(var.g1, var.g2, var.g3, var.g4))

str(dat2)
summary(dat2)

A dissimilarity matrix can be generated for these mixed variables using the Gower dissimilarity coefficient:

library(cluster)  # provides daisy()
?daisy  # This function can handle NAs in the data
(dat2.dg <- daisy(dat2, metric = "gower"))

coldiss(dat2.dg)
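
To see roughly what a Gower coefficient does with mixed variables, here is a hand-rolled sketch for one pair of rows. It rests on several assumptions: the data frame toy and the helper gower_pair() are hypothetical, there are no missing values, every numeric variable has a non-zero range, and factors are treated as simple matches or mismatches. It is meant as an illustration and is worth comparing against daisy() before relying on it:

```r
# Hypothetical mixed data: two numeric variables and one factor
toy <- data.frame(num1 = c(1, 3, 5),
                  num2 = c(10, 20, 20),
                  fac  = factor(c("a", "b", "a")))

# Gower dissimilarity between rows i and j (illustrative helper)
gower_pair <- function(df, i, j) {
  parts <- sapply(df, function(v) {
    if (is.numeric(v)) {
      # Numeric variable: absolute difference scaled by the variable's range
      abs(v[i] - v[j]) / diff(range(v))
    } else {
      # Factor: 0 if the two levels match, 1 otherwise
      as.numeric(v[i] != v[j])
    }
  })
  mean(parts)  # average contribution over all variables
}

d12 <- gower_pair(toy, 1, 2)
d13 <- gower_pair(toy, 1, 3)
d23 <- gower_pair(toy, 2, 3)
c(d12 = d12, d13 = d13, d23 = d23)
```

Each variable contributes a value between 0 and 1, which is what allows numeric and categorical variables to be combined in a single coefficient.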

Challenge 1 - Advanced

Calculate the Bray-Curtis and the Gower dissimilarity of species abundance CHA, TRU and VAI for sites 1, 2 and 3 (using the “spe” data frame) by hand, without using the vegdist() function.

Challenge 1 - Advanced Solution
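
For reference, one common way to write the Bray-Curtis dissimilarity between two sites \(x_1\) and \(x_2\), using the notation of the transformation sections (\(y_{ij}\) for the abundance of species \(j\) at site \(i\), and \(y_{i+}\) for the total abundance at site \(i\)), is:

\[D_{BC}(x_1, x_2) = \frac{\sum_{j=1}^{m} |y_{1j} - y_{2j}|}{y_{1+} + y_{2+}}\]

The numerator sums the absolute abundance differences over the \(m\) species, and the denominator adds the total abundances of the two sites.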

Subset the species data so that only sites 1, 2 and 3 and the species CHA, TRU and VAI are included.

# Rows 1 to 3 are the sites; columns 1 to 3 are the three species of
# interest (CHA, TRU and VAI)
spe.challenge <- spe[1:3, 1:3]

Determine the total species abundance for each site of interest (the sum of the three species in each row). These totals form the denominator of the Bray-Curtis dissimilarity.

(Abund.s1 <- sum(spe.challenge[1, ]))
(Abund.s2 <- sum(spe.challenge[2, ]))
(Abund.s3 <- sum(spe.challenge[3, ]))
# () around code will cause output to print right away in
# console

Now calculate, for each species, the absolute difference in abundance between each pair of sites. For example, what is the difference between the abundance of TRU in site 1 and in site 2? You need these differences for each of CHA, TRU and VAI, and for each pair of sites: site 1 and site 2, site 1 and site 3, site 2 and site 3. Summing them over the three species gives one total per pair of sites.

# Running totals of the absolute abundance differences for each pair of sites
Spec.s1s2 <- 0
Spec.s1s3 <- 0
Spec.s2s3 <- 0
for (i in 1:3) {  # loop over the three species (columns)
    Spec.s1s2 <- Spec.s1s2 + abs(spe.challenge[1, i] - spe.challenge[2, i])
    Spec.s1s3 <- Spec.s1s3 + abs(spe.challenge[1, i] - spe.challenge[3, i])
    Spec.s2s3 <- Spec.s2s3 + abs(spe.challenge[2, i] - spe.challenge[3, i])
}

Now take the differences you have calculated as the numerator in the equation for Bray-Curtis dissimilarity and the total species abundance that you already calculated as the denominator.

(db.s1s2 <- Spec.s1s2/(Abund.s1 + Abund.s2))  #Site 1 compared to site 2
(db.s1s3 <- Spec.s1s3/(Abund.s1 + Abund.s3))  #Site 1 compared to site 3
(db.s2s3 <- Spec.s2s3/(Abund.s2 + Abund.s3))  #Site 2 compared to site 3 

You should find values of 0.5 for site 1 to site 2, 0.538 for site 1 to site 3 and 0.053 for site 2 to 3.

Check your manual results with what you would find using the function vegdist() with the Bray-Curtis method:

(spe.db.challenge <- vegdist(spe.challenge, method = "bray"))

A matrix looking like this is produced, which should be the same as your manual calculations:

|        | Site 1 | Site 2 |
|---|---|---|
| Site 2 | 0.5 | -- |
| Site 3 | 0.538 | 0.0526 |
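
The manual steps can also be condensed into a short function. The sketch below is self-contained: it rebuilds the three sites and three species from the spe excerpt shown earlier, and the names spe.toy and bray_curtis() are only used for this example:

```r
# Sites 1 to 3, species CHA, TRU and VAI (copied from the spe excerpt)
spe.toy <- matrix(c(0, 3, 0,
                    0, 5, 4,
                    0, 5, 5),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("site", 1:3),
                                  c("CHA", "TRU", "VAI")))

# Bray-Curtis dissimilarity between two abundance vectors:
# summed absolute differences over summed total abundances
bray_curtis <- function(a, b) sum(abs(a - b)) / (sum(a) + sum(b))

# Apply it to every pair of sites: (1,2), (1,3) and (2,3)
bc <- combn(nrow(spe.toy), 2,
            function(p) bray_curtis(spe.toy[p[1], ], spe.toy[p[2], ]))
names(bc) <- c("s1s2", "s1s3", "s2s3")
round(bc, 4)
```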

For the Gower dissimilarity, proceed in the same way but use the appropriate equation:
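
For quantitative variables, the form of the Gower dissimilarity assumed in the code that follows can be written as:

\[D_{G}(x_1, x_2) = \frac{1}{M} \sum_{j=1}^{M} \frac{|y_{1j} - y_{2j}|}{R_j}\]

where \(M\) is the number of species (columns) and \(R_j = \max_i y_{ij} - \min_i y_{ij}\) is the range of species \(j\) over the sites being compared. A species with \(R_j = 0\) takes the same value at every site, so its term is taken to be 0.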

# Calculate the number of columns in your dataset
M <- ncol(spe.challenge)

# Calculate the species abundance differences between pairs
# of sites for each species
Spe1.s1s2 <- abs(spe.challenge[1, 1] - spe.challenge[2, 1])
Spe2.s1s2 <- abs(spe.challenge[1, 2] - spe.challenge[2, 2])
Spe3.s1s2 <- abs(spe.challenge[1, 3] - spe.challenge[2, 3])
Spe1.s1s3 <- abs(spe.challenge[1, 1] - spe.challenge[3, 1])
Spe2.s1s3 <- abs(spe.challenge[1, 2] - spe.challenge[3, 2])
Spe3.s1s3 <- abs(spe.challenge[1, 3] - spe.challenge[3, 3])
Spe1.s2s3 <- abs(spe.challenge[2, 1] - spe.challenge[3, 1])
Spe2.s2s3 <- abs(spe.challenge[2, 2] - spe.challenge[3, 2])
Spe3.s2s3 <- abs(spe.challenge[2, 3] - spe.challenge[3, 3])

# Calculate the range of each species abundance between
# sites
Range.spe1 <- max(spe.challenge[, 1]) - min(spe.challenge[, 1])
Range.spe2 <- max(spe.challenge[, 2]) - min(spe.challenge[, 2])
Range.spe3 <- max(spe.challenge[, 3]) - min(spe.challenge[, 3])

# Calculate the Gower dissimilarity. CHA (species 1) is absent from all
# three sites, so its range is 0: its term is left out to avoid dividing
# by zero, but it still counts towards M
(dg.s1s2 <- (1/M) * ((Spe2.s1s2/Range.spe2) + (Spe3.s1s2/Range.spe3)))
(dg.s1s3 <- (1/M) * ((Spe2.s1s3/Range.spe2) + (Spe3.s1s3/Range.spe3)))
(dg.s2s3 <- (1/M) * ((Spe2.s2s3/Range.spe2) + (Spe3.s2s3/Range.spe3)))

# Compare your results
(spe.dg.challenge <- vegdist(spe.challenge, method = "gower"))
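
The same calculation can be sketched more compactly in base R. The helper gower_quant() and the matrix spe.toy are hypothetical names for this example, and the sketch assumes that a species with zero range contributes 0 to the sum, as in the manual solution:

```r
# Sites 1 to 3, species CHA, TRU and VAI (copied from the spe excerpt)
spe.toy <- matrix(c(0, 3, 0,
                    0, 5, 4,
                    0, 5, 5),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(paste0("site", 1:3),
                                  c("CHA", "TRU", "VAI")))

gower_quant <- function(x) {
  # Range of each species over the sites
  rng <- apply(x, 2, max) - apply(x, 2, min)
  # Divide each column by its range; a zero range is replaced by 1, which
  # is harmless because all differences in that column are 0 anyway
  scaled <- sweep(x, 2, ifelse(rng > 0, rng, 1), "/")
  # Sum of range-scaled absolute differences, averaged over the M columns
  dist(scaled, method = "manhattan") / ncol(x)
}

dg <- gower_quant(spe.toy)
round(dg, 4)
```

The three values should agree with dg.s1s2, dg.s1s3 and dg.s2s3 computed above.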