Chapter 11 Clustering

One application of association matrices is clustering. Clustering highlights structures in the data by partitioning either the objects or the descriptors. As a result, similar objects are combined into groups, allowing distinctions – or contrasts – between groups to be identified. One goal of ecologists could be to divide a set of sitesinto groups with respect to their environmental conditions or their community composition.

Clustering results are often represented as dendrograms (trees), where objects agglomerate into groups. There are several families of clustering methods, but for the purpose of this workshop, we will present an overview of three hierarchical agglomerative clustering methods: single linkage, complete linkage, and Ward’s minimum variance clustering. See Chapter 8 of Legendre and Legendre 2012 for more details on the different families of clustering methods.

In hierarchical methods, elements of lower clusters (or groups) become members of larger, higher ranking clusters, e.g. species, genus, family, order. Prior to clustering, one needs to create an association matrix among the objects. Distance matrix is the default choice of clustering functions in R. The association matrix is first sorted in increasing distance order (or decreasing similarities). Then, groups are formed hierarchically following rules specific to each method.

11.1 Single linkage agglomerative clustering

Let's take a simple example of a euclidean distance matrix between 5 objets which were sorted in ascending order.

In single linkage agglomerative clustering (also called nearest neighbour sorting), the objects at the closest distances agglomerate. The two closest objects agglomerate first, the next two closest objects/clusters are then linked, and so on, which often generates long thin clusters or chains of objects (see how objets 1 to 5 cluster successively). Conversely, in complete linkage agglomerative clustering, an object agglomerates to a group only when linked to the furthest element of the group, which in turn links it to all members of that group (in the above example, the group formed of objets 3 and 4 only clusters with group 1-2 at 0.4, a distance at which objets 1, 2, 3 and 4 are all linked). As a result, complete linkage clustering will form many small separate groups, and is more appropriate to look for contrasts, discontinuities in the data.

11.2 Complete linkage agglomerative clustering

Let’s compare the single and complete linkage clustering methods using the Doubs fish species data.

Species data were already Hellinger-transformed. The cluster analysis requiring similarity or dissimilarity indices, the first step will be to generate the Hellinger distance indices.

spe.dhel <- vegdist(spe.hel, method = "euclidean")  #generates the distance matrix from Hellinger transformed data

# See difference between the two matrices
head(spe.hel)  # Hellinger-transformed species data
head(spe.dhel)  # Hellinger distances among sites

Most clustering methods can be computed with the function hclust() of the stats package.

# Faire le groupement à liens simples Perform single
# linkage clustering
spe.dhel.single <- hclust(spe.dhel, method = "single")
plot(spe.dhel.single)

# Perform complete linkage clustering
spe.dhel.complete <- hclust(spe.dhel, method = "complete")
plot(spe.dhel.complete)

Are there big differences between the two dendrograms?

In single linkage clustering, chains of objets occur (e.g. 19, 29, 30, 20, 26, etc.), whereas more contrasted groups are formed in the complete linkage clustering.

11.3 Ward’s minimum variance clustering

Ward’s minimum variance clustering differ from these two methods in that it clusters objects into groups using the criterion of least squares (similar to linear models). At the beginning, each object is considered being a cluster of its own. At each step, the pair of clusters merging is the one leading to minimum increase in total within-group sum of squares.

Again, it is possible to generate a Ward’s minimum variance clustering with hclust(). However, the dendogram shows squared distances by default. In order to compare this dendrogram to the single and complete linkage clustering results, one must calculate the square root of the distances.

# Perform Ward minimum variance clustering
spe.dhel.ward <- hclust(spe.dhel, method = "ward.D2")
plot(spe.dhel.ward)

# Re-plot the dendrogram by using the square roots of the
# fusion levels
spe.dhel.ward$height <- sqrt(spe.dhel.ward$height)
plot(spe.dhel.ward)
plot(spe.dhel.ward, hang = -1)  # hang=-1 aligns all objets on the same line

The clusters generated using Ward’s method tend to be more spherical and to contain similar numbers of objects.

One must be careful in the choice of an association measure and clustering method in order to correctly address a problem. What are you most interested in: gradients? Contrasts? In addition, the results should be interpreted with respect to the properties of the method used. If more than one method seems suitable to an ecological question, computing them all and compare the results would be the way to go. As a reminder, clustering is not a statistical method, but further steps can be taken to identify interpretable clusters (e.g. where to cut the tree), or to compute clustering statistics. Clustering can also be combined to ordination in order to distinguish groups of sites. These go beyond the scope of this workshop, but see Borcard et al. 2011 for more details.