Chapter 13 Principal Component Analysis
Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset while retaining most of its variability. It is a linear transformation method that converts the original set of variables into a new set of linearly uncorrelated variables, called principal components (PCs), which are sorted in decreasing order of variance.
PCA was first introduced by Karl Pearson in 1901, who developed the mathematical foundation for the method. Later, Harold Hotelling (1933) provided a more detailed and modern interpretation of the PCA method.
PCA has become one of the most commonly used techniques in data analysis due to its ability to identify hidden patterns and reduce the complexity of high-dimensional data.
In essence, PCA aims to find the linear combinations of the original variables that account for the largest possible amount of variation in the dataset. The resulting principal components are orthogonal to each other, meaning that they are not correlated, and their order reflects their importance in explaining the variability of the data.
To become comfortable with PCA, we will first work through a detailed step-by-step example, and then use functions from R packages to do the same.
13.1 Principal component analysis not in a nutshell
Suppose we have a dataset with \(n\) observations and \(p\) variables represented by an n x p matrix \(X\). The goal of PCA is to transform this dataset into a new set of \(p\) uncorrelated variables, called principal components (PCs), which capture the maximum amount of variance in the original data.
13.1.0.1 Load data
In this workshop, we use the varechem dataset (available in the vegan package), which contains measurements of chemical properties of 24 soil samples from a field experiment. We will select the first two variables:
# Load the vegan package, which provides the varechem dataset
library(vegan)

# Load the varechem dataset
data(varechem)

# Select the first two variables
(data <- varechem[, 1:2])
## N P
## 18 19.8 42.1
## 15 13.4 39.1
## 24 20.2 67.7
## 27 20.6 60.8
## 23 23.8 54.5
## 19 22.8 40.9
## 22 26.6 36.7
## 16 24.2 31.0
## 28 29.8 73.5
## 13 28.1 40.5
## 14 21.8 38.1
## 20 26.2 61.9
## 25 22.8 50.6
## 7 30.5 24.6
## 5 33.1 22.7
## 6 19.1 26.4
## 3 31.1 32.3
## 4 18.0 64.9
## 2 22.3 47.4
## 9 15.0 48.4
## 12 16.0 32.7
## 10 14.3 62.8
## 11 16.7 55.8
## 21 21.0 26.5
13.1.0.2 Standardize data
We first need to standardize the data to have mean zero and unit variance:
\[ Z_{ij} = \frac{X_{ij} - \bar{X_j}}{s_j} \]
where \(Z\) is the standardized matrix, \(X\) is the original matrix, \(\bar{X_j}\) is the mean of variable \(j\), and \(s_j\) is the standard deviation of variable \(j\).
data_std <- scale(data)
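As a quick check, we can apply the standardization formula by hand and compare it with scale(). This is a minimal sketch that assumes the data object created above.

# Standardize manually: subtract the column mean, divide by the column SD
Z_manual <- apply(data, 2, function(x) (x - mean(x)) / sd(x))

# Should return TRUE: the manual result matches scale(data)
all.equal(Z_manual, scale(data), check.attributes = FALSE)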
13.1.0.3 Compute the covariance matrix
Next, we compute the covariance matrix of \(Z\):
\[ C = \frac{1}{n-1}Z^{T}Z \]
where \(C\) is the covariance matrix and \(T\) denotes the transpose operation.
The covariance matrix is a symmetric matrix that represents the pairwise covariances between the variables. The formula for the covariance between two variables \(X\) and \(Y\) is:
\[\text{Cov}(X,Y) = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})\]
where \(n\) is the sample size, \(X_i\) and \(Y_i\) are the values of the variables for observation \(i\), and \(\bar{X}\) and \(\bar{Y}\) are the sample means of the variables.
cov_matrix <- cov(data_std)
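Because the variables were standardized, the covariance matrix of \(Z\) is the same as the correlation matrix of the original variables. A small sketch to verify this, assuming the objects created above:

# For standardized data, the covariance matrix equals the correlation
# matrix of the raw variables (should return TRUE)
all.equal(cov_matrix, cor(data), check.attributes = FALSE)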
13.1.0.4 Perform the Eigendecomposition of the covariance matrix
Then, we calculate the eigenvalues and eigenvectors of \(C\):
\[ Cv = \lambda v \]
where \(C\) is the covariance matrix, \(v\) is the eigenvector and \(\lambda\) is the corresponding eigenvalue.
eigen_decomp <- eigen(cov_matrix)
Eigenvalues <- eigen_decomp$values
Eigenvectors <- eigen_decomp$vectors
The eigenvectors represent the directions in the \(p\)-dimensional space that capture the maximum amount of variance in the data, and the eigenvalues indicate the amount of variance captured by each eigenvector.
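Two properties are worth checking here (a minimal sketch using the objects above): the eigenvectors are orthonormal, and the eigenvalues sum to the total variance of \(Z\), which is \(p = 2\) for our standardized two-variable dataset.

# Eigenvectors are orthonormal: t(V) %*% V is (numerically) the identity matrix
round(t(Eigenvectors) %*% Eigenvectors, 10)

# Eigenvalues sum to the total variance of the standardized data (here, 2)
sum(Eigenvalues)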
13.1.0.5 Project the standardized data onto the Eigenspace
Finally, we project the standardized data matrix \(Z\) onto the new basis vectors to obtain the principal components. Here, we’ll calculate the principal component scores by multiplying the standardized data by the eigenvectors of all principal components:
\[ Y = ZV \]
where \(Y\) is the transformed data matrix, and \(V\) is the matrix of eigenvectors.
F_PrComps <- data_std %*% Eigenvectors
head(F_PrComps)
## [,1] [,2]
## 18 0.1894957 0.4713816
## 15 0.8662023 1.4319452
## 24 1.3494546 -0.7909067
## 27 0.9718543 -0.5156358
## 23 0.2644868 -0.6269034
## 19 -0.2510109 0.1444178
The score matrix, \(F\) (object F_PrComps), allows one to rotate the data into the new space, so that each observation is represented in relation to the principal components rather than the original variables.
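One way to see this rotation is to plot the standardized data next to the principal component scores. This is a minimal plotting sketch that assumes the data_std and F_PrComps objects created above; the layout and labels are illustrative.

# Compare the original (standardized) space with the rotated (PC) space
par(mfrow = c(1, 2))
plot(data_std, main = "Standardized data",
     xlab = "N (standardized)", ylab = "P (standardized)")
plot(F_PrComps, main = "Principal component scores",
     xlab = "PC1", ylab = "PC2")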
13.2 Principal component analysis using package functions
PCA can also be computed using the stats::prcomp(), stats::princomp(), vegan::rda(), and ade4::dudi.pca() functions.
In a nutshell, this is what we have done:
data(varechem)
Y <- varechem[, 1:2]
Y_std <- as.matrix(scale(Y))
Y_R <- cov(Y_std)

Eigenvalues <- eigen(Y_R)$values
Eigenvectors <- eigen(Y_R)$vectors

F_PrComps <- Y_std %*% Eigenvectors
head(F_PrComps)
## [,1] [,2]
## 18 0.1894957 0.4713816
## 15 0.8662023 1.4319452
## 24 1.3494546 -0.7909067
## 27 0.9718543 -0.5156358
## 23 0.2644868 -0.6269034
## 19 -0.2510109 0.1444178
Now, how does it compare to stats::prcomp()?
PCA_prcomp <- prcomp(Y, center = TRUE, scale = TRUE)
# or PCA_prcomp <- prcomp(Y_std)
head(PCA_prcomp$x)
## PC1 PC2
## 18 -0.1894957 -0.4713816
## 15 -0.8662023 -1.4319452
## 24 -1.3494546 0.7909067
## 27 -0.9718543 0.5156358
## 23 -0.2644868 0.6269034
## 19 0.2510109 -0.1444178
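The values are the same, but the signs are flipped: the sign of an eigenvector is arbitrary, so different implementations may return mirrored axes. A quick hedged check, assuming the objects computed above:

# The prcomp() scores match our manual scores up to a sign flip per axis
# (should return TRUE)
all.equal(abs(unname(PCA_prcomp$x)), abs(unname(F_PrComps)))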
And, how does it compare to stats::princomp()?
PCA_princomp <- princomp(Y_std)
head(PCA_princomp$scores)
## Comp.1 Comp.2
## 18 -0.1894957 -0.4713816
## 15 -0.8662023 -1.4319452
## 24 -1.3494546 0.7909067
## 27 -0.9718543 0.5156358
## 23 -0.2644868 0.6269034
## 19 0.2510109 -0.1444178
And to vegan::rda()?
PCA_vegan_rda <- rda(Y_std)

scores(PCA_vegan_rda, display = "sites", scaling = 1,
       choices = seq_len(PCA_vegan_rda$CA$rank),
       const = sqrt(PCA_vegan_rda$tot.chi * (nrow(PCA_vegan_rda$CA$u) - 1)))[1:5, ]
## PC1 PC2
## 18 -0.1894957 -0.4713816
## 15 -0.8662023 -1.4319452
## 24 -1.3494546 0.7909067
## 27 -0.9718543 0.5156358
## 23 -0.2644868 0.6269034
vegan::rda() is a bit special: it uses alternative scalings. We will not cover them here, but you can consult vignette("decision-vegan").
13.3 Principal component analysis on ecological data
We have implemented PCA on a two-variable dataset, for simplicity. Let us now apply it to our fish species dataset. For this, we will use the vegan::rda() function on the Hellinger-transformed fish data and summarise the results:
spe.h.pca <- rda(spe.hel)
# summary(spe.h.pca)
The first lines of summary.rda() tell us about the Total variance and Unconstrained variance in our model.
## [1] "Partitioning of variance:" " Inertia Proportion"
## [3] "Total 0.5023 1" "Unconstrained 0.5023 1"
## [1] "Importance of components:"
## [2] " PC1 PC2 PC3 PC4 PC5 PC6 PC7"
## [3] "Eigenvalue 0.2491 0.06592 0.04615 0.03723 0.02125 0.01662 0.01477"
## [4] "Proportion Explained 0.4959 0.13122 0.09188 0.07412 0.04230 0.03309 0.02940"
## [5] "Cumulative Proportion 0.4959 0.62715 0.71903 0.79315 0.83544 0.86853 0.89794"
## [6] " PC14 PC15 PC16 PC17 PC18 PC19"
## [7] "Eigenvalue 0.002612 0.001505 0.001387 0.001037 0.0007815 0.0004749"
## [8] "Proportion Explained 0.005200 0.002996 0.002761 0.002065 0.0015557 0.0009454"
## [9] "Cumulative Proportion 0.987229 0.990225 0.992986 0.995051 0.9966069 0.9975523"
This is followed by the Eigenvalues and their contribution to the variance.
In fact, if we sum all our Eigenvalues, we will obtain the amount of unconstrained variance explained by the analysis!
sum(spe.h.pca$CA$eig)
## [1] 0.5023429
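Dividing each Eigenvalue by the total inertia gives the proportion of variance explained by each axis, which corresponds to the Proportion Explained row shown in the summary above:

# Proportion of the total variance explained by each principal component
head(spe.h.pca$CA$eig / sum(spe.h.pca$CA$eig))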
The next information is related to the scaling, to the species scores, and to the site scores.
## [1] "Eigenvalue 0.0004263 0.0002812 0.0002188 0.0001382 0.0000876"
## [2] "Proportion Explained 0.0008487 0.0005598 0.0004356 0.0002752 0.0001744"
## [3] "Cumulative Proportion 0.9984010 0.9989608 0.9993965 0.9996717 0.9998460"
## [4] " PC25 PC26 PC27"
## [5] "Proportion Explained 1.062e-04 2.938e-05 1.835e-05"
## [6] "Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00"
## [7] "Scaling 2 for species and site scores"
## [8] "* Species are scaled proportional to eigenvalues"
## [9] "* Sites are unscaled: weighted dispersion equal on all dimensions"
## [10] "* General scaling constant of scores: 1.953663 "
## [11] ""
## [12] ""
## [13] "Species scores"
## [14] "BCO -0.20055 0.08332 -0.074787 -0.0504875 0.073890 0.0249842"
## [15] "PCH -0.14626 0.05268 -0.072012 -0.0432572 0.050318 0.0178776"
## [16] "GAR -0.35085 -0.09353 0.198664 0.0178669 0.023796 -0.0971362"
## [17] "BBO -0.24167 0.03598 -0.079528 -0.0339049 0.096690 0.0620979"
## [18] "ABL -0.42269 -0.22879 0.007158 0.1128353 0.006759 0.1248913"
## [19] "ANG -0.20521 0.11557 -0.072060 -0.0159902 0.072030 -0.0003801"
## [20] ""
## [21] ""
## [22] "Site scores (weighted sums of species scores)"
Species refer to your descriptors (i.e. the columns in your dataset), which here are the fish species.
Scores refer to the position of every species along the principal components.
## [1] "Cumulative Proportion 1.000e+00 1.000e+00 1.000e+00"
## [2] "Scaling 2 for species and site scores"
## [3] "* Species are scaled proportional to eigenvalues"
## [4] "* Sites are unscaled: weighted dispersion equal on all dimensions"
## [5] "* General scaling constant of scores: 1.953663 "
## [6] ""
## [7] ""
## [8] "Species scores"
## [1] "PCH -0.14626 0.05268 -0.072012 -0.0432572 0.050318 0.0178776"
## [2] "GAR -0.35085 -0.09353 0.198664 0.0178669 0.023796 -0.0971362"
## [3] "BBO -0.24167 0.03598 -0.079528 -0.0339049 0.096690 0.0620979"
## [4] "ABL -0.42269 -0.22879 0.007158 0.1128353 0.006759 0.1248913"
## [5] "ANG -0.20521 0.11557 -0.072060 -0.0159902 0.072030 -0.0003801"
## [6] ""
## [7] ""
## [8] "Site scores (weighted sums of species scores)"
Sites represent the rows in your dataset, which here are the different sites along the Doubs river.
This information can be obtained with the scores() function that we used before:

scores(spe.h.pca, display = "species")  # species scores
scores(spe.h.pca, display = "sites")    # site scores
13.4 Condensing data with principal component analysis
Here, we have 27 principal components. However, we can apply algorithms to select the lowest number of principal components that still account for most of the variance in the data.
13.4.0.1 Kaiser-Guttman criterion
We can select the principal components that capture more variance than the average of all principal components. We do this by:
Extracting the Eigenvalues associated with the principal components;
Subsetting the Eigenvalues that are above the mean Eigenvalue:
ev <- spe.h.pca$CA$eig
# ev[ev > mean(ev)]

n <- length(ev)
barplot(ev, main = "", col = "grey", las = 2)
abline(h = mean(ev), col = "red3", lwd = 2)
legend("topright", "Average eigenvalue", lwd = 2, col = "red3",
       bty = "n")
13.4.0.2 Broken-stick model
The broken-stick model retains components that explain more variance than would be expected by randomly dividing the variance into \(p\) parts.
head(bstick(spe.h.pca))
## PC1 PC2 PC3 PC4 PC5 PC6
## 0.07240169 0.05379640 0.04449375 0.03829199 0.03364067 0.02991961
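For reference, the broken-stick expectation for axis \(i\) out of \(p\) axes is \(b_i = \frac{1}{p}\sum_{k=i}^{p}\frac{1}{k}\), expressed as a proportion of the total variance. The following minimal sketch computes it by hand and rescales it by the total inertia so that it is comparable to the bstick() output above:

# Broken-stick expectation computed by hand
p <- length(spe.h.pca$CA$eig)
bs <- sapply(seq_len(p), function(i) sum(1 / (i:p)) / p)

# Rescale by the total inertia; should match head(bstick(spe.h.pca)) above
head(bs * sum(spe.h.pca$CA$eig))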
screeplot(spe.h.pca, bstick = TRUE, type = "lines")
13.5 Scaling
All that is left is to discuss scaling and to visualize our results.
Let us practice and compute a PCA on the standardized environmental variables for the same dataset.
env.pca <- rda(env.z)
# summary(env.pca, scaling = 2)
Determine our subset of Eigenvalues and their corresponding Eigenvectors:
ev <- env.pca$CA$eig
ev[ev > mean(ev)]
## PC1 PC2 PC3
## 5.968749 2.163818 1.065164
The information computed by the PCA can be represented with biplots.
We can produce a quick and dirty biplot of the PCA using the function plot() in base R.
plot(spe.h.pca)
biplot() from base R allows for a better interpretation.
biplot(spe.h.pca)
The arrows are plotted to show the directionality and angle of the descriptors in the ordination.
- Descriptors at 180 degrees of each other are negatively correlated;
- Descriptors at 90 degrees of each other have zero correlation;
- Descriptors at 0 degrees of each other are positively correlated.
Type 2 scaling (the default): distances among objects are not approximations of Euclidean distances; angles between descriptor (species) vectors reflect their correlations.
biplot(spe.h.pca, scaling = 2)
Type 1 scaling: attempts to preserve the Euclidean distance (in multidimensional space) among objects (sites); the angles among descriptor (species) vectors are not meaningful.
biplot(spe.h.pca, scaling = 1)
13.6 Challenge #2
Using everything you have learned, compute a PCA on the mite species abundance data
data(mite)
Be ready to discuss and answer:
- What are the most relevant principal components (i.e. subset them)?
- Which groups of sites can you identify?
- Which groups of species are related to these groups of sites?
Challenge 2 - Solution
Compute PCA on the Hellinger-transformed species data
mite.spe.hel <- decostand(mite, method = "hellinger")

mite.spe.h.pca <- rda(mite.spe.hel)
Apply the Kaiser-Guttman criterion
ev <- mite.spe.h.pca$CA$eig
ev[ev > mean(ev)]

n <- length(ev)
barplot(ev, main = "Eigenvalues", col = "grey", las = 2)
abline(h = mean(ev), col = "red3", lwd = 2)
legend("topright", "Average eigenvalue", lwd = 2, col = "red3",
       bty = "n")
biplot(mite.spe.h.pca, col = c("red3", "grey15"))