Principal components analysis

Conceptually, principal component analysis consists of repeatedly finding a unit-length vector such that the projections of a set of k zero-meaned variables onto that vector have the smallest error, i.e., the smallest mean squared distance from the original k-dimensional points. In other words: we want the most efficient representation of the datapoints possible if we have to boil them down to the values of a single linear combination. We then repeat the step on the remaining error, and keep going until we've explained all, or enough, of the variance. The projection vectors found this way are the principal components: the most efficient possible representations of the data, in the sense that each one explains the most variance possible at its step.

Mathematically, the principal component at each step is the eigenvector of the covariance matrix with the highest eigenvalue. A clear explanation is given in, e.g., Shalizi (2021), via the following steps:

  1. Firstly, what is a projection? That's the "shadow" of a vector x_i cast onto a given projection space (here, this will simply be a unit projection space vector, p). Visualize a right-angled triangle, with the vector x_i being the hypotenuse and the adjacent side being the projection, a scaled vector lying in the direction of p; the opposite side is the "error", the part of x_i that casts no shadow on p. So, you can understand how projections work by manipulating the vector x_i and visualizing how its "shadow" cast on p behaves. If the projection space vector has unit length, the (signed) length of the projection is the dot product p . x_i, and the actual projection is the unit projection space vector scaled by this projection length.
  2. The unit projection space vector, p, with the lowest error is the vector with the highest variance of the (signed) lengths of the projections of the n k-dimensional points x_i onto p, proj_i = p . x_i:
    • ||x_i - (p . x_i) p||^2 = x_i . x_i - (p . x_i)^2
    • MSE(p) = (1/n)(Sum_i^n ||x_i||^2 - Sum_i^n (p . x_i)^2)
    The final term is the mean of (p . x_i)^2, which is the square of the mean of (p . x_i) (which is zero with centered variables) plus the variance of (p . x_i). So minimizing the MSE means maximizing the variance of the lengths of the projections.
  3. The variance of the projections can be written as a product involving the projection space vector and the covariance matrix.
    • Var[proj_i] = Var[p . x_i] = (1/n) Sum_i^n (x_i . p)^2 = (1/n) (Xp)'(Xp) = p' (X'X/n) p = p' Cov[X] p, where X is the matrix containing the points x_i as rows, with each column having zero mean.
  4. The unit-length projection space vector producing the highest variance for the projections can be found using the Lagrange multiplier method, with the unit length as the constraint. This shows that the desired projection space vector must be an eigenvector of the covariance matrix:
    • L(p, lambda) = p' Cov[X] p - lambda (p'p - 1)
    • del L/del p = 2 Cov[X] p - 2 lambda p
    • Cov[X] p = lambda p, by setting the partial derivative to 0 to find the optimum.
  5. The Lagrange multiplier itself is the eigenvalue, and is equal to the variance of the projections.
    • Var[proj_i] = p' Cov[X] p = p' lambda p = lambda.

Thus, selecting the eigenvector with the highest eigenvalue provides the principal component of the current step. (In fact, although not necessary for the step-by-step definition and proofs, the set of eigenvectors of the covariance matrix produces all the principal components at once. From general linear algebra, a covariance matrix is symmetric and non-negative definite, and therefore has orthogonal eigenvectors with non-negative eigenvalues. So, ordered by decreasing eigenvalue, each subsequent eigenvector is the principal component within the subspace orthogonal to all preceding eigenvectors.)
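As a quick numerical sanity check of the above (not part of the derivation): the variance of the projections onto any unit vector can never exceed the variance achieved by the eigenvector with the highest eigenvalue. A rough sketch in Matlab, with variable names chosen just for this illustration:

X = randn(200, 4) * randn(4, 4);              % some correlated test data
X = X - ones(size(X, 1), 1) * mean(X);        % zero-mean the columns
C = (X' * X) / size(X, 1);                    % covariance matrix, divisor n as in the derivation
[V, D] = eig(C);
[ev_max, i_max] = max(diag(D));
p_best = V(:, i_max);                         % eigenvector with the highest eigenvalue
var_best = var(X * p_best, 1);                % variance of the projection lengths (divisor n); equals ev_max
for iter = 1:1000
    p = randn(size(X, 2), 1);
    p = p / norm(p);                          % random unit-length projection space vector
    assert(var(X * p, 1) <= var_best + 1e-10) % never beats the top eigenvector
end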

The rest of this page provides information on running PCAs using various methods. The idea is primarily to disambiguate exactly which matrices each method returns. I'll use the following, hopefully helpful, notation.

Let the observed data be O(N, K), N observations of K variables. The columns of O will generally be correlated. PCA provides a description of O as weighted sums of uncorrelated latent variables L. There are two kinds of weights matrices:

Observed-to-latent weights, O2L
Latent-to-observed weights, L2O

Then O = L * L2O and L = O * O2L (strictly, with the de-meaned O; see below). With the full decomposition, O and L are both N-by-K and the two weight matrices are K-by-K. Using the ICA terminology, the Mixing matrix is L2O and the Unmixing matrix is O2L: components are considered to be mixed and then observed as signals.

Calculating principal components

princomp

Let

O = randn(100, 9);
Om = O - ones(size(O, 1), 1) * mean(O);
[O2L, L, EV] = princomp(O);

Then Om * O2L = L.

Note that the de-meaned Om must be used for this equality: princomp de-means the data internally, so the scores in L correspond to Om rather than O.

The L2O mixing matrix can be calculated from the O2L matrix by

L2O = inv(O2L' * O2L) * O2L';

Then L * L2O = Om.

The eigenvalues (EV) are equal to the variance of the latent variables L.
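Continuing from the code above, the following quick checks should all return values near zero (noting that the columns of O2L returned by princomp are orthonormal, so L2O is simply O2L'):

max(max(abs(Om * O2L - L)))      % ~0: Om * O2L = L
max(max(abs(L * L2O - Om)))      % ~0: L * L2O = Om
max(max(abs(L2O - O2L')))        % ~0: since O2L is orthonormal, L2O is just its transpose
max(abs(var(L)' - EV))           % ~0: the eigenvalues equal the variances of the latent variables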

Eigendecomposition of the covariance matrix

Using the eigendecomposition of the covariance matrix of O gives the same results, but the columns are sorted in reverse (from low to high eigenvalue) and the weights may be multiplied by -1:

Ocov = cov(O);
[O2L, EV] = eig(Ocov);

A proof of why this eigendecomposition works is given on Wikipedia.
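As a concrete comparison, something along these lines should hold (the _pc and _eig suffixes are used only here, to keep both sets of results around; note that eig returns the eigenvalues as a diagonal matrix, so diag is needed to extract them as a vector):

[O2L_pc, L_pc, EV_pc] = princomp(O);
[O2L_eig, D] = eig(cov(O));
[ev_sorted, order] = sort(diag(D), 'descend');
O2L_eig = O2L_eig(:, order);                  % reorder the columns from high to low eigenvalue
max(abs(ev_sorted - EV_pc))                   % ~0: same eigenvalues as princomp's EV
% Each column may differ from the princomp weights by a factor of -1:
max(min(max(abs(O2L_eig - O2L_pc)), max(abs(O2L_eig + O2L_pc))))   % ~0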

NIPALS algorithm

Implemented in teg_PCA, this works differently: it finds the L2O matrix iteratively. The O2L matrix is subsequently calculated from the L2O matrix as O2L = L2O' * inv(L2O * L2O');

[L, L2O, O2L, EV] = teg_PCA(O);

Then L * L2O = Om. The main advantage here is that the covariance matrix doesn't need to be held in memory, and the home-made function can be further optimized for memory usage.
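For reference, the core of the NIPALS iteration looks roughly like the sketch below. This is a generic textbook version, not the actual teg_PCA implementation (Xr, t, and w are names used only here): each component is found by a power-iteration-style loop, and the data are then deflated, so only the N-by-K data matrix needs to be kept in memory.

Xr = Om;                                      % residual (deflated) data, starting from the de-meaned data
n_comp = 3;
L_nip = zeros(size(Xr, 1), n_comp);
L2O_nip = zeros(n_comp, size(Xr, 2));
for c = 1:n_comp
    t = Xr(:, 1);                             % initial guess for the latent variable (scores)
    for iter = 1:500
        w = Xr' * t / (t' * t);               % regress the columns of Xr onto t
        w = w / norm(w);                      % keep the weight vector at unit length
        t_new = Xr * w;                       % updated scores
        if norm(t_new - t) < 1e-10 * norm(t_new), t = t_new; break; end
        t = t_new;
    end
    L_nip(:, c) = t;
    L2O_nip(c, :) = w';
    Xr = Xr - t * w';                         % deflate: remove the part explained by this component
end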

VARIMAX rotation

O2Lr = rotatefactors(O2L, 'method', 'varimax') gives the varimax rotation. This maximizes the variance of the squared weights within each column of the new O2L matrix. That is, the resulting rotated latent variables in Lr = Om * O2Lr are based on simple combinations of variables: each rotated component is dominated by a relatively small set of observed variables.

Quartimax instead maximizes the variance of the squared weights within each row, so that each observed variable loads mainly on one component; equimax combines the two criteria. However, it seems that most often the goal will be to have components that can be easily understood in terms of variables, which is what varimax provides.

After finding a rotation matrix, the following relationship holds:

Om * O2Lr = Lr
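Putting this together, and assuming (as would be usual) that only the first few components are retained and rotated, the sequence would look something like the sketch below; the choice of three components is arbitrary here. The second output T is the rotation matrix, so that O2Lr equals O2L(:, 1:n_keep) * T.

n_keep = 3;
[O2L, L, EV] = princomp(O);                   % unrotated solution
[O2Lr, T] = rotatefactors(O2L(:, 1:n_keep), 'method', 'varimax');
Lr = Om * O2Lr;                               % rotated latent variables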

The scree criterion (elbow criterion)

An objective criterion for selecting the "top" components, i.e., those with the highest eigenvalues, is the scree criterion. This determines at which point the slope of the ordered eigenvalues passes through 1 in absolute value (after normalizing the eigenvalues and index numbers to a [0 1] range). Matlab code here.
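The linked code isn't reproduced here, but a minimal sketch of the idea, assuming the criterion is the first point at which the magnitude of the slope of the normalized, descending eigenvalue curve drops below 1 (scree_cutoff is just a name made up for this sketch):

function n_keep = scree_cutoff(EV)
% EV: vector of eigenvalues; returns the number of components before the "knee".
ev = sort(EV(:), 'descend');
x = linspace(0, 1, length(ev))';              % index numbers normalized to [0 1]
y = (ev - min(ev)) / (max(ev) - min(ev));     % eigenvalues normalized to [0 1]
slopes = diff(y) ./ diff(x);                  % slope between successive eigenvalues (negative)
n_keep = find(abs(slopes) < 1, 1);            % first segment flatter than 45 degrees
if isempty(n_keep), n_keep = length(ev); end

For the princomp example above, n_keep = scree_cutoff(EV) would then give the number of components to keep.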