Principal Component Analysis explained

Khristian Kotov

Introduction

Principal component analysis (PCA) is one of the most important tools of data analysis. Thanks to R's intuitive built-in prcomp function, I never had to remember the related undergraduate-level linear algebra. If someone asked me what the principal components are, I would answer that they are the directions of maximum variation of the data, but I could not say much more. So I decided to reduce my ignorance and review the method in more detail.

On the following pages we assume:

  • vectors (or points) in \(d\)-dimensional space are \(d \times 1\) matrices (i.e. columns)
  • the data are centered (otherwise we can always shift the origin, as in the sketch below)
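
As a minimal sketch of the centering step (my own toy example, not part of the original slides), R's scale function with scale = FALSE subtracts each column's mean, putting the origin at the centroid of the data:

dat <- matrix(c(1, 2, 3, 4, 6, 8), ncol = 2)              # three points in 2D
dat_centered <- scale(dat, center = TRUE, scale = FALSE)  # subtract each column's mean
colMeans(dat_centered)                                    # both columns now average to zero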

PCA in a nutshell

Formulation of the problem:

  • An \(n \times d\) matrix \(X\) is a data set of \(n\) points \(\mathbf{x}_i^T\) (rows) in \(d\)-dimensional space (columns).

  • We need to find a smaller subspace that captures most of the variance of \(X\) (e.g. if our points sit on a line, that line alone is sufficient to describe all of our data).

Solution:

  • Let \(\mathbf{v}\) be a unit vector defining some \(1D\) subspace

  • The projections of the points \(\mathbf{x}\) onto \(\mathbf{v}\) are given by the vector \(X\mathbf{v}\); for centered data \(\sum_i (X\mathbf{v})_i = 0\)

  • Maximizing the variance \(\sigma_\mathbf{v}^2 = \frac{1}{n-1}|X\mathbf{v}|^2 = \frac{1}{n-1}\,\mathbf{v}^T X^T X \mathbf{v}\) subject to \(|\mathbf{v}| = 1\) (e.g. with a Lagrange multiplier) shows that the optimal \(\mathbf{v}\) is an eigenvector of the covariance matrix: \(\left(\frac{1}{n-1} X^T X\right) \mathbf{v} = \sigma_\mathbf{v}^2 \mathbf{v}\)

  • The \(d\) eigenvectors form an orthonormal basis; eigenvectors with larger eigenvalues capture more of the variance (see the sketch after this list)
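
As a quick numerical check of this claim (a sketch of my own, not part of the original slides), we can scan over unit vectors in the plane and confirm that none of them yields a larger projected variance than the top eigenvector of the covariance matrix:

library(MASS)                                         # provides mvrnorm
set.seed(1)
X  <- scale(mvrnorm(10000, c(0,0), matrix(c(3,2,2,4),ncol=2)), scale=FALSE)  # centered 2D toy data
C  <- cov(X)                                          # equals t(X) %*% X / (n-1) for centered data
v1 <- eigen(C)$vectors[,1]                            # unit eigenvector with the largest eigenvalue
proj_var <- function(v) var(as.vector(X %*% v))       # variance of projections onto a unit vector v
angles <- seq(0, pi, length.out = 181)                # scan over all directions in the plane
max(sapply(angles, function(a) proj_var(c(cos(a), sin(a)))))  # never exceeds ...
proj_var(v1)                                          # ... the variance along v1 ...
eigen(C)$values[1]                                    # ... which equals the top eigenvalue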

Rotation of the basis

Each point \(\mathbf{x}_i\) can be decomposed in the basis of eigenvectors \(\mathbf{v}_j\) as: \(\mathbf{x}_i = \sum_{j=1}^{d} p_{ij} \mathbf{v}_j\)

The projections \(p_{ij}\) scale with the spread \(\sigma_j\) along \(\mathbf{v}_j\) and can be factored as: \(p_{ij} = u_{ij} \sigma_j\)

Collected in matrix form, this is the canonical Singular Value Decomposition (SVD): \(X = U D V^T\)
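
A small sketch (reusing the toy matrix X from the check above; my own illustration, not from the slides) verifies this decomposition with R's built-in svd and ties the singular values back to the variances along the principal axes:

s <- svd(X)                                   # returns u (left), d (singular values), v (right)
max(abs(X - s$u %*% diag(s$d) %*% t(s$v)))    # ~ 0: U D V^T reproduces X
s$d^2 / (nrow(X) - 1)                         # squared singular values / (n-1) ...
eigen(cov(X))$values                          # ... match the eigenvalues of the covariance matrix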

If \(X\) is degenerate, the basis reduces to just \(r\) vectors, where \(r\) is the rank of \(X\)
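
To see the degenerate case in action (again my own sketch, reusing X from above), append a column that is an exact linear combination of the others; one singular value collapses to numerical zero and the rank drops accordingly:

Xdeg <- cbind(X, X[,1] + X[,2])               # third column is a linear combination of the first two
svd(Xdeg)$d                                   # the third singular value is numerically zero
qr(Xdeg)$rank                                 # rank r = 2: two basis vectors describe all the data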

Terminology:

  • \(V\) holds the right singular vectors, also known as the principal components
  • \(D\) is a diagonal matrix of singular values, the square roots of the variances along the principal axes (up to a factor of \(\sqrt{n-1}\))
  • \(U\) holds the left singular vectors, also an orthonormal set, of little practical importance here
  • \(P = U D\) are known as the scores

Loosely speaking, the SVD rotates the data onto a set of axes ordered by how much of the data spread each one captures
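
This terminology maps directly onto prcomp's output; the following sketch (assuming the centered toy matrix X from before) checks the correspondence, with abs() used because individual components are only defined up to a sign:

pca <- prcomp(X)                                   # X is already centered, no scaling needed here
s   <- svd(X)
max(abs(abs(pca$rotation) - abs(s$v)))             # columns of V = principal components ("rotation")
max(abs(pca$sdev - s$d / sqrt(nrow(X) - 1)))       # sdev = singular values / sqrt(n-1)
max(abs(abs(pca$x) - abs(s$u %*% diag(s$d))))      # scores P = U D ("x")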

SVD matrix multiplication

(figure: schematic of the \(X = U D V^T\) matrix product)

Practical demonstration in R (1)

Let us first generate a strongly correlated bivariate Gaussian sample:

require(MASS)                                                       # provides mvrnorm
xy <- mvrnorm( 1000000, c(1,2), matrix(c(3,2,2,4),ncol=2) )         # 10^6 points, mean (1,2), off-diagonal covariance 2
plot(xy[sample(nrow(xy),10000),], xlab="x", ylab="y", pch=1)        # scatter plot of a 10,000-point subsample

(plot: scatter of the 10,000 sampled points)

Practical demonstration in R (2)

Standardize the data and find the new basis maximizing the variance:

std <- scale(xy) # same as x' <- (x-mean(x))/sd(x); y' <- (y-mean(y))/sd(y)
plot( std[ sample(nrow(std),10000), ], xlab="x'", ylab="y'", pch=1)
v <- eigen( cov(std) )$vectors   # eigenvectors of the covariance matrix, one per column
abline(0, v[2,1]/v[1,1], col='red'); abline(0, v[2,2]/v[1,2], col='red');  # slope of eigenvector j is v[2,j]/v[1,j]

(plot: standardized sample with the two principal axes drawn in red)

Practical demonstration in R (3)

Calculate the scores manually:

std[1,] %*% eigen( cov(std) )$vectors   # project the first standardized point onto the eigenvectors
          [,1]      [,2]
[1,] -1.323191 0.3167514

and see that these scores are identical (up to a sign) to those returned by prcomp:

prcomp(xy,retx=T,scale=T)$x[1,]
       PC1        PC2 
-1.3231911  0.3167514 
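
As a hypothetical follow-up check (not in the original slides), the same equality holds for every row at once, up to column-wise sign flips:

manual <- std %*% eigen(cov(std))$vectors           # scores computed by hand for all points
auto   <- prcomp(xy, retx=TRUE, scale.=TRUE)$x      # scores returned by prcomp
all.equal(abs(manual), abs(auto), check.attributes=FALSE)  # should print TRUE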

Summary