Author: Ari Cooper-Davis
Date Published: Dec 4th 2019
This is the first post in a series where I'll explore principal component analysis (PCA) from the perspective of a Data Scientist. I'll illustrate our journey using snippets from an iPython Notebook that you can download to follow along at home.
I'm a Postgraduate Researcher in Hydroinformatics, and as such I often find myself working with large and complex data-sets. PCA is one of the tools that I can use to help me to understand and simplify the relationships that these data-sets encode, which in turn can facilitate modelling, regressing, and forecasting.
PCA is a transformation that decomposes data into components that describe it in a slightly different way. You don't gain or lose any information with this transform; in fact, you can apply it in reverse to recover your original data. The result is a set of components, each a linear combination of the features in your data, ordered by the amount of the data's total variance that they explain.
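To make the "no information lost" claim concrete, here is a minimal sketch using NumPy. The data is synthetic and randomly generated (it is not the water quality data used later), and the variable names are my own choices for illustration. It transforms the data into components and then reverses the transform.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 50 samples of 3 correlated features.
# These are random numbers, not real measurements.
mixing = np.array([[2.0, 0.5, 0.1],
                   [0.0, 1.0, 0.3],
                   [0.0, 0.0, 0.2]])
X = rng.normal(size=(50, 3)) @ mixing

# Centre the data; the mean is kept so it can be added back later.
mean = X.mean(axis=0)
X_centered = X - mean

# The eigenvectors of the covariance matrix act as the components.
eigenvalues, eigenvectors = np.linalg.eigh(np.cov(X_centered, rowvar=False))

# Forward transform: describe each sample in terms of the components.
scores = X_centered @ eigenvectors

# Reverse transform: the eigenvector matrix is orthogonal,
# so multiplying by its transpose undoes the forward step.
X_recovered = scores @ eigenvectors.T + mean

print(np.allclose(X, X_recovered))
```

Keeping every component is what makes the round trip exact; the total variance of the scores matches the total variance of the original features.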
I said that you don't lose any information during the process, and that's true. However, once you've identified the components that best explain your data's variance, you may choose to discard the others. You might do this because you wish to reduce the dimensionality of your data without losing the relationships hidden in those dimensions. This can speed up data processing and make it easier to visualise the relationships in your data.
Granted, you'll lose a little variance by only keeping the components that explain most of it. But this could be intentional - those less descriptive components might be causing your model to over-fit, and removing them may help you to avoid this.
How does it work?
PCA works by building up a covariance matrix for your data. If your data is 𝑛 dimensional then this covariance matrix will be 𝑛×𝑛. It is then possible to determine 𝑛 linearly independent eigenvectors, each with an associated eigenvalue, from this covariance matrix. Those eigenvectors become your components, and the eigenvalues give their magnitude, and therefore the amount of variance that they explain.
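As a rough sketch of that recipe, assuming NumPy and a small synthetic two-feature data-set (random numbers, not the measurements used in this series), the covariance matrix and its eigen-decomposition might be computed like this. The names `data`, `order` and `explained_ratio` are just illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in: 12 samples of n = 2 correlated features.
x = rng.normal(size=12)
data = np.column_stack([x, 0.5 * x + rng.normal(scale=0.2, size=12)])

# n x n covariance matrix (rowvar=False means columns are features).
cov = np.cov(data, rowvar=False)

# eigh is designed for symmetric matrices such as a covariance matrix.
# It returns eigenvalues in ascending order, with eigenvectors as columns.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the component explaining the most variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Fraction of the total variance explained by each component.
explained_ratio = eigenvalues / eigenvalues.sum()

print(cov.shape)
print(explained_ratio)
```

I've used `np.linalg.eigh` rather than `np.linalg.eig` here because the covariance matrix is symmetric; either should give the same components up to ordering and sign.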
The easiest way to understand this intuitively is to see how it is performed, so let's go ahead with a worked example.
This isn't going to be a contrived example; we're going to be using real measured water quality data. This data is open access as part of the EU Water Framework Directive. I'm going to be looking at water quality samples from Coniston Water in The Lake District.
Despite this being a real example, this is also only the first post in the series so whilst PCA is most often used for dimensionality reduction we're going to choose just a couple of features from the dataset for now. Don't worry, we'll get to the reduction in a future post.
So, I've selected a couple of water quality metrics measured monthly in 2018 - Potassium and Sodium concentrations.
Step 1 - Subtract the Mean
My first instinct when looking for relationships between variables would be to plot one against the other and see if I can identify a relationship explaining how they vary.
That looks like a decent linear relationship. Notice the difference in the ranges though: Sodium measurements are far higher than Potassium measurements. To account for this I'm going to subtract the mean from each feature so that the data is centered around zero. This is a necessary step in PCA, and it also helps us to ignore the absolute values of our data and focus instead on how they vary together.
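In code, mean-centering could look something like the sketch below. The concentration values are made up purely for illustration (they are not the Coniston Water measurements), and the array names are my own.

```python
import numpy as np

# Illustrative values only, NOT the real Coniston Water samples.
potassium = np.array([0.50, 0.55, 0.48, 0.60, 0.52, 0.58])
sodium = np.array([4.1, 4.4, 3.9, 4.8, 4.2, 4.6])

# One row per sample, one column per feature.
data = np.column_stack([potassium, sodium])

# Subtract each feature's mean from its own column.
centered = data - data.mean(axis=0)

# Each column of the centered data now averages to (approximately) zero.
print(centered.mean(axis=0))
```

Subtracting the mean only shifts the data; the covariance between the two features is unchanged, which is why we can safely ignore the absolute values from here on.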