Aalaya Seshadri
10 min read · Dec 25, 2020


PCA: Intuition Behind the Math - The Unanswered Questions

PCA is by far the most frequently asked topic in the Data Science interviews I have attended.

Sometimes I feel it is not even asked to test the candidate, but rather for the interviewers to get a better understanding of the concept themselves :)

There are hundreds of articles out there on PCA, so why another one?!

Other articles expect you to already know the basic matrix operations; this one focuses on building that foundation, so we will understand the intuition behind the math together. No worries even if your foundation is shaky.

While going through articles on PCA, many questions might have boggled your mind, such as:

  1. What is an eigenvector?
  2. Why is the determinant assumed to be zero while solving the equation Ax = λx?
  3. What does an eigenvector of the Covariance matrix signify?

We all know that PCA is used for dimensionality reduction. Chances are you have already referred to multiple articles to understand PCA and found the concept particularly elusive; if you have not, that's fine too.

I am going to help you understand PCA. We will deal with it in two parts.

Part 1- Focuses on the math foundation

Part 2- Connects the foundation to PCA

Note: I urge you to watch the 3Blue1Brown videos. They have done a spectacular job of simplifying complex concepts through visuals. Consider the current article a digest of what I have learned there.

Without any further ado, let’s jump into the math:

Part 1

We will primarily focus on the three concepts below.

Concept 1: Linear transformation representation in matrix language

Concept 2: What does the determinant of a matrix represent?

Concept 3: Eigenvalues and Eigenvectors

Linear transformation representation in matrix language:

Visualizing linear transformation of 2D Space:

[Figure: the transformed grid in blue, the original grid in grey]

Properties of Linear Transformation:

  1. All lines remain lines post-transformation
  2. The origin remains fixed

Now, let us see, how can we describe the above linear transformation numerically.

Original Space:

Here i and j are called basis vectors (in other words, reference vectors: any vector in the space can be represented/understood using basis vectors). In the above example, the vector represented by the yellow line can be understood as -1 part of i plus 2 parts of j.

What is the significance of basis vectors?

If we understand where the basis vectors i and j have landed post-transformation, we can calculate how any vector in the original space is affected by the transformation.

Now let us see how this vector (-1,2) looks in the transformed space.

Transformed Space:

Say some transformation was applied to the 2D space. Focus on where i-hat and j-hat have landed: at the coordinates (1,-2) and (3,0) respectively.

Now, since we have understood where i and j have landed post-transformation, we can deduce where (-1,2) will land in the transformed space, i.e. (5,2).

All we have to do is substitute i and j with the transformed i and j, as shown below.
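Spelling out the substitution as an equation:

$$-1 \cdot \begin{bmatrix} 1 \\ -2 \end{bmatrix} + 2 \cdot \begin{bmatrix} 3 \\ 0 \end{bmatrix} = \begin{bmatrix} -1 + 6 \\ 2 + 0 \end{bmatrix} = \begin{bmatrix} 5 \\ 2 \end{bmatrix}$$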

There is an important hidden lesson to be learned here.

A 2D linear transformation can be completely described by a 2×2 matrix, which essentially comprises the information of where the basis vectors i and j land post-transformation.

The above transformation can be described in matrix form by writing the transformed i and j as its columns:

A = [[1, 3], [-2, 0]]

(first column: transformed i = (1,-2); second column: transformed j = (3,0))
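Here is a minimal numpy sketch of Concept 1, verifying that the matrix built from the transformed basis vectors sends (-1,2) to (5,2):

```python
import numpy as np

# Columns of A record where the basis vectors land:
# transformed i = (1, -2), transformed j = (3, 0)
A = np.array([[1, 3],
              [-2, 0]])

v = np.array([-1, 2])  # a vector from the original space
print(A @ v)           # [5 2] -- matches the deduction above
```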

Voila, Concept 1 done!! Not too difficult, right?

What does the determinant of a matrix represent:

Now that we know how to represent a linear transformation visually and in matrix form, we can see that a linear transformation sometimes stretches the space and sometimes squishes it.

One interesting thing to focus on here is how much the linear transformation stretches or squishes the space; in other words, by how much area is scaled. This is what the determinant intuitively means.

Original space: the area of a single square formed by the basis vectors in the original space is 1 × 1 = 1.

Transformed space:

Let us say we are transforming the space, and the transformation is represented by the matrix below, which basically holds the information of where i-hat and j-hat land post-transformation; say transformed i = (3,0) and transformed j = (1,2) (an illustrative choice consistent with the numbers that follow):

A = [[3, 1], [0, 2]]

The basis vectors now enclose a parallelogram instead of a square. The area of that single parallelogram is base * height = 3 * 2 = 6.

This special scaling factor, the factor by which a linear transformation changes any area, is called the determinant of the transformation.
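A quick numpy check of this scaling factor, using the example matrix above:

```python
import numpy as np

# transformed i = (3, 0) and transformed j = (1, 2) as columns
A = np.array([[3, 1],
              [0, 2]])

# det(A) = 3*2 - 1*0 = 6: every area gets scaled by a factor of 6
print(np.linalg.det(A))  # 6.0 (up to floating-point rounding)
```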

What does a determinant of zero mean for a 2D transformation?

It means the area is zero, which in turn means the entire space has been squished onto a single line or even a point.

Understanding this is very important.

Checking whether the determinant of a given matrix is zero tells you whether the transformation associated with that matrix squishes everything into a smaller dimension, in other words into zero area (a line or a point).

What does the determinant of a 3D transformation mean? By now you might have guessed:

It tells how the volume is scaled by the transformation.

In a 3D transformation, what does a zero determinant mean?

It means the volume is zero, which in turn means the entire space is squished onto a plane, a line, or in the most extreme case a point.
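As an illustration (with a hypothetical matrix), here is a 3D transformation whose columns all have a zero z-coordinate, so the whole space lands in a plane:

```python
import numpy as np

# Every column has z = 0, so all of 3D space is squished
# onto the z = 0 plane; the volume scaling factor is zero.
B = np.array([[1, 0, 1],
              [0, 1, 1],
              [0, 0, 0]])

print(np.linalg.det(B))  # 0.0
```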

Eigenvalues and Eigenvectors:

Now that we know how to represent a transformation as a 2×2 matrix, consider another transformation where the new basis vectors are as follows:

transformed i = (3,0)

transformed j =(1,2)

The 2×2 transformation matrix can be written with these as its columns:

A = [[3, 1], [0, 2]]

When this transformation is applied to the original space, most vectors get knocked off their span (the span of a vector is the line obtained by stretching the vector to infinity in both directions without changing its slope).

[Figure: the pink line represents the span of a vector v in the original space; the yellow line represents the same vector in the transformed space]

There are some sneaky vectors whose span does not change post-transformation: they just get scaled by a factor, but the span/slope stays the same. For example, (-1,1):

Calculation:

-1 * (transformed i) + 1 * (transformed j) = -1 * (3,0) + 1 * (1,2) = (-2,2)

The vector (-1,1) of the original space (usually represented by the grey grid) just got stretched by a factor of 2 by the transformation, without its span/slope changing.

These special vectors whose span does not change under a transformation are called eigenvectors, and the factor by which they are stretched/squished is called the eigenvalue.

How to calculate the eigenvector?

Simply go by the definition; it can be formulated as:

Av = λv

Here A is the 2×2 matrix describing the transformation, and v is an eigenvector: the transformation does not change its span but merely scales it by the factor λ (the eigenvalue).

The above equation can be rewritten as:

(A - λI)v = 0

where I is the identity matrix.

If applying the transformation (A - λI) to a nonzero vector results in the zero vector, it means the entire space has been squished onto a line (or a point).

[Figure: the space before transformation (left) and after transformation (right)]

In the above diagram, you can see the entire blue grid space is squished to zero, meaning the yellow line in the blue grid also got squished to nothing along with the space post-transformation.

From Concept 2, we know that zero area means the determinant is zero.

Solving the equation det(A - λI) = 0 gives the eigenvalues.

Calculations (for our example matrix A = [[3, 1], [0, 2]]):

det(A - λI) = (3 - λ)(2 - λ) = 0, which gives λ = 3 or λ = 2.

Once you get the value of λ (the eigenvalue), you can calculate the eigenvector by substituting the eigenvalue into the formula (A - λI)v = 0. For λ = 2, this gives v proportional to (-1,1), the sneaky vector we spotted earlier.
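A minimal numpy check of these calculations for the example transformation (transformed i = (3,0), transformed j = (1,2)):

```python
import numpy as np

A = np.array([[3, 1],
              [0, 2]])

# By hand: det(A - lambda*I) = (3 - lambda)(2 - lambda) = 0 => lambda = 3 or 2.
# Substituting lambda = 2 back into (A - lambda*I)v = 0 gives v ~ (-1, 1).
eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)   # 3. and 2. (order may vary)
print(eigenvectors)  # columns are unit-length eigenvectors;
                     # one of them is proportional to (-1, 1)
```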

Now that we are prepped up with the basics, let us take a break before we jump in to connect the dots!!

Part 2:

Welcome Back !!

Principal Component Analysis:

Why and How?

Principal component analysis aids in dimensionality reduction. In other words, we try to find a compact representation of data without losing much information.

Consider a dataset with the below 5 dimensions:

Number of skidding accidents, number of burst water pipes, school closures, snowplow expenditures, and patients with heat stroke.

In reality, temperature could be the one hidden dimension that explains the occurrence of most of these things.

I mean, think about it: at lower temperatures, skidding accidents are more frequent due to ice, patients with heat stroke are nearly zero, pipes might burst, schools might be closed, and obviously snowplow expenditures are high due to the accumulation of snow.

PCA is important because, when you are modeling, the model looks at all five dimensions when it should instead focus on the hidden dimension, temperature, because that explains things better and simplifies the problem to a great extent. Other perks of viewing data in lower dimensions are less storage and lower computational cost.

PCA makes use of the correlations in the data and checks whether the data can be represented better in a smaller number of hidden dimensions, which are basically linear combinations of the original dimensions.

Step 1: Centering the Data to (0,0) Mean

Consider a dataset with two dimensions X1 and X2

Once you have the center (average) of the data, you subtract it from each of the data points to shift the entire dataset to a (0,0) mean.

Note: shifting the data does not change how the data points are positioned relative to each other; it just simplifies the calculations.
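A minimal sketch of the centering step, with hypothetical values for X1 and X2:

```python
import numpy as np

# Hypothetical dataset: one row per data point, columns are X1 and X2
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

X_centered = X - X.mean(axis=0)  # subtract the per-dimension average
print(X_centered.mean(axis=0))   # ~[0. 0.]: the data is now centered at (0, 0)
```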

Step 2: Calculate the Variance-Covariance Matrix

What does covariance mean?

Covariance reflects the relationship between any two dimensions (say x1 and x2).

In layman's terms: how these two dimensions move with respect to each other, or how the variability of one dimension affects the other.

For example: does house price (x1) increase with the number of bedrooms (x2), or does it decrease?

Formula for covariance (using the sample normalization):

cov(x1, x2) = Σ (x1ᵢ - mean(x1)) * (x2ᵢ - mean(x2)) / (n - 1)

Using the formula for every pair of dimensions, we can come up with the covariance matrix; the diagonal entries are the variances of each dimension.
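A minimal numpy sketch, reusing the hypothetical data from Step 1 (note that np.cov defaults to the 1/(n-1) normalization):

```python
import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])
X_centered = X - X.mean(axis=0)

# rowvar=False treats each column as a dimension
cov_matrix = np.cov(X_centered, rowvar=False)
print(cov_matrix)  # 2x2: variances on the diagonal, covariances off it
```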

Post calculation on a dataset, say the result is as follows (these are the numbers used in the calculations below):

Covariance matrix C = [[2.0, 0.8], [0.8, 0.6]]

Now here comes the important question, What is the point of Calculating the Covariance matrix?

This particular matrix has an interesting property: take any vector, say (-1,1); when it is repeatedly multiplied by the covariance matrix, the slope of the resultant vector converges.

Please look at the calculations below reflecting the same.

Elaboration of the calculation (simple matrix multiplication):

Multiplying the covariance matrix with the vector (-1,1):

(2.0 * -1 + 0.8 * 1)= -1.2

(0.8 * -1 + 0.6 * 1)=-0.2

Multiplying the above output again by the covariance matrix:

(2.0 * -1.2 + 0.8 * -0.2) = -2.56 ≈ -2.5

(0.8 * -1.2 + 0.6 * -0.2) = -1.08 ≈ -1

Hope you can continue with the rest of the calculations. What is interesting to observe is that the slope converges to approximately 0.454.

Also observe that the slope of the resultant vector, with each multiplication by the covariance matrix, turns towards the direction of maximum variance. First it takes a big turn, then a smaller turn, and slowly the slope converges, as the calculations show.

It is this interesting property of the Covariance matrix that makes it special.
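Here is a short sketch of this repeated multiplication (this is the power-iteration idea), reproducing the numbers above:

```python
import numpy as np

# The covariance matrix from the example above
C = np.array([[2.0, 0.8],
              [0.8, 0.6]])

v = np.array([-1.0, 1.0])
for step in range(1, 9):
    v = C @ v
    print(step, v[1] / v[0])  # slope of the resultant vector

# The slope settles near 0.454: the direction of the covariance
# matrix's dominant eigenvector.
```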

Now Quiz time:

The slope converges to a final value -> the slope becomes constant -> meaning the slope does not change on further multiplication by the covariance matrix.

What do you call a vector whose slope (in other words, span) does not change after some transformation? Yes, you guessed it right: that is an eigenvector.

Using the eigenvector concept and this special property of the covariance matrix, we figured out the direction along which the maximum variance in the data is captured: the first principal component.
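Putting the pieces together, a minimal end-to-end sketch with the hypothetical data from Step 1: center, compute the covariance matrix, take its top eigenvector, and project onto it:

```python
import numpy as np

X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])

# Step 1: center the data; Step 2: covariance matrix
X_centered = X - X.mean(axis=0)
C = np.cov(X_centered, rowvar=False)

# The eigenvector with the largest eigenvalue points along the
# direction of maximum variance: the first principal component.
eigenvalues, eigenvectors = np.linalg.eig(C)
pc1 = eigenvectors[:, np.argmax(eigenvalues)]

# Projecting onto pc1 gives a 1D representation of the 2D data
# that keeps as much variance as possible.
X_reduced = X_centered @ pc1
print(X_reduced)
```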

For a better understanding of PCA, please also watch the YouTube lectures by Victor Lavrenko (https://youtu.be/fKivxsVlycs). Some of the key content here is taken from those lectures. Also watch the StatQuest videos.
