Introduction to Dimension Reduction – Principal Component Analysis

In this tutorial, we will go through one of the most important concepts in machine learning: Dimension Reduction using Principal Component Analysis (also known as PCA). So let's get straight into this concept.

What is Dimension Reduction?

We are all acquainted with machine learning models. We build them by fitting our data set to a suitable algorithm. Sometimes there are hundreds of features in the data we use. It may seem that all the features are important, but that is not always the case. This is where DIMENSION REDUCTION comes in: we want to keep only the most relevant features, because many irrelevant features increase the computation time and can decrease the accuracy of the model. So, let me define what this concept really is. Dimension reduction is a procedure that reduces a large number of features or dimensions to a smaller number of dimensions, in such a way that the model still describes the important information almost as concisely as before.

There are several techniques for dimension reduction, such as:

  1. Correlation between the features
  2. Random Forest
  3. Decision tree
  4. Backward Feature Elimination
  5. Low Variance
  6. Principal Component Analysis (PCA), and many more.

So, let’s move straight into the method called PCA and gain some knowledge about it.

Principal Component Analysis

In PCA, the original set of features is converted into a new set of features. Each new feature is a linear combination of the original features; these new features are called principal components. We construct this set so that the first component accounts for as much of the variation in the data as possible, the second for as much of the remaining variation, and so on.

These principal components are sensitive to changes in the measurement scale. So, before performing principal component analysis you must scale the features.

Step-by-Step Guide to Performing PCA

  • First of all comes feature scaling or normalization –

This is done so that the model does not get biased towards some specific features. In simple words, feature scaling means scaling the features so that they contribute equally to the model. Since PCA is sensitive to the scale of measurement, feature scaling is essential here. It does not affect categorical variables, but it can change the values of numerical variables significantly.
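As a rough illustration, here is how this standardization might look with scikit-learn's StandardScaler. The small X matrix below is made-up example data, not from any particular data set:

```python
# A minimal sketch of feature scaling (standardization) with scikit-learn.
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data: 3 samples with 2 features on very
# different scales, e.g. height in cm and weight in kg.
X = np.array([[170.0, 65.0],
              [160.0, 72.0],
              [180.0, 80.0]])

# Subtract each column's mean and divide by its standard deviation,
# so every feature ends up with mean 0 and unit variance.
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # approximately [0, 0]
print(X_scaled.std(axis=0))   # approximately [1, 1]
```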

  • Computing the covariance matrix –

This matrix basically tells us whether there is any relationship between the different variables. Its values represent how the variables vary from their means with respect to each other. By building this covariance matrix, we come to know which variables are closely related and which ones are redundant. It is an n x n matrix, where n is the total number of features in the input data set, and it is symmetric, since the covariance of two variables does not depend on their order. If a value in the covariance matrix is positive, the corresponding variables have a positive correlation; if it is negative, it signifies a negative relationship between the variables.
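Continuing from the X_scaled array in the previous snippet, a minimal sketch of this step with NumPy could look like this:

```python
# Compute the covariance matrix of the standardized data.
# rowvar=False treats columns as variables (features) and rows as samples.
cov_matrix = np.cov(X_scaled, rowvar=False)

print(cov_matrix)  # an n x n matrix, where n is the number of features
# The matrix is symmetric because Cov(a, b) == Cov(b, a).
print(np.allclose(cov_matrix, cov_matrix.T))  # True
```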

  • Compute eigenvectors and eigenvalues –

To compute the principal components, we require the eigenvectors and eigenvalues of the covariance matrix. Since principal components are linear combinations of the original features, we need constant coefficients: these come from the eigenvectors, while the corresponding eigenvalues tell us how much variance each component captures. The components built this way have no correlation with each other. We order the eigenvectors by their eigenvalues in descending order. What PCA does is try to describe most of the information with the first component and the rest with the following ones.
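Building on cov_matrix from the previous snippet, one possible way to compute and sort the eigenvalues and eigenvectors with NumPy is:

```python
# np.linalg.eigh is appropriate here because the covariance matrix
# is symmetric; it returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)

# PCA wants the largest eigenvalue first, so reverse the order and
# reorder the eigenvector columns to match.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Fraction of the total variance captured by each principal component.
print(eigenvalues / eigenvalues.sum())
```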

  • Feature Vector –

This vector is basically a matrix with the most important eigenvectors as its columns. This is where the dimension reduction happens: if we keep only k eigenvectors, the transformed data will have k dimensions.
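Continuing the running example, building the feature vector is just a matter of keeping the first k eigenvector columns. The value k = 1 below is chosen arbitrarily for illustration:

```python
# Keep only the eigenvectors belonging to the k largest eigenvalues.
k = 1
feature_vector = eigenvectors[:, :k]  # shape: (n_features, k)
print(feature_vector)
```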

  • Convert the feature vector – 

The last step is to re-express the data in terms of the new components. We do this by multiplying the transpose of the feature vector with the transpose of the standardized data, which projects each sample onto the selected principal components.
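Putting the running example together, the projection step could be sketched as below. The scikit-learn PCA call at the end is just a sanity check against a library implementation; its output may differ in sign, since eigenvectors are only defined up to sign:

```python
# Project the standardized data onto the selected principal components.
# (feature_vector.T @ X_scaled.T).T is equivalent to X_scaled @ feature_vector.
X_reduced = (feature_vector.T @ X_scaled.T).T
print(X_reduced.shape)  # (n_samples, k)

# For comparison: scikit-learn performs centering, eigendecomposition
# and projection in a single call.
from sklearn.decomposition import PCA
X_sklearn = PCA(n_components=k).fit_transform(X_scaled)
print(X_sklearn.shape)  # (n_samples, k)
```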

With this, I end this post. Post your doubts in the comments section.

Also, give a read to https://www.codespeedy.com/random-forest-for-regression-and-its-implementation/

