Implementation of PCA reduction in Python

In the last tutorial, I gave a brief introduction to and some intuition for Principal Component Analysis (PCA). If you haven't read that post yet, please go through it before reading this one. This post focuses on the implementation of PCA reduction in Python.

Here is the link to the data set that I have used: Wine.csv

Implementation of PCA reduction:

  • The first step is to import all the necessary Python libraries.
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
  • Import the data set after importing the libraries.
    data = pd.read_csv('Wine.csv')
  • Take the complete data, because the core task is only to apply PCA reduction to reduce the number of features. The first 13 columns hold the features and the last column holds the class label; a quick shape check follows this step.
    A = data.iloc[:, 0:13].values
    B = data.iloc[:, 13].values
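To confirm that the features and the label were separated correctly, you can run a quick sanity check. This is a minimal sketch of my own, assuming the Wine.csv layout of 13 feature columns followed by one label column:

    # Check that the feature matrix and target vector have the expected shapes
    print(data.shape)   # (samples, 14): 13 feature columns + 1 label column
    print(A.shape)      # (samples, 13)
    print(B.shape)      # (samples,)
    print(data.head())  # peek at the first few rows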


  • Split the data set into training and testing sets. Below is our Python code for this task:
    from sklearn.model_selection import train_test_split
    # Hold out 30% of the samples for testing
    A_train, A_test, B_train, B_test = train_test_split(A, B, test_size = 0.3)
  • Now comes the important step of feature scaling, so that the model is not biased towards any specific feature. A quick check of the scaled values follows this step.
    from sklearn.preprocessing import StandardScaler
    sc = StandardScaler()
    # Fit the scaler on the training data only, then apply the same scaling to the test data
    A_train = sc.fit_transform(A_train)
    A_test = sc.transform(A_test)
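As a quick check (my addition, not part of the original program), each column of the scaled training data should now have a mean of roughly 0 and a standard deviation of roughly 1:

    # Each training column should be centred at ~0 with ~unit spread
    print(A_train.mean(axis = 0).round(2))
    print(A_train.std(axis = 0).round(2))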


  • Now we will apply the PCA technique. First, import the PCA class and then fit the data into it. Tune the parameters as per the needs of your project; a sketch for choosing the number of components follows this step.
    from sklearn.decomposition import PCA
    # Project the 13 scaled features down to 2 principal components
    pca = PCA(n_components = 2)
    A_train = pca.fit_transform(A_train)
    A_test = pca.transform(A_test)
    # Fraction of the total variance captured by each kept component
    explained_variance = pca.explained_variance_ratio_
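If you are unsure how many components to keep, a common approach is to inspect the cumulative explained variance. The sketch below is my own illustration (pca_full and A_train_scaled are names introduced here, not from the original post); note that it must run on the scaled training data before the two-component reduction above, since A_train has already been reduced at this point:

    # Keep all components just to inspect their variance contributions
    pca_full = PCA()
    pca_full.fit(A_train_scaled)  # A_train_scaled: hypothetical copy of A_train taken before pca.fit_transform
    cumulative = np.cumsum(pca_full.explained_variance_ratio_)
    print(cumulative)
    # Smallest number of components retaining at least 95% of the variance
    print(np.argmax(cumulative >= 0.95) + 1)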


  • Now that you have an appropriate set of features, you can apply a suitable algorithm to get good accuracy. For example, I have used the logistic regression algorithm in my model.
    from sklearn.linear_model import LogisticRegression
    # Train a logistic regression classifier on the 2-component training data
    classifier = LogisticRegression(random_state = 0)
    classifier.fit(A_train, B_train)
  • The next step is to predict the results using the testing set; a quick look at the predictions follows this step.
    B_pred = classifier.predict(A_test)
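To eyeball the output (my addition, not part of the original program), you can compare the first few predicted labels against the true test labels:

    # Compare the first ten predictions with the actual test labels
    print(B_pred[:10])
    print(B_test[:10])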


  • Use any metric to evaluate the performance. For example, I have used the confusion matrix in this program; an accuracy check follows this step.
    from sklearn.metrics import confusion_matrix
    # Rows are actual classes, columns are predicted classes
    conf_matrix = confusion_matrix(B_test, B_pred)
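The confusion matrix can also be collapsed into a single accuracy number. Here is a minimal sketch using scikit-learn's accuracy_score (my addition, not part of the original program):

    from sklearn.metrics import accuracy_score
    # Fraction of test samples whose class was predicted correctly
    print(accuracy_score(B_test, B_pred))
    # Equivalent: correct predictions (the diagonal) over all predictions
    print(conf_matrix.trace() / conf_matrix.sum())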


Visualizing the results:

Here I will visualize the results produced by the model we have created after applying PCA reduction.

Visualizing training set results:

    from matplotlib.colors import ListedColormap
    A_set, B_set = A_train, B_train
    # Build a dense grid spanning the range of the two principal components
    A1, A2 = np.meshgrid(np.arange(start = A_set[:, 0].min() - 1, stop = A_set[:, 0].max() + 1, step = 0.01),
                         np.arange(start = A_set[:, 1].min() - 1, stop = A_set[:, 1].max() + 1, step = 0.01))
    # Colour each grid point by the class the classifier predicts there
    plt.contourf(A1, A2, classifier.predict(np.array([A1.ravel(), A2.ravel()]).T).reshape(A1.shape),
                 alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
    plt.xlim(A1.min(), A1.max())
    plt.ylim(A2.min(), A2.max())
    # Overlay the actual training points, coloured by their true class
    for i, j in enumerate(np.unique(B_set)):
        plt.scatter(A_set[B_set == j, 0], A_set[B_set == j, 1],
                    c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
    plt.title('Logistic Regression (Training set)')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.legend()
    plt.show()

Visualizing test set results:

    from matplotlib.colors import ListedColormap
    A_set, B_set = A_test, B_test
    A1, A2 = np.meshgrid(np.arange(start = A_set[:, 0].min() - 1, stop = A_set[:, 0].max() + 1, step = 0.01),
                         np.arange(start = A_set[:, 1].min() - 1, stop = A_set[:, 1].max() + 1, step = 0.01))
    plt.contourf(A1, A2, classifier.predict(np.array([A1.ravel(), A2.ravel()]).T).reshape(A1.shape),
                 alpha = 0.75, cmap = ListedColormap(('red', 'green', 'blue')))
    plt.xlim(A1.min(), A1.max())
    plt.ylim(A2.min(), A2.max())
    # Overlay the actual test points, coloured by their true class
    for i, j in enumerate(np.unique(B_set)):
        plt.scatter(A_set[B_set == j, 0], A_set[B_set == j, 1],
                    c = ListedColormap(('red', 'green', 'blue'))(i), label = j)
    plt.title('Logistic Regression (Test set)')
    plt.xlabel('PC1')
    plt.ylabel('PC2')
    plt.legend()
    plt.show()
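Since the two visualization blocks differ only in the data they plot, they can also be folded into a single helper. The sketch below is my own refactoring (plot_decision_regions is a name introduced here, not from the original post); it assumes the fitted classifier and the two-component data from above:

    def plot_decision_regions(A_set, B_set, title):
        # Build a dense grid spanning the two principal components
        A1, A2 = np.meshgrid(
            np.arange(A_set[:, 0].min() - 1, A_set[:, 0].max() + 1, 0.01),
            np.arange(A_set[:, 1].min() - 1, A_set[:, 1].max() + 1, 0.01))
        cmap = ListedColormap(('red', 'green', 'blue'))
        # Colour each grid point by the class the model predicts there
        plt.contourf(A1, A2, classifier.predict(
            np.array([A1.ravel(), A2.ravel()]).T).reshape(A1.shape),
            alpha = 0.75, cmap = cmap)
        plt.xlim(A1.min(), A1.max())
        plt.ylim(A2.min(), A2.max())
        # Overlay the actual points, coloured by their true class
        for i, j in enumerate(np.unique(B_set)):
            plt.scatter(A_set[B_set == j, 0], A_set[B_set == j, 1],
                        c = cmap(i), label = j)
        plt.title(title)
        plt.xlabel('PC1')
        plt.ylabel('PC2')
        plt.legend()
        plt.show()

    plot_decision_regions(A_train, B_train, 'Logistic Regression (Training set)')
    plot_decision_regions(A_test, B_test, 'Logistic Regression (Test set)')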

 

With this, I would like to end this post. Feel free to ask your doubts in the comments.

Also, give a read to Random forest for regression and its implementation.

