Implementation of Random Forest for classification in python

In the previous tutorial, I discussed the intuition behind the Random Forest algorithm. Before going through this post, you should be familiar with that intuition. In this post, I will discuss the implementation of random forest in Python for classification. Classification is used when we have to assign an unknown item to a class, often yes or no, although there can be more than two classes. There are other algorithms such as logistic regression and decision trees; random forest often performs well among them, although which one works best depends on the data set.

Here is the link to the data set I have used – Social_Network_Ads.CSV

You may also be interested in learning: Random forest for regression and its implementation

Implementation of Random forest for classification

Here are the steps you can follow to run the algorithm and perform classification. I will also walk through an example to give you a better understanding of how to write the code.

  • First of all, import the necessary libraries.
     import numpy as np
     import matplotlib.pyplot as plt
     import pandas as pd
  • Now import the data set.
     dataset = pd.read_csv('Social_Network_Ads.csv')

This is what the data set looks like.

[Image: preview of the data set]
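Before selecting columns, it can help to check the column order, the data types, and the class balance. The sketch below uses a few made-up rows as a stand-in for the real file; the column names are an assumption based on the axis labels and column positions used later in this post, so check them against your own copy of the CSV.

```python
import pandas as pd

# Hypothetical stand-in rows; in practice this would be the
# DataFrame returned by pd.read_csv('Social_Network_Ads.csv').
# The column names are assumed, not taken from the real file.
sample = pd.DataFrame({
    "User ID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 47],
    "EstimatedSalary": [19000, 20000, 43000, 25000],
    "Purchased": [0, 0, 0, 1],
})

print(sample.head())        # first few rows
print(sample.dtypes)        # data type of every column
print(sample.isna().sum())  # number of missing values per column

# Map the positional indices used with iloc back to column names
feature_names = list(sample.columns[[2, 3]])
target_name = sample.columns[4]
print(feature_names, target_name)

# Class balance of the target column
class_counts = sample[target_name].value_counts()
print(class_counts)
```

Checking the positions this way guards against silently picking the wrong columns if the file layout differs from what the code expects.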

  • After you have imported the data set, go through it thoroughly and keep only the columns you need. Here, columns 2 and 3 (Age and Estimated Salary) are used as features and column 4 as the class label.
     X = dataset.iloc[:, [2, 3]].values
     y = dataset.iloc[:, 4].values
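  • Now split your data set into training and testing sets. Common splitting ratios are 7:3 and 8:2; the code below holds out 25% of the rows for testing (test_size = 0.25). Note that train_test_split lives in sklearn.model_selection; the older sklearn.cross_validation module has been removed from scikit-learn.
     from sklearn.model_selection import train_test_split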
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

[Images: contents of y_train and X_train]
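To see what the split does, here is a minimal sketch on synthetic data. The array sizes and the class balance are hypothetical, and the stratify argument is an optional addition that is not used in the code above; it keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 400 rows, 2 features, hypothetical class balance
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 2))
y_demo = np.array([0] * 260 + [1] * 140)

# test_size=0.25 holds out a quarter of the rows;
# stratify=y_demo preserves the share of each class in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

print(X_tr.shape, X_te.shape)    # rows and columns in each part
print(y_tr.mean(), y_te.mean())  # share of class 1 in each part
```

Fixing random_state makes the split reproducible, which matters when you compare different models on the same test set.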

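  • This step feature-scales your data. Scaling ensures that a feature with values in the range 1000-20000 does not dominate a feature with values in the range 1-100. Tree-based models such as random forest are largely insensitive to feature scale, but scaling keeps the grid used in the plots later in this post at a manageable size. The scaler is fitted on the training set only, and the same transformation is then applied to the test set.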
     from sklearn.preprocessing import StandardScaler
     sc = StandardScaler()
     X_train = sc.fit_transform(X_train)
     X_test = sc.transform(X_test)
  • Now comes the main task, i.e. fitting the classifier to the training set. First import the required class. Go through the documentation of RandomForestClassifier and understand the meaning and usage of each parameter. Here, for example, I have set the number of trees (n_estimators) to 10; you can use more or fewer as per your requirement. I have set criterion to “entropy”.
    from sklearn.ensemble import RandomForestClassifier
    classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
    classifier.fit(X_train, y_train)
  • Now apply the model to the test set and predict the test set results.
    y_pred = classifier.predict(X_test)
  • To evaluate the performance of your model, there are several metrics available, such as AUC, the ROC curve, and the confusion matrix. I have used the confusion matrix here.
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(y_test, y_pred)
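A confusion matrix is easier to interpret once it is reduced to a few summary numbers. The sketch below uses small made-up label arrays (not the results from this data set) to show how accuracy, precision, and recall follow from the four cells. In scikit-learn, rows are the true classes and columns are the predicted classes.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)

# Made-up labels for illustration only
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm_demo = confusion_matrix(y_true, y_hat)
# For a binary problem the flattened order is TN, FP, FN, TP
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tp + tn) / cm_demo.sum()  # share of correct predictions
precision = tp / (tp + fp)            # how many predicted 1s are real 1s
recall = tp / (tp + fn)               # how many real 1s were found

print(cm_demo)
print(accuracy, precision, recall)

# The same values from scikit-learn's helper functions
print(accuracy_score(y_true, y_hat),
      precision_score(y_true, y_hat),
      recall_score(y_true, y_hat))
```

Applied to the tutorial's own variables, the same calls would take y_test and y_pred in place of the made-up arrays.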
[Image: confusion matrix for the test set]
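The scaling, fitting, and prediction steps above can also be chained into a single scikit-learn Pipeline, so that the scaler is always fitted on the training data only. This is an alternative arrangement, not the code used in this post, and the data here is a synthetic stand-in generated with make_classification.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature data as a stand-in for the real data set
X_demo, y_demo = make_classification(
    n_samples=400, n_features=2, n_informative=2,
    n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Step names "scale" and "forest" are arbitrary labels chosen here
model = Pipeline([
    ("scale", StandardScaler()),
    ("forest", RandomForestClassifier(
        n_estimators=10, criterion="entropy", random_state=0)),
])

model.fit(X_tr, y_tr)                     # fits scaler, then forest
y_hat = model.predict(X_te)               # scales X_te, then predicts
test_accuracy = model.score(X_te, y_te)   # mean accuracy on test set
n_trees = len(model.named_steps["forest"].estimators_)

print(n_trees, test_accuracy)
```

With a pipeline there is no separate X_train = sc.fit_transform(X_train) step to forget, which removes one common source of data leakage.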

Visualizing the output – Random Forest Classification in Python

  • The modelling itself ends here. Now we will visualize the train and test set results.
    # Visualising the Training set results
    from matplotlib.colors import ListedColormap
    X_set, y_set = X_train, y_train
    # Build a fine grid that covers the (scaled) feature space
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                         np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
    # Colour every grid point by the class the forest predicts for it
    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha = 0.75, cmap = ListedColormap(('red', 'green')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    # Overlay the actual observations, one colour per class
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    color = ListedColormap(('red', 'green'))(i), label = j)
    plt.title('Random Forest Classification (Training set)')
    plt.xlabel('Age')
    plt.ylabel('Estimated Salary')
    plt.legend()
    plt.show()
[Image: visualization of the training set result]

    # Visualising the Test set results
    from matplotlib.colors import ListedColormap
    X_set, y_set = X_test, y_test
    # Build a fine grid that covers the (scaled) feature space
    X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                         np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
    # Colour every grid point by the class the forest predicts for it
    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha = 0.75, cmap = ListedColormap(('red', 'green')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    # Overlay the actual observations, one colour per class
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    color = ListedColormap(('red', 'green'))(i), label = j)
    plt.title('Random Forest Classification (Test set)')
    plt.xlabel('Age')
    plt.ylabel('Estimated Salary')
    plt.legend()
    plt.show()
[Image: visualization of the test set result]
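The two plotting blocks above repeat the same grid-and-predict step. One way to avoid the duplication is a small helper that returns the grid and the predicted class for each grid point. The function name decision_grid and its parameters are hypothetical, and the data below is synthetic; treat this as a sketch rather than part of the original code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def decision_grid(clf, X_set, step=0.01, margin=1.0):
    """Return meshgrid arrays X1, X2 and the class predicted at each point."""
    x1 = np.arange(X_set[:, 0].min() - margin, X_set[:, 0].max() + margin, step)
    x2 = np.arange(X_set[:, 1].min() - margin, X_set[:, 1].max() + margin, step)
    X1, X2 = np.meshgrid(x1, x2)
    # One row per grid point, two columns (feature 1, feature 2)
    points = np.column_stack([X1.ravel(), X2.ravel()])
    Z = clf.predict(points).reshape(X1.shape)
    return X1, X2, Z

# Synthetic demo data with a simple linear class boundary
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=10, criterion="entropy", random_state=0
).fit(X_demo, y_demo)

# A coarser step keeps the demo grid small
X1, X2, Z = decision_grid(forest, X_demo, step=0.1)
print(X1.shape, Z.shape)
```

The returned arrays can be passed straight to plt.contourf(X1, X2, Z, alpha = 0.75, cmap = ListedColormap(('red', 'green'))) for either the training or the test set. The grid size grows with the feature range divided by step, which is why a step of 0.01 is only practical on scaled features.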

 

Feel free to post your questions in the comments.

You may also want to read:

https://codespeedy.com/understanding-support-vector-machine-svm/

In the next tutorial, I will discuss the implementation of random forest for regression.
