Implementation of Random Forest for Classification in Python
In the previous tutorial, I discussed the intuition behind the Random Forest algorithm. Before going through this post, you should be familiar with how a random forest works. In this post, I will discuss the implementation of random forest in Python for classification. Classification is used when we have to assign an unknown item to a class, typically yes or no, though there can be other classes as well. There are other algorithms such as logistic regression and decision trees, but a random forest, which combines the votes of many decision trees, is often one of the stronger performers among them.
Here is the link to the data set I have used – Social_Network_Ads.CSV
You may also be interested in learning: Random forest for regression and its implementation
Implementation of Random forest for classification
Here are the steps you can follow to run the algorithm and perform classification. I will also give you an example so that you have a better understanding of how to write the code.
- First of all, import the necessary libraries.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
- Now import the data set.
dataset = pd.read_csv('Social_Network_Ads.csv')
This is what the data set looks like.
- After you have imported the data set, go through it thoroughly and keep only the necessary columns. Here, columns 2 and 3 (Age and Estimated Salary) are the features and column 4 is the class to predict.
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
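Positional indices such as [2, 3] break silently if the column order ever changes. As a minimal sketch, the same selection can be written by column name. The column names Age, EstimatedSalary and Purchased, as well as the sample rows, are assumptions for illustration only; check them against your copy of the CSV.

```python
import pandas as pd

# Hypothetical stand-in rows; the column names are assumed, not read from the CSV.
dataset = pd.DataFrame({
    "User ID": [1, 2, 3, 4],
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 47],
    "EstimatedSalary": [19000, 20000, 43000, 25000],
    "Purchased": [0, 0, 0, 1],
})

# Select features and target by name instead of by position.
feature_columns = ["Age", "EstimatedSalary"]
X = dataset[feature_columns].to_numpy()
y = dataset["Purchased"].to_numpy()

# In this assumed layout, positional indexing gives the same feature matrix.
X_by_position = dataset.iloc[:, [2, 3]].to_numpy()
print(X.shape, y.shape)
```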
- Now split your data set into training and test sets. Commonly used split ratios are 7:3 and 8:2, and you can choose whichever suits your data. The code below holds out 25% of the rows for testing (test_size = 0.25).
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
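To see what the split actually produces, here is a small sketch on synthetic data (random numbers, not the real data set). It assumes a recent scikit-learn, where train_test_split is imported from sklearn.model_selection. The stratify argument is an optional extra that the steps above do not use; it keeps the class proportions roughly equal in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 400 rows, 2 features, imbalanced 0/1 labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (rng.random(400) < 0.35).astype(int)

# Hold out 25% for testing; stratify=y is optional and shown only as an illustration.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

print(X_train.shape, X_test.shape)    # sizes of the two parts
print(y_train.mean(), y_test.mean())  # share of class 1 in each part
```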
- This step feature-scales your data, so that a feature with values in the range 1000-20000 does not dominate a feature with values in the range 1-100. Note that the scaler is fitted on the training set only and then applied to the test set.
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
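The following sketch shows what StandardScaler does to a tiny made-up feature matrix (the numbers are illustrative, not taken from the data set). Each column of the training data ends up with mean 0 and standard deviation 1, and the test rows are transformed with the mean and standard deviation learned from the training rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values: column 0 is on an age-like scale, column 1 on a salary-like scale.
X_train = np.array([[20.0, 20000.0], [30.0, 60000.0], [40.0, 100000.0]])
X_test = np.array([[30.0, 60000.0], [50.0, 140000.0]])

sc = StandardScaler()
X_train_scaled = sc.fit_transform(X_train)  # learns mean/std from the training rows
X_test_scaled = sc.transform(X_test)        # reuses the training mean/std

print(sc.mean_)                      # per-column training means
print(X_train_scaled.mean(axis=0))   # approximately 0 for each column
print(X_train_scaled.std(axis=0))    # approximately 1 for each column
print(X_test_scaled)                 # both columns now on a comparable scale
```

Random forests split on thresholds, so they are largely insensitive to the scale of the features. In this tutorial, scaling mainly helps later on, because the plotting grid with a step of 0.01 stays small on scaled data.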
- Now comes the main task, i.e. fitting the classifier to the training set. First import the required class. Go through the documentation of RandomForestClassifier and understand the meaning and usage of each parameter. Here, for example, I have set the number of estimators (n_estimators) to 10; you can use more or fewer as per your requirement. Next, I have set the criterion to "entropy".
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier.fit(X_train, y_train)
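Once the classifier is fitted, a few attributes help to check what was built. The sketch below uses synthetic data from make_classification as a stand-in for the real training set (an assumption for illustration), with the same parameters as above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data: 400 rows, 2 informative features.
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

classifier = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
classifier.fit(X, y)

print(len(classifier.estimators_))      # number of fitted trees, equal to n_estimators
print(classifier.classes_)              # class labels seen during fitting
print(classifier.feature_importances_)  # relative importance per feature, sums to 1
```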
- Now apply the model to the test set and predict the test set results.
y_pred = classifier.predict(X_test)
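Besides hard class labels, the classifier can also report how confident the forest is. As a rough sketch on synthetic stand-in data (not the real data set), predict_proba returns one probability per class for each row, and predict returns the class with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data, used only to have something to predict on.
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
classifier = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
classifier.fit(X, y)

proba = classifier.predict_proba(X[:5])  # shape (5, 2): one column per class
y_pred = classifier.predict(X[:5])       # hard labels for the same rows

# The predicted label is the class with the largest averaged probability.
labels_from_proba = classifier.classes_[np.argmax(proba, axis=1)]
print(proba)
print(y_pred, labels_from_proba)
```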
- To evaluate the performance of your model, there are several metrics available, such as AUC, the ROC curve, the confusion matrix, etc. I have used the confusion matrix here.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

Confusion matrix of the test set predictions
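To read a confusion matrix, it helps to unpack it into its four cells. The sketch below uses ten made-up labels (not results from the data set) and derives accuracy, precision and recall from the matrix. For binary labels 0 and 1, scikit-learn puts the true classes on the rows and the predicted classes on the columns.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only.
y_test = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
y_pred = np.array([0, 0, 1, 0, 1, 0, 1, 0, 1, 0])

cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()  # row-major order: [[tn, fp], [fn, tp]]

accuracy = (tp + tn) / cm.sum()  # share of all predictions that are correct
precision = tp / (tp + fp)       # share of predicted positives that are correct
recall = tp / (tp + fn)          # share of actual positives that were found

print(cm)
print(accuracy, precision, recall)
```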
Visualizing the output – Random Forest Classification in Python
- The modelling itself ends here. Now we will visualize the training and test set results.
# Visualising the Training set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_train, y_train
# Grid covering the scaled feature range, with a margin of 1 on each side
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by the class the forest predicts there
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Training set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Visualization of the training set result
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
# Grid covering the scaled feature range, with a margin of 1 on each side
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
# Colour every grid point by the class the forest predicts there
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
plt.xlim(X1.min(), X1.max())
plt.ylim(X2.min(), X2.max())
# Overlay the actual observations, one colour per class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('Random Forest Classification (Test set)')
plt.xlabel('Age')
plt.ylabel('Estimated Salary')
plt.legend()
plt.show()

Visualization of the test set result
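The two plotting blocks differ only in the data and the title, so they can be folded into one function. The sketch below is one possible way to do that; the helper names decision_grid and plot_decision_regions are hypothetical, and the data is a synthetic stand-in for the scaled training set.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier


def decision_grid(classifier, X_set, step=0.01, margin=1.0):
    """Return mesh coordinates and the predicted class at every grid point."""
    x1 = np.arange(X_set[:, 0].min() - margin, X_set[:, 0].max() + margin, step)
    x2 = np.arange(X_set[:, 1].min() - margin, X_set[:, 1].max() + margin, step)
    X1, X2 = np.meshgrid(x1, x2)
    grid_points = np.column_stack([X1.ravel(), X2.ravel()])
    Z = classifier.predict(grid_points).reshape(X1.shape)
    return X1, X2, Z


def plot_decision_regions(classifier, X_set, y_set, title, step=0.01):
    """Draw the predicted regions and overlay the actual points (hypothetical helper)."""
    colors = ("red", "green")
    X1, X2, Z = decision_grid(classifier, X_set, step)
    plt.contourf(X1, X2, Z, alpha=0.75, cmap=ListedColormap(colors))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, label in enumerate(np.unique(y_set)):
        mask = y_set == label
        plt.scatter(X_set[mask, 0], X_set[mask, 1], color=colors[i], label=label)
    plt.title(title)
    plt.xlabel("Age")
    plt.ylabel("Estimated Salary")
    plt.legend()
    plt.show()


# Synthetic stand-in for the scaled training data.
X, y = make_classification(
    n_samples=400, n_features=2, n_informative=2, n_redundant=0, random_state=0
)
classifier = RandomForestClassifier(n_estimators=10, criterion="entropy", random_state=0)
classifier.fit(X, y)

# A coarser step keeps this check quick; the plots above use 0.01.
X1, X2, Z = decision_grid(classifier, X, step=0.1)
print(X1.shape, Z.shape)
```

With this in place, each plot becomes a single call, for example plot_decision_regions(classifier, X_train, y_train, 'Random Forest Classification (Training set)'). Keep in mind that a step of 0.01 is only practical on scaled features; on raw salary values the grid would become far too large.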
Feel free to post your questions in the comments.
You can also read:
https://www.codespeedy.com/understanding-support-vector-machine-svm/
In the next tutorial, I will discuss the implementation of regression using random forest.