Introduction to Random Forest algorithm
In previous tutorials, I discussed the Introduction to Natural Language Processing, the Apriori algorithm and the Hierarchical clustering algorithm. In this tutorial, we will discuss an algorithm that can be used for both regression and classification: Random Forest. In this post, however, I will cover random forest for classification only. In short, this post is an introduction to the random forest algorithm. Before going further, note that you should be familiar with the concept of decision trees.
You may also be interested in learning this:
- Implementation of Random Forest for classification in Python
- Random forest for regression and its implementation in Python
Random Forest algorithm
Random Forest is one of the best-known supervised learning algorithms and a popular ensemble learning method. As the name suggests, the algorithm creates a forest out of a large number of trees. In general, the more trees there are, the more robust the algorithm is, although the gains level off beyond a certain number of trees. I am assuming you all know the decision tree algorithm. If you are thinking that this algorithm simply builds the same decision tree many times, that is not the case: a random forest does build many decision trees, but randomness is injected so that each tree is different.
In a decision tree, we use measures such as information gain or the Gini index to choose the root node, and we keep splitting the dataset until we are left with leaf nodes (for example, the answer “yes” or “no”). In a random forest, randomness is added to this process: each tree is typically trained on a random sample of the data and considers only a random subset of the features when choosing a split. The root node is found among these randomly chosen features, and the data is split on that basis.
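To make the split measures mentioned above concrete, here is a minimal sketch of the Gini index and information gain in plain Python. The function names (gini, entropy, information_gain) and the tiny "yes"/"no" label lists are my own illustrative assumptions, not part of any particular library.

```python
from collections import Counter
from math import log2

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy of the class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

# Hypothetical node with two "yes" and two "no" records
parent = ["yes", "yes", "no", "no"]
print(gini(parent))                                             # 0.5
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))   # 1.0
```

A split that separates the classes perfectly, as in the last line, gives the highest possible information gain for this node, which is why a decision tree would prefer it.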
Example to understand the algorithm
Although this is a small example, it should help you understand the concept. Kushagra wants to buy a shirt for himself. He is a little confused about which one to pick from the Black, Green and Yellow shirts, so he asks his friends Kirti, Saransh and Manik for advice. In this case, there are three categories: the three shirts. This example illustrates both the decision tree and the random forest.
Decision Tree concept:
In the first case, Kushagra asks only his best friend, Saransh. Saransh asks him some questions and, on the basis of the answers, suggests that he buy the Yellow shirt. Here, Kushagra’s best friend is the decision tree and the vote (buying a shirt) is the leaf node of the decision tree (the target class). Since the shirt is decided by only one person, in a technical sense we can say the output is given by one decision tree.
Random Forest concept:
In this case, he also takes advice from his other friends, Kirti and Manik. Kirti asks him a few questions. On the basis of the answers, Kirti frames some rules and uses them to suggest a shirt. Similarly, the others also question him and frame their own rules to make a suggestion. Kushagra then combines all the suggestions from his friends (the forest is built by combining all the trees). If one friend suggests exactly what another suggested, he simply increases the count for that shirt. On the basis of the maximum votes, he decides which shirt to buy.
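The vote counting in this story can be sketched in a few lines of Python. Only Saransh's suggestion (Yellow) comes from the example; the suggestions for Kirti and Manik, and the helper name majority_vote, are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Each friend plays the role of one tree in the forest.
# Kirti's and Manik's answers are made up for this sketch.
suggestions = {"Saransh": "Yellow", "Kirti": "Yellow", "Manik": "Black"}

def majority_vote(votes):
    """Return the most common vote together with its count."""
    return Counter(votes).most_common(1)[0]

shirt, count = majority_vote(suggestions.values())
print(shirt, count)  # Yellow 2
```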
Pseudocode for the algorithm:
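- First, we will select “x” random features from the total “y” features, where “x” is smaller than “y”.
- Among these “x” features, we will find the feature and split point that give the best split; for the first split, this is the root node.
- Using the best split, we will split the node into two child nodes.
- We will repeat the first three steps on the child nodes until the tree is grown, and repeat the whole process until “n” trees are created.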
- To perform the prediction, we will pass each record of the testing data set through every tree.
- We will use each tree’s set of rules to predict the result and store each output in some variable.
- We will count the votes for each of the predicted results.
- The predicted result with the maximum votes will be the final result.
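The steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not a production implementation: the function names are my own, each tree is trained on a bootstrap sample (standard bagging, which the steps do not spell out), the “x” random features are re-drawn at every node, candidate thresholds are midpoints between neighbouring values, and the toy data is made up. Labels are assumed to be strings.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Among the given features, return (score, feature, threshold) with lowest weighted Gini."""
    best = None
    for f in features:
        values = sorted({r[f] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # assumption: midpoint thresholds
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, x, rng, depth=0, max_depth=5):
    """Grow one tree; every node considers x randomly selected features."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority  # leaf node: predicted class
    features = rng.sample(range(len(rows[0])), x)  # step 1: pick x of y features
    split = best_split(rows, labels, features)     # step 2: find the node
    if split is None:
        return majority
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]   # step 3: split in two
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            build_tree([r for r, _ in left], [y for _, y in left], x, rng, depth + 1, max_depth),
            build_tree([r for r, _ in right], [y for _, y in right], x, rng, depth + 1, max_depth))

def predict_tree(tree, row):
    """Follow the tree's rules until a leaf (a class label) is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def build_forest(rows, labels, n, x, seed=0):
    """Repeat the tree-building steps until n trees are created."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n):
        # Assumption: bootstrap sample (draw with replacement) for each tree
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [labels[i] for i in idx], x, rng))
    return forest

def predict_forest(forest, row):
    """Collect one vote per tree and return the class with the most votes."""
    votes = Counter(predict_tree(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Made-up toy data: two numeric features, two shirt colours
rows = [[1.0, 5.0], [1.5, 4.5], [2.0, 5.5], [6.0, 1.0], [6.5, 1.5], [7.0, 0.5]]
labels = ["Yellow", "Yellow", "Yellow", "Black", "Black", "Black"]

forest = build_forest(rows, labels, n=25, x=1)
print(predict_forest(forest, [1.2, 5.0]))
print(predict_forest(forest, [6.8, 1.0]))
```

On this cleanly separated toy data, the first test record should be voted into the same class as the first three training rows and the second into the class of the last three. In practice you would use a tested library implementation rather than a hand-written one like this.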
Advantages of Random Forest:
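- This algorithm can be used for both regression and classification.
- It is generally more robust than a single decision tree, because the final result combines the votes of many trees.
- It is less prone to overfitting than a single decision tree, and we can apply this algorithm to data with categorical values.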
I hope you now understand the basics of random forest. In further tutorials, I will discuss its implementation in Python.
Feel free to ask your questions in the comments.