Introduction to the XGBoost algorithm

In this tutorial, we are going to discuss a very important algorithm that is in much demand nowadays: the XGBoost algorithm. It has been gaining a lot of attention lately, and this tutorial will serve as an introduction to it.

Introduction to the XGBoost algorithm:

In almost every competition I have followed, I have noticed that a large majority of winners (roughly 24 out of 30, in my observation) used the XGBoost algorithm in their models. We often spend most of our time trying to increase the accuracy of our model, and switching to a better algorithm like XGBoost is one way to do that.

XGBoost is short for Extreme Gradient Boosting. It implements the gradient boosted decision tree algorithm; if you are not familiar with it, please give gradient boosted decision trees a read first. You can download the XGBoost library on your machine to work with it. Benchmarks show it is faster than most comparable implementations. You can work with XGBoost in R, Python, Java, C++, etc.
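For instance, in Python the library can be installed from PyPI and imported directly. A minimal sketch, assuming a standard pip setup:

```python
# Install once from the command line:
#   pip install xgboost

import xgboost as xgb

print(xgb.__version__)  # verify the installation
```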

A prerequisite for running XGBoost is that the data must be vectorised and numeric. In brief, instead of relying on the prediction of a single decision tree, it combines the predictions of several decision trees. By default, XGBoost grows each tree level by level (depth-wise), but it also supports a leaf-wise (loss-guided) growth strategy that always expands the split which reduces the loss function the most. Although the leaf-wise strategy can make the model more susceptible to overfitting, it often reaches a lower loss faster. A big reason for XGBoost's success is that it can be used effectively without deep knowledge of its internals.
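To illustrate the numeric, vectorised input requirement, here is a small sketch. The toy columns and values are made up for illustration; `xgb.DMatrix` is XGBoost's actual data structure:

```python
import numpy as np
import xgboost as xgb

# XGBoost needs a numeric matrix, so categorical values must be
# encoded as numbers first (here gender is already mapped to 0/1).
X = np.array([[25.0, 0],
              [32.0, 1],
              [47.0, 0],
              [51.0, 1]])          # columns: age, gender_code
y = np.array([0, 1, 1, 0])         # binary labels

dtrain = xgb.DMatrix(X, label=y)   # XGBoost's internal data structure
print(dtrain.num_row(), dtrain.num_col())  # 4 2
```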

Parameters of the XGBoost function:

These are the most commonly used parameters that you should be acquainted with (a short training sketch using them follows the list):

  1. eta – The default value for eta is 0.3. It is the learning rate: after each boosting step, the weights of newly added features are shrunk by eta, which makes the boosting process more conservative. The lower the value of eta, the more robust the model is to overfitting (at the cost of needing more boosting rounds).
  2. gamma – The default value for gamma is 0. It specifies the minimum loss reduction that a split must achieve before a parent node is split into child nodes in the decision tree.
  3. max_depth – This parameter specifies the maximum depth of a tree. The default value is 6, and it can take values from 1 upwards (deeper trees capture more complex interactions but overfit more easily).
  4. subsample – This parameter specifies the fraction of the training set to sample. In other words, it defines the percentage of rows that are drawn randomly to grow each tree. The default value is 1 (use all rows).
  5. lambda and alpha – These are the L2 and L1 regularisation terms on the leaf weights, respectively. The default value for lambda is 1 and that for alpha is 0.
  6. eval_metric – This parameter specifies the evaluation metric to use for your problem statement (for example, rmse for regression or logloss for classification).
  7. booster – This parameter takes the value “gbtree” (tree-based implementation) or “gblinear” (linear-function implementation).
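Putting these together, here is a minimal training sketch in Python. The synthetic data and the binary:logistic objective are my own choices for illustration; the parameter names are the ones listed above:

```python
import numpy as np
import xgboost as xgb

# Synthetic binary-classification data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

dtrain = xgb.DMatrix(X[:150], label=y[:150])
dvalid = xgb.DMatrix(X[150:], label=y[150:])

params = {
    "booster": "gbtree",          # tree-based implementation
    "eta": 0.3,                   # learning rate (default)
    "gamma": 0,                   # minimum loss reduction to split
    "max_depth": 6,               # maximum tree depth (default)
    "subsample": 1,               # fraction of rows per tree
    "lambda": 1,                  # L2 regularisation (default)
    "alpha": 0,                   # L1 regularisation (default)
    "objective": "binary:logistic",
    "eval_metric": "logloss",
}

# Train 50 boosting rounds, printing the validation metric each round.
bst = xgb.train(params, dtrain, num_boost_round=50,
                evals=[(dvalid, "valid")])
```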

Advantages:

The following advantages make it so much in demand:

  1. It is fast.
  2. It is accurate.
  3. Easy-to-use implementations are available in many languages.
  4. It can report which features are the most important (see the sketch after this list).
  5. It can perform parallel computation using all available CPU cores.
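As a quick illustration of point 4, a trained booster can report per-feature importance scores. Continuing from the training sketch above (the "gain" importance type is one of several that XGBoost supports):

```python
# Gain-based importance: average loss reduction from splits on each feature.
importance = bst.get_score(importance_type="gain")
print(importance)   # e.g. {'f0': ..., 'f1': ...} for unnamed features
```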

Disadvantages:

  1. Training can sometimes be slow, especially on very large datasets.
  2. It is susceptible to overfitting if its hyperparameters are not tuned carefully.

With this, I end the post. Feel free to post your doubts in the comments.

You can also read Natural Language Processing and its implementation in Python
