Isolation Forest in Python using scikit-learn

Hey all! Today we are going to discuss Isolation Forest, an unsupervised anomaly detection technique that is often used for fraud detection. Rather than modelling what normal data looks like, it isolates unusual points directly, which sets it apart from most other models. So let’s start learning Isolation Forest in Python using scikit-learn.
The Isolation Forest technique builds a model from a small number of trees, each grown on a small, fixed-size sub-sample of the data, irrespective of the size of the dataset.
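In scikit-learn these two settings appear as the n_estimators and max_samples parameters of IsolationForest. Here is a minimal sketch on synthetic data (an assumption for illustration, not the credit card dataset) showing that each tree is fitted on a fixed-size sub-sample however many rows go in:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data: 10,000 rows, 4 features (not the credit card dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 4))

# 100 trees, each built from a random sub-sample of 256 rows.
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
forest.fit(X)

n_trees = len(forest.estimators_)   # fitted trees
rows_per_tree = forest.max_samples_  # sub-sample size actually used
print("trees:", n_trees)
print("rows per tree:", rows_per_tree)
```

The values 100 and 256 are illustrative choices; the sub-sample size stays the same even if X grows to millions of rows.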

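The algorithm works by building isolation trees, which are random decision trees that repeatedly split the data on a randomly chosen feature at a randomly chosen value. Outliers are few and different from the rest, so they tend to be separated after fewer splits. The anomaly score is then calculated from the path length, that is, the number of splits needed to isolate an observation: short paths point to outliers and long paths to normal observations.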
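To make the path-length idea concrete, here is a simplified sketch written with NumPy only. It is not scikit-learn's implementation: the function name isolation_path_length, the synthetic data and the depth cap are all assumptions made for illustration. It counts how many random splits are needed before a point is alone in its partition.

```python
import numpy as np

def isolation_path_length(data, point, rng, max_depth=50):
    """Count random splits until `point` is the only row left (hypothetical helper)."""
    depth = 0
    while len(data) > 1 and depth < max_depth:
        feature = rng.integers(data.shape[1])            # pick a random feature
        lo, hi = data[:, feature].min(), data[:, feature].max()
        if lo == hi:                                     # nothing left to split on
            break
        split = rng.uniform(lo, hi)                      # pick a random split value
        # Keep only the rows that fall on the same side of the split as `point`.
        if point[feature] < split:
            data = data[data[:, feature] < split]
        else:
            data = data[data[:, feature] >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
cluster = rng.normal(0.0, 1.0, size=(256, 2))   # synthetic "normal" points
outlier = np.array([8.0, 8.0])                   # one point far from the cluster
data = np.vstack([cluster, outlier])
inlier = cluster[0]

# Average over many random trees, as a forest would.
outlier_depth = np.mean([isolation_path_length(data, outlier, rng) for _ in range(200)])
inlier_depth = np.mean([isolation_path_length(data, inlier, rng) for _ in range(200)])
print("mean path length, outlier:", outlier_depth)
print("mean path length, inlier: ", inlier_depth)
```

The far-away point should need noticeably fewer splits on average than a point inside the cluster, which is exactly the signal the anomaly score is built on.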

Let’s start coding with the Isolation Forest algorithm in Python.

IsolationForest example

The dataset we use here contains credit card transactions. The column ‘Class’ takes the value ‘1’ for a fraudulent transaction and ‘0’ for a valid one.

Download dataset required for the following code.

This is going to be an example of fraud detection with Isolation Forest in Python with scikit-learn.

Example of fraud detection with Isolation Forest

Let’s import all required libraries and packages.

import pandas as pd
import sklearn
from sklearn.metrics import accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

Read the dataset, which is stored in .csv format, into a pandas DataFrame.

dt = pd.read_csv("creditcard.csv")
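Before splitting, it can help to check how unbalanced the labels are. A minimal sketch follows, using a tiny made-up frame in place of creditcard.csv so that it runs on its own; only the ‘Class’ column name comes from the description above, while the ‘Amount’ column and every value are invented for illustration.

```python
import pandas as pd

# Tiny made-up stand-in for creditcard.csv. Only the 'Class' column name is
# taken from the article; the 'Amount' column and all values are invented.
sample = pd.DataFrame({
    "Amount": [12.5, 80.0, 3.2, 950.0, 41.7, 7.9],
    "Class":  [0, 0, 0, 1, 0, 0],
})

class_counts = sample["Class"].value_counts()  # rows per label
fraud_share = sample["Class"].mean()           # fraction of rows labelled 1
print(class_counts)
print("fraud share:", round(fraud_share, 3))
```

On the real data the same two lines can be run on dt instead of sample.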

Generate train and test data.

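# Split the rows by label, then drop the label so the model never sees it.
Valid = dt[dt.Class == 0]
Valid = Valid.drop(['Class'], axis=1)
Fraud = dt[dt.Class == 1]
Fraud = Fraud.drop(['Class'], axis=1)
# Train on valid transactions only and hold out 30% of them for testing.
Valid_train, Valid_test = train_test_split(Valid, test_size=0.30, random_state=42)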

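Model prediction: now we start building the model, using the Isolation Forest algorithm on this dataset. The ‘behaviour’ argument seen in older examples has been removed from recent scikit-learn releases, so it is left out here.

# 100 trees; the seed 42 is an arbitrary choice that makes the run repeatable.
model = IsolationForest(n_estimators=100, random_state=42)

Fit the model on the valid training data and perform predictions on the test data.

model.fit(Valid_train)
Valid_pred = model.predict(Valid_test)
# No fraud rows were used for training, so all of them serve as test data.
Fraud_pred = model.predict(Fraud)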
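It helps to know what predict returns. Here is a small sketch on synthetic data (the points, the seed and the name demo_model are assumptions for illustration) showing the +1 / -1 labels and the underlying score_samples values, where a lower score means a more anomalous point:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
normal_points = rng.normal(0.0, 1.0, size=(500, 2))  # synthetic "valid" data
far_points = np.array([[9.0, 9.0], [-10.0, 8.0]])     # points far from that data

demo_model = IsolationForest(n_estimators=100, random_state=1)
demo_model.fit(normal_points)

labels = demo_model.predict(far_points)        # +1 = inlier, -1 = outlier
scores = demo_model.score_samples(far_points)  # lower = more anomalous
typical = demo_model.score_samples(normal_points).mean()

print("labels for far points:", labels)
print("scores for far points:", scores)
print("mean score of training points:", typical)
```

This is why the accuracy lines count 1 for valid cases and -1 for fraud cases.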

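Finally, it’s time to measure how well the model separates valid and fraud cases: the share of valid test cases predicted as inliers (1) and the share of fraud cases predicted as outliers (-1).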

print("Valid cases accuracy:", list(Valid_pred).count(1)/Valid_pred.shape[0])
print("Fraud Cases accuracy:", list(Fraud_pred).count(-1)/Fraud_pred.shape[0])
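The two ratios above are per-class detection rates (recall) rather than one overall accuracy. The same numbers can be obtained from sklearn.metrics once the +1 / -1 predictions are mapped to the dataset's 0 / 1 labels. A sketch with made-up prediction arrays, not results from the credit card data:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up predictions in IsolationForest's convention (+1 inlier, -1 outlier).
valid_pred = np.array([1, 1, -1, 1, 1, 1, 1, -1])
fraud_pred = np.array([-1, 1, 1, -1])

# Map to the dataset's convention: 0 = valid, 1 = fraud.
y_true = np.concatenate([np.zeros(len(valid_pred), dtype=int),
                         np.ones(len(fraud_pred), dtype=int)])
y_pred = (np.concatenate([valid_pred, fraud_pred]) == -1).astype(int)

fraud_recall = recall_score(y_true, y_pred)               # share of fraud rows flagged
valid_recall = recall_score(y_true, y_pred, pos_label=0)  # share of valid rows kept
matrix = confusion_matrix(y_true, y_pred)                 # rows: true, columns: predicted

print("fraud recall:", fraud_recall)
print("valid recall:", valid_recall)
print(matrix)
```

The confusion matrix also shows how many valid transactions were flagged by mistake, which matters when fraud is rare.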

Output

Valid cases accuracy: 0.89568
Fraud Cases accuracy: 0.100

 

Observations :

  • Isolation Forest labels 89.57% of the valid test cases as valid, but it flags only 10% of the fraud cases as anomalies.
  • We may be able to improve these rates by varying the size of the train and test data or by using deep learning algorithms.
