Isolation Forest in Python using scikit-learn
Hey all! Today we are going to discuss a popular technique for fraud detection, known as Isolation Forest. Unlike most anomaly detection models, which first build a profile of normal data, this algorithm isolates the anomalies directly. So let’s start learning Isolation Forest in Python using scikit-learn.
The Isolation Forest technique builds a model from a small number of trees, each grown on a small, fixed-size sub-sample of the data, irrespective of the size of the dataset.
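To make the "fixed-size sub-sample" idea concrete, here is a minimal sketch. It uses synthetic random data as a stand-in (an assumption for illustration, not the credit card dataset) and sets `n_estimators` and `max_samples` explicitly so the number of trees and the sub-sample size are visible.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in data (assumption): 5000 rows, 4 features
rng = np.random.RandomState(0)
X = rng.normal(size=(5000, 4))

# 100 trees, each grown on a sub-sample of 256 rows, whatever the dataset size
forest = IsolationForest(n_estimators=100, max_samples=256, random_state=0)
forest.fit(X)

n_trees = len(forest.estimators_)        # number of fitted trees
samples_per_tree = forest.max_samples_   # rows actually drawn per tree
print(n_trees, samples_per_tree)
```

Even with 5000 rows, each tree only ever sees 256 of them, which is what keeps training cheap on large datasets.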
The algorithm works by building isolation trees (random decision trees) that repeatedly split the data on a randomly chosen feature and split value. Outliers tend to be separated after only a few splits, so the anomaly score is calculated from the path length needed to isolate each observation: short paths point to outliers, long paths to normal observations.
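The link between path length and anomaly score can be sketched in a few lines. This follows the scoring formula from the original Isolation Forest paper, s = 2^(-E[h] / c(n)), where E[h] is the mean path length over the trees and c(n) normalises by the sub-sample size n. The function names are ours, chosen for illustration, and the harmonic number is approximated by ln(i) + Euler's constant.

```python
import math

EULER_GAMMA = 0.5772156649

def average_path_length(n):
    """c(n): normalising constant for a sub-sample of n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """Score near 1: likely outlier. Score well below 0.5: likely normal."""
    return 2.0 ** (-mean_path_length / average_path_length(n))

n = 256  # sub-sample size
short_path_score = anomaly_score(2.0, n)                       # isolated quickly
typical_path_score = anomaly_score(average_path_length(n), n)  # average depth
long_path_score = anomaly_score(20.0, n)                       # hard to isolate
print(short_path_score, typical_path_score, long_path_score)
```

Note that scikit-learn's `score_samples` returns the opposite of this score, so there lower values mean more abnormal.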
Let’s start coding with the Isolation Forest algorithm in Python.
The dataset we use here contains credit card transactions. The column ‘Class’ takes the value ‘1’ in case of fraud and ‘0’ for a valid case.
Download the dataset required for the following code.
This is going to be an example of fraud detection with Isolation Forest in Python with scikit-learn.
Example of fraud detection with Isolation Forest
Let’s import all required libraries and packages.
import pandas as pd
import sklearn
from sklearn.metrics import accuracy_score
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split
Read the dataset into our program from the .csv file.
dt = pd.read_csv("creditcard.csv")
Generate train and test data.
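# Separate valid and fraudulent transactions, then drop the label column
Valid = dt[dt.Class==0]
Valid = Valid.drop(['Class'], axis=1)
Fraud = dt[dt.Class==1]
Fraud = Fraud.drop(['Class'], axis=1)
Valid_train, Valid_test = train_test_split(Valid, test_size=0.30, random_state=42)
# Fraud_test is needed for prediction later; this split is assumed to mirror the valid one
Fraud_train, Fraud_test = train_test_split(Fraud, test_size=0.30, random_state=42)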
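Model prediction: Now we start building the model. We use the Isolation Forest algorithm on this dataset.
# The 'behaviour' argument was deprecated in scikit-learn 0.22 and removed in 0.24, so it is omitted.
# random_state is set to a fixed integer for reproducibility (the original 'state' variable was never defined).
dt1 = IsolationForest(n_estimators=100, random_state=42)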
Fit the model and perform predictions using test data.
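# 'behaviour' is not passed: recent scikit-learn versions no longer accept it
model = IsolationForest()
# Train on valid transactions only
model.fit(Valid_train)
# predict returns 1 for inliers and -1 for outliers
Valid_pred = model.predict(Valid_test)
Fraud_pred = model.predict(Fraud_test)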
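To see what `predict` hands back without needing the credit card file, here is a small sketch on synthetic data. The Gaussian cluster and the three far-away points are made up for illustration, and the variable names are hypothetical. The model is trained on "normal" points only, as in the steps above, and then scores both unseen normal points and obvious outliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
# Synthetic stand-ins (assumption): a normal cluster and a few far-away points
X_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
X_normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X_far = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])

model = IsolationForest(n_estimators=100, random_state=42)
model.fit(X_train)

normal_pred = model.predict(X_normal)        # array of 1 (inlier) / -1 (outlier)
far_pred = model.predict(X_far)
far_scores = model.decision_function(X_far)  # negative values mean outlier

inlier_rate = float(np.mean(normal_pred == 1))
print("Share of normal points kept as inliers:", inlier_rate)
print("Predictions for far points:", far_pred)
```

Some normal points are still flagged as outliers, which mirrors the less-than-100% score for valid cases in this tutorial.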
Finally, it’s time to get the accuracy scores, to see how well the model detects valid and fraud cases.
# shape is a tuple, so take its first element to get the number of predictions
print("Valid cases accuracy:", list(Valid_pred).count(1)/Valid_pred.shape[0])
print("Fraud Cases accuracy:", list(Fraud_pred).count(-1)/Fraud_pred.shape[0])
Valid cases accuracy: 0.89568
Fraud Cases accuracy: 0.100
- Isolation Forest correctly identifies about 89.6% of the valid cases in the test data, but flags only about 10% of the fraud cases.
- We may be able to improve these results by varying the size of the train and test data, or by using deep learning algorithms.
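The two per-class ratios can also be combined into a single evaluation with `sklearn.metrics`. The sketch below uses tiny made-up prediction arrays (not results from the credit card data) and a hypothetical helper, `to_class_labels`, that maps Isolation Forest output (1 / -1) to the dataset's ‘Class’ convention (0 = valid, 1 = fraud).

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def to_class_labels(pred):
    """Map Isolation Forest output (1 inlier, -1 outlier) to 0 = valid, 1 = fraud."""
    return np.where(np.asarray(pred) == -1, 1, 0)

# Made-up predictions for illustration only
valid_pred = np.array([1, 1, -1, 1])   # predictions on 4 valid transactions
fraud_pred = np.array([-1, 1])         # predictions on 2 fraud transactions

y_true = np.concatenate([np.zeros(len(valid_pred), dtype=int),
                         np.ones(len(fraud_pred), dtype=int)])
y_pred = to_class_labels(np.concatenate([valid_pred, fraud_pred]))

overall_accuracy = accuracy_score(y_true, y_pred)
# Rows are true classes (valid, fraud), columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print(overall_accuracy)
print(cm)
```

With heavily imbalanced data such as fraud records, the confusion matrix is usually more informative than the overall accuracy, since a model can score high accuracy while missing most fraud cases.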