Intrusion Detection model using Machine Learning algorithm in Python

The internet is the world’s marketplace. Almost any business that wants to grow needs a computer network, and connecting to the internet greatly expands its reach, value, and effectiveness. However, once a business is connected to a network, security becomes a critical concern because its data is exposed to attack by malicious users. This is where an Intrusion Detection System (IDS) proves highly beneficial. This article explains how Machine Learning algorithms can increase the effectiveness of an IDS. Before implementing an ML algorithm to build one, let us first understand what an Intrusion Detection System is.

What is an IDS?

An Intrusion Detection System (IDS) can be a device or a software application that works with your network to keep it secure and notifies you when somebody tries to break into your system. It monitors network traffic to search for suspicious activities and known threats.

Types of IDS

A wide variety of IDS is available nowadays. The most common classifications include:

  • Network Intrusion Detection Systems (NIDS)
  • Host-based Intrusion Detection Systems (HIDS)
  • Signature-based Intrusion Detection Systems
  • Anomaly-based Intrusion Detection Systems

To learn more about Intrusion Detection Systems, refer to barracuda.com.

IDS using Machine Learning

Machine Learning is the field of study that gives computers the ability to learn from experience and improve without being explicitly programmed. ML algorithms can be classified into three main categories namely:

  • Supervised machine learning algorithms:
    Here, the training data is labeled, i.e. each training example already comes with the correct answer.
  • Unsupervised machine learning algorithms:
    Here, the training data is unlabeled. The algorithm on its own tries to identify certain patterns or clusters in the data.
  • Semi-supervised machine learning algorithms:
    Here, a part of the data is labeled but most of it is unlabeled and a combination of supervised and unsupervised algorithms can be applied.

Unsupervised machine learning algorithms can learn the normal traffic pattern of the network and report suspicious activities on their own, without requiring a labeled dataset. They can detect new types of intrusions, but they are highly prone to false-positive alarms. To reduce the number of false positives, we use supervised machine learning algorithms, which handle known attacks efficiently and can also recognize variations of those attacks.
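
To make the unsupervised idea concrete, here is a minimal sketch that learns what "normal" looks like from unlabeled data and then scores unseen points. IsolationForest and the synthetic two-feature "traffic" are assumptions chosen for illustration; they are not the method or the data used later in this article.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for unlabeled "normal" traffic (two made-up features)
rng = np.random.default_rng(0)
normal_traffic = rng.normal(loc=0.0, scale=1.0, size=(500, 2))

# A few hypothetical points that lie far away from the normal pattern
suspicious = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -10.0]])

# Fit on unlabeled data only; no attack labels are needed
detector = IsolationForest(random_state=0).fit(normal_traffic)

# score_samples: the lower the score, the more anomalous the point
normal_scores = detector.score_samples(normal_traffic)
suspicious_scores = detector.score_samples(suspicious)

# predict returns -1 for points flagged as anomalies and 1 otherwise
flags = detector.predict(suspicious)

print('Median score of normal traffic :', np.median(normal_scores))
print('Scores of suspicious points    :', suspicious_scores)
print('Flags for suspicious points    :', flags)
```

Where the alarm threshold sits is a tuning decision, and that is exactly where the false-positive problem mentioned above comes from.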

Building an IDS using Machine Learning

Dataset

Here, we will implement an Intrusion Detection model using one of the supervised ML algorithms. The dataset used is the KDD Cup 1999 computer network intrusion detection dataset. It has a total of 42 columns: 41 features plus the target variable named label. The target variable has 23 classes: 22 types of attack plus normal, which marks regular traffic.

CLASS NAME       NUMBER OF INSTANCES
—————————————————————————————————————
smurf                 280790
neptune               107201
normal                 97277
back                    2203
satan                   1589
ipsweep                 1247
portsweep               1040
warezclient             1020
teardrop                 979
pod                      264
nmap                     231
guess_passwd              53
buffer_overflow           30
land                      21
warezmaster               20
imap                      12
rootkit                   10
loadmodule                 9
ftp_write                  8
multihop                   7
phf                        4
perl                       3
spy                        2
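
The table shows how unevenly the classes are distributed. The sketch below copies the counts from the table into a pandas Series and summarizes the imbalance; the cut-off of 20 instances for a "rare" class is an arbitrary choice made for illustration.

```python
import pandas as pd

# Class counts copied from the table above
class_counts = pd.Series({
    'smurf': 280790, 'neptune': 107201, 'normal': 97277, 'back': 2203,
    'satan': 1589, 'ipsweep': 1247, 'portsweep': 1040, 'warezclient': 1020,
    'teardrop': 979, 'pod': 264, 'nmap': 231, 'guess_passwd': 53,
    'buffer_overflow': 30, 'land': 21, 'warezmaster': 20, 'imap': 12,
    'rootkit': 10, 'loadmodule': 9, 'ftp_write': 8, 'multihop': 7,
    'phf': 4, 'perl': 3, 'spy': 2,
})

total = class_counts.sum()

# Share of all rows taken by the three largest classes
top3_share = class_counts.nlargest(3).sum() / total

# Classes with fewer than 20 instances (arbitrary threshold)
rare_classes = class_counts[class_counts < 20].index.tolist()

print(f'Total instances          : {total}')
print(f'Share of top 3 classes   : {top3_share:.2%}')
print(f'Classes with < 20 samples: {rare_classes}')
```

On the loaded DataFrame, the same counts should be obtainable with dataset['label'].value_counts(), assuming the target column is indeed named label in your copy of the CSV file.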

Code

We first initialize ‘X’, the set of independent variables (features), and ‘y’, the target variable. The dataset is then split into training and test sets. The test set, i.e. the data that the model does not see during the training phase, lets us estimate the model accuracy.

# Importing the required libraries
import pandas as pd
import numpy as np

# Importing the KDDCup99 dataset
dataset = pd.read_csv('KDDCup99.csv')

# X holds the 41 feature columns, y holds the last column (label)
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 41:42].values

# Splitting the dataset into training and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
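
Because some classes have only a handful of instances, a purely random split can leave a rare class out of the training or the test set. One optional variation, not used in the code above, is to pass stratify to train_test_split so that class proportions are roughly preserved on both sides. The sketch below shows this on made-up labels; the names X_demo and y_demo are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 'normal', 8 'neptune', 2 'spy'
y_demo = np.array(['normal'] * 90 + ['neptune'] * 8 + ['spy'] * 2)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# stratify=y_demo keeps the class proportions similar in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo)

print('Classes in training set:', sorted(set(y_tr)))
print('Classes in test set    :', sorted(set(y_te)))
```

Stratification needs at least two instances of every class, so it would not work on a dataset that contains a class with a single row.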

One-hot encoding is then applied to the categorical columns of ‘X’ using ColumnTransformer and OneHotEncoder from the scikit-learn library. As the values of the target variable ‘y’ are strings, we apply label encoding to assign an integer value to each category of ‘y’.

# Data preprocessing

# Applying ColumnTransformer to the categorical columns of X_train and X_test
# (columns 1, 2 and 3 hold the categorical features)
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
# handle_unknown = 'ignore' avoids an error if the test set contains a category not seen in training
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(handle_unknown = 'ignore'), [1, 2, 3])], remainder = 'passthrough')
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)

# Encoding y_train and y_test as integer class labels
# ravel() flattens the (n, 1) column into the 1-D array that scikit-learn expects
from sklearn.preprocessing import LabelEncoder
le_y = LabelEncoder()
y_train = le_y.fit_transform(y_train.ravel())
y_test = le_y.transform(y_test.ravel())
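
To see what the ColumnTransformer step produces, here is a small sketch on three made-up rows with the same layout idea: categorical values in columns 1, 2 and 3, and numeric values elsewhere. The rows are invented for illustration, and sparse_threshold = 0 is set only to force a dense array that is easy to print.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three made-up rows: [duration, protocol, service, flag, bytes]
X_toy = np.array([
    [0, 'tcp', 'http', 'SF', 181],
    [0, 'udp', 'domain_u', 'SF', 105],
    [0, 'tcp', 'smtp', 'REJ', 0],
], dtype=object)

ct_toy = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [1, 2, 3])],
    remainder='passthrough',
    sparse_threshold=0)  # always return a dense array (for printing only)

X_enc = ct_toy.fit_transform(X_toy)

# One-hot columns come first (2 protocols + 3 services + 2 flags = 7),
# followed by the untouched numeric columns (duration, bytes)
print(X_enc.shape)
print(X_enc[0])
```

Note that the encoded columns are moved to the front of the array and the passthrough columns follow them, so column positions after the transformation no longer match the original dataset.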

We implement a RandomForestClassifier which is a form of ensemble learning where a number of decision trees are combined. It aggregates the votes of its decision trees to decide the final class of the test object.

# Implementing RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators = 150, n_jobs = -1)
classifier.fit(X_train, y_train)
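
The aggregation step can be checked by hand. In scikit-learn, RandomForestClassifier combines its trees by averaging their predicted class probabilities and picking the class with the highest average, which is a soft form of voting. The sketch below verifies this on synthetic data; make_classification and the names forest, X_demo and y_demo are assumptions used only for this demonstration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic 3-class problem (not the KDD data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5,
    n_classes=3, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Average the class probabilities predicted by every individual tree
tree_probas = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0)

# The class with the highest averaged probability is the forest's prediction
manual_pred = forest.classes_[np.argmax(tree_probas, axis=1)]

agrees = np.array_equal(manual_pred, forest.predict(X_demo))
print('Manual aggregation matches forest.predict:', agrees)
```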

Finally, the classifier predicts the results of the test set. To evaluate the model accuracy, we compute the confusion matrix; the sum of its diagonal elements is the total number of correct predictions.

               TOTAL NO. OF CORRECT PREDICTIONS (SUM OF THE DIAGONAL ELEMENTS OF THE CONFUSION MATRIX)
ACCURACY (%) = --------------------------------------------------------------------------------------- x 100
               TOTAL NO. OF PREDICTIONS (SUM OF ALL THE ELEMENTS OF THE CONFUSION MATRIX)

# Making predictions on the test set
y_pred = classifier.predict(X_test)

# Evaluating the predicted results
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Calculating the total correct predictions (sum of the diagonal of the confusion matrix)
total_correct_predictions = np.trace(cm)

# Calculating the model accuracy
accuracy = (total_correct_predictions / np.sum(cm)) * 100
print(f'Accuracy obtained on this test set : {accuracy:.2f} %')

OUTPUT :

Accuracy obtained on this test set : 99.98 %
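
With classes as unbalanced as in this dataset, overall accuracy is dominated by the largest classes, so it can be worth looking at each class separately as well. The sketch below uses a small made-up set of predictions to show that the diagonal-sum formula agrees with scikit-learn's accuracy_score, and how per-class recall can be read from the same confusion matrix. The arrays y_true_demo and y_pred_demo are invented for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for three classes
y_true_demo = np.array([0, 0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred_demo = np.array([0, 0, 0, 1, 1, 1, 2, 2, 0, 2])

cm_demo = confusion_matrix(y_true_demo, y_pred_demo)

# Accuracy = sum of the diagonal / sum of all elements
accuracy_from_cm = np.trace(cm_demo) / cm_demo.sum()

# Per-class recall = correct predictions of a class / true instances of that class
per_class_recall = np.diag(cm_demo) / cm_demo.sum(axis=1)

print(cm_demo)
print('Accuracy from confusion matrix:', accuracy_from_cm)
print('Accuracy from accuracy_score  :', accuracy_score(y_true_demo, y_pred_demo))
print('Recall per class              :', per_class_recall)
```

Applied to the real model, the same per-class calculation on cm would show how well the rare attack types are detected, which a single accuracy figure cannot reveal.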

You can try improving the performance of the model through hyperparameter tuning using GridSearchCV. In conclusion, Machine Learning algorithms can be highly beneficial for increasing the effectiveness of your IDS, and organizations that do not adopt such methods risk leaving their systems more exposed to attack.
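
As a starting point for the tuning step, here is a minimal GridSearchCV sketch. It runs on synthetic data so that it finishes quickly, and the parameter grid is a small hypothetical example rather than a recommended setting. For the real model you would pass X_train and y_train instead and expect a much longer run time.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=0)

# Hypothetical grid: every combination is trained and cross-validated
param_grid = {
    'n_estimators': [25, 50],
    'max_depth': [None, 5],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation
    scoring='accuracy')
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print('Best parameters            :', best_params)
print(f'Best cross-validated score : {best_score:.3f}')
```

After fitting, search.best_estimator_ holds a model retrained on the full training data with the best parameters, and it can be used for prediction just like the classifier above.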
