Intrusion Detection model using Machine Learning algorithm in Python

The internet is the world’s marketplace. Almost any business that wants to grow needs a computer network, and connecting that network to the internet greatly expands the business’s reach, value, and effectiveness. However, a connected network also exposes your data to attack by malicious users, which makes security a critical concern. This is where an Intrusion Detection System (IDS) proves highly beneficial. This article explains how Machine Learning algorithms can increase the effectiveness of an IDS. Before implementing an ML algorithm to build one, let us first understand what an Intrusion Detection System is.
What is an IDS?
An Intrusion Detection System (IDS) can be a device or a software application that works with your network to keep it secure and notifies you when somebody tries to break into your system. It monitors network traffic to search for suspicious activities and known threats.
Types of IDS
A wide variety of IDS is available nowadays. The most common classifications include:
- Network Intrusion Detection Systems (NIDS)
- Host-based Intrusion Detection Systems (HIDS)
- Signature-based Intrusion Detection Systems
- Anomaly-based Intrusion Detection Systems
To learn more about Intrusion Detection Systems, refer to barracuda.com.
IDS using Machine Learning
Machine Learning is the field of study that gives computers the ability to learn from experience and improve without being explicitly programmed. ML algorithms can be classified into three main categories namely:
- Supervised machine learning algorithms:
  Here, the training data is labeled, i.e. a part of the data already has the correct answer.
- Unsupervised machine learning algorithms:
  Here, the training data is unlabeled. The algorithm on its own tries to identify certain patterns or clusters in the data.
- Semi-supervised machine learning algorithms:
  Here, a part of the data is labeled but most of it is unlabeled, and a combination of supervised and unsupervised algorithms can be applied.
Unsupervised machine learning algorithms can learn the normal traffic pattern of the network and report suspicious activities on their own, without requiring a labeled dataset. They can detect new types of intrusions but are highly prone to false-positive alarms. To reduce the number of false positives, we use supervised machine learning algorithms, which handle known attacks efficiently and can also recognize variations of those attacks.
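The trade-off described above can be illustrated on synthetic data. The following is only a minimal sketch, not part of the model built in this article: the two numeric features, the cluster positions, and the choice of IsolationForest as the unsupervised detector are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic "traffic" with two made-up numeric features per connection
normal_train = rng.normal(loc=0.0, scale=1.0, size=(1000, 2))
known_attack = rng.normal(loc=5.0, scale=0.5, size=(50, 2))
normal_test = rng.normal(loc=0.0, scale=1.0, size=(500, 2))
novel_attack = rng.normal(loc=-5.0, scale=0.5, size=(50, 2))  # never seen in training

# Unsupervised: learns what normal traffic looks like, needs no labels
iso = IsolationForest(contamination=0.05, random_state=0).fit(normal_train)
unsup_false_alarms = int((iso.predict(normal_test) == -1).sum())
unsup_novel_detected = int((iso.predict(novel_attack) == -1).sum())

# Supervised: learns from labeled normal traffic and known attacks
X = np.vstack([normal_train, known_attack])
y = np.array([0] * len(normal_train) + [1] * len(known_attack))
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
sup_false_alarms = int((clf.predict(normal_test) == 1).sum())
sup_novel_detected = int((clf.predict(novel_attack) == 1).sum())

print(f"Unsupervised: {unsup_false_alarms} false alarms, "
      f"{unsup_novel_detected}/50 novel attacks flagged")
print(f"Supervised:   {sup_false_alarms} false alarms, "
      f"{sup_novel_detected}/50 novel attacks flagged")
```

On toy data like this, the unsupervised detector would be expected to flag some normal connections because its `contamination` setting forces a fraction of points to be treated as outliers, while the supervised model can only recognize attacks that resemble the ones it was trained on.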
Building an IDS using Machine Learning
Dataset
Here, we will implement an Intrusion Detection model using one of the supervised ML algorithms. The dataset used is the KDD Cup 1999 computer network intrusion detection dataset. It has a total of 42 columns: 41 features plus the target variable, named label. The target variable has 23 classes: normal for legitimate traffic and 22 attack types.
CLASS NAME          NUMBER OF INSTANCES
---------------------------------------
smurf               280790
neptune             107201
normal              97277
back                2203
satan               1589
ipsweep             1247
portsweep           1040
warezclient         1020
teardrop            979
pod                 264
nmap                231
guess_passwd        53
buffer_overflow     30
land                21
warezmaster         20
imap                12
rootkit             10
loadmodule          9
ftp_write           8
multihop            7
phf                 4
perl                3
spy                 2
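The class counts listed above are heavily imbalanced, which is worth keeping in mind when reading any single accuracy figure. A short sketch that quantifies this, using only the counts from the table (the 50-instance threshold for "rare" is an arbitrary choice for illustration):

```python
# Class counts copied from the table above
class_counts = {
    "smurf": 280790, "neptune": 107201, "normal": 97277, "back": 2203,
    "satan": 1589, "ipsweep": 1247, "portsweep": 1040, "warezclient": 1020,
    "teardrop": 979, "pod": 264, "nmap": 231, "guess_passwd": 53,
    "buffer_overflow": 30, "land": 21, "warezmaster": 20, "imap": 12,
    "rootkit": 10, "loadmodule": 9, "ftp_write": 8, "multihop": 7,
    "phf": 4, "perl": 3, "spy": 2,
}

total = sum(class_counts.values())

# Share of all instances that belong to the three largest classes
top3_share = sum(sorted(class_counts.values(), reverse=True)[:3]) / total

# Classes with fewer than 50 instances (arbitrary cut-off for "rare")
rare_classes = [name for name, n in class_counts.items() if n < 50]

print(f"Total instances: {total}")
print(f"Share of the three largest classes: {top3_share:.1%}")
print(f"Classes with fewer than 50 instances: {len(rare_classes)}")
```

Because the three largest classes make up roughly 98% of the data, a model can reach a very high overall accuracy while still performing poorly on the rare attack types.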
Code
We first initialize ‘X’, the set of independent variables (features), and ‘y’, the target variable. The dataset is then split into training and test sets. The test set, i.e. data that the model does not see during the training phase, lets us calculate the model accuracy.
```python
# Importing the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

# Importing the KDDCup99 dataset
dataset = pd.read_csv('KDDCup99.csv')
X = dataset.iloc[:, :-1].values    # every column except the last one (features)
y = dataset.iloc[:, 41:42].values  # column 41, the target variable 'label' (kept 2-D)

# Splitting the dataset into training and test sets (70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
```
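With an imbalanced target, a plain random split can leave a rare class under-represented in one of the two sets. One possible variation, which the article's own code does not use, is the `stratify` parameter of `train_test_split`. The sketch below uses toy labels (90 "normal", 10 "attack") and a placeholder feature column, both invented for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels imitating an imbalanced target
y = np.array(["normal"] * 90 + ["attack"] * 10)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the class proportions the same in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

attack_share_train = np.mean(y_train == "attack")
attack_share_test = np.mean(y_test == "attack")
print(f"Attack share in training set: {attack_share_train:.2f}")
print(f"Attack share in test set:     {attack_share_test:.2f}")
```

Note that stratification needs at least two instances of every class, so very rare classes can still be a problem.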
OneHotEncoding is then applied to the categorical columns of ‘X’ using ColumnTransformer and OneHotEncoder from the scikit-learn library. As the values of the target variable ‘y’ are strings, we apply LabelEncoding to assign an integer value to each category of ‘y’.
```python
''' Data Preprocessing '''

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Applying ColumnTransformer to the categorical columns of X_train and X_test
# (columns 1, 2 and 3). handle_unknown = 'ignore' keeps transform() from failing
# if the test set contains a category that never appears in the training set.
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(handle_unknown = 'ignore'), [1, 2, 3])], remainder = 'passthrough')
X_train = ct.fit_transform(X_train)
X_test = ct.transform(X_test)

# Encoding y_train and y_test as integers; ravel() flattens the single
# target column into the 1-D shape that scikit-learn expects
le_y = LabelEncoder()
y_train = le_y.fit_transform(y_train.ravel())
y_test = le_y.transform(y_test.ravel())
```
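To see what these two encoders actually produce, here is a small sketch on a hand-made four-row table. The column names and values are hypothetical stand-ins chosen to resemble the dataset; `sparse_threshold=0` is set only so that the result is a dense array that is easy to inspect.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Hypothetical miniature of the data: one numeric and three categorical columns
toy = pd.DataFrame({
    "duration": [0, 2, 0, 5],
    "protocol_type": ["tcp", "udp", "icmp", "tcp"],
    "service": ["http", "domain_u", "ecr_i", "smtp"],
    "flag": ["SF", "SF", "SF", "REJ"],
    "label": ["normal", "normal", "smurf", "neptune"],
})
X = toy.iloc[:, :-1].values
y = toy.iloc[:, -1].values

# One-hot encode columns 1-3; the numeric column is passed through unchanged
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(handle_unknown="ignore"), [1, 2, 3])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)
X_encoded = ct.fit_transform(X)

# Map each string label to an integer (classes are sorted alphabetically)
le = LabelEncoder()
y_encoded = le.fit_transform(y)

print(X_encoded.shape)   # 3 + 4 + 2 one-hot columns plus 1 numeric column
print(list(le.classes_))
print(list(y_encoded))
```

Each distinct category becomes its own 0/1 column, so the three categorical columns expand to nine, and `le.inverse_transform` can later turn predicted integers back into class names.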
We implement a RandomForestClassifier, a form of ensemble learning in which a number of decision trees are combined. It aggregates the predictions of its decision trees to decide the final class of each test object.
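```python
# Implementing RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier

# 150 trees; n_jobs = -1 trains them in parallel on all available CPU cores
classifier = RandomForestClassifier(n_estimators = 150, n_jobs = -1)
# ravel() passes the target as a 1-D array, the shape fit() expects
classifier.fit(X_train, y_train.ravel())
```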
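How the trees are combined can be checked by hand. In scikit-learn, the forest averages the class probabilities predicted by its individual trees and picks the class with the highest average, rather than counting one hard vote per tree. The sketch below verifies this on synthetic data generated with `make_classification`; the dataset and its sizes are assumptions for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic 3-class problem (not the KDD data)
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=0
)
forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the per-tree class probabilities by hand
mean_tree_proba = np.mean(
    [tree.predict_proba(X) for tree in forest.estimators_], axis=0
)
manual_pred = forest.classes_[np.argmax(mean_tree_proba, axis=1)]

# Compare against what the forest itself reports
same_proba = np.allclose(mean_tree_proba, forest.predict_proba(X))
same_pred = np.array_equal(manual_pred, forest.predict(X))
print(f"Probabilities match: {same_proba}, predictions match: {same_pred}")
```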
Finally, the classifier predicts the results of the test set. To evaluate the model accuracy, we obtain the confusion matrix whose sum of the diagonal elements is the total number of correct predictions.
$$
\text{Accuracy (\%)} = \frac{\text{total no. of correct predictions (sum of the diagonal elements of the confusion matrix)}}{\text{total no. of predictions (sum of all the elements of the confusion matrix)}} \times 100
$$
```python
# Making predictions on the test set
y_pred = classifier.predict(X_test)

# Evaluating the predicted results
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)

# Calculating the total correct predictions (sum of the diagonal elements)
total_correct_predictions = np.trace(cm)

# Calculating the model accuracy
accuracy = (total_correct_predictions / np.sum(cm)) * 100
print(f'Accuracy obtained on this test set : {accuracy:.2f} %')
```
OUTPUT :
Accuracy obtained on this test set : 99.98 %
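The same confusion matrix can also give a per-class view, which is useful when a few classes dominate the data. The sketch below uses a small hypothetical 3-class confusion matrix (the numbers are invented, not results from this model) and computes the recall of each class as its diagonal element divided by its row sum.

```python
import numpy as np

# Hypothetical confusion matrix: rows are true classes, columns are predictions
cm = np.array([
    [980,  15, 5],   # large class
    [ 10, 485, 5],   # medium class
    [  4,   2, 4],   # rare class
])

# Overall accuracy: correct predictions over all predictions
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: correct predictions for a class over its true instances
recall_per_class = np.diag(cm) / cm.sum(axis=1)

print(f"Accuracy: {accuracy:.3f}")
for i, recall in enumerate(recall_per_class):
    print(f"Class {i} recall: {recall:.2f}")
```

In this made-up example the overall accuracy stays above 97% even though the rare class is recognized less than half of the time, which is why a per-class check is a sensible complement to the single accuracy figure.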
You can try improving the performance of the model through hyperparameter tuning using GridSearchCV. In conclusion, Machine Learning algorithms can be highly beneficial for increasing the efficiency of your IDS, and any company that fails to adopt these methods puts its system security at high risk.
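A minimal sketch of what such tuning could look like is shown below. It runs on a small synthetic dataset so that it finishes quickly; the parameter grid, the number of folds, and the data itself are assumptions for illustration, not tuned recommendations for the KDD Cup 1999 data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset standing in for the encoded training data
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)

# Candidate values to try for two RandomForest hyperparameters
param_grid = {
    "n_estimators": [25, 50],
    "max_depth": [None, 5],
}

# 3-fold cross-validation over every combination in the grid
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.3f}")
```

On the real dataset, `X` and `y` would be replaced by the encoded `X_train` and `y_train`, and the grid would be kept small at first, since every combination is trained once per fold.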