Network Intrusion Detection System Using Machine Learning

Machine learning is one of the fastest-growing domains in technology and is finding applications in numerous fields. The ability to find patterns in data using ML techniques has great scope, and one such use is in computer network security. A major problem in today’s digital age is the presence of hackers, malware, and security threats on the internet. Fortunately, since network protocols often follow fixed and predictable patterns, machine learning algorithms can be trained to detect threats.

In this tutorial, we will implement a network intrusion detection system in Python on the famous KDD Cup 1999 dataset. This dataset was released as part of a data mining challenge and is openly available from the UCI repository.

Background on the Data

The extensive dataset has 494,021 records, 41 input features, and 1 target variable, which tells us the status of the network activity. The network activity can be normal (no threat) or belong to one of 22 categories of network attacks. To make things simpler, we will group the attacks into 4 main categories, namely:

  • DOS: denial of service, e.g., SYN flood
  • R2L: unauthorized access from a remote machine, e.g., password guessing
  • U2R: unauthorized access to local superuser (root) privileges, e.g., various buffer overflow attacks
  • Probe: surveillance and other probing, e.g., port scanning

Initial Setup

We will first read in the data and make some observations. Below, we import all the required Python libraries.

import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Now, we read in the data, which can be accessed via a URL.

url = 'http://kdd.ics.uci.edu/databases/kddcup99/kddcup.data_10_percent.gz'
df = pd.read_csv(url, header=None)
df.head()
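Before moving on, it can help to sanity-check what was loaded. A minimal sketch (the row count below is for the 10% subset used here):

# Sanity check: row/column counts and missing values
print(df.shape)                  # expected: (494021, 42) for the 10% subset
print(df.isnull().sum().sum())   # expected: 0 missing values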

As you can see, there are 42 columns in total, with the final one signifying the output to be predicted. Since the dataset doesn’t come with the columns labeled, we have to label them ourselves. This information is available on the UCI dataset page.

df.columns = ['duration', 'protocol_type', 'service', 'flag', 'src_bytes', 'dst_bytes', 'land',
              'wrong_fragment', 'urgent', 'hot', 'num_failed_logins', 'logged_in', 'num_compromised',
              'root_shell', 'su_attempted', 'num_root', 'num_file_creations', 'num_shells',
              'num_access_files', 'num_outbound_cmds', 'is_host_login', 'is_guest_login', 'count',
              'srv_count', 'serror_rate', 'srv_serror_rate', 'rerror_rate', 'srv_rerror_rate',
              'same_srv_rate', 'diff_srv_rate', 'srv_diff_host_rate', 'dst_host_count',
              'dst_host_srv_count', 'dst_host_same_srv_rate', 'dst_host_diff_srv_rate',
              'dst_host_same_src_port_rate', 'dst_host_srv_diff_host_rate', 'dst_host_serror_rate',
              'dst_host_srv_serror_rate', 'dst_host_rerror_rate', 'dst_host_srv_rerror_rate', 'outcome']
print(df.describe())

Preprocessing

First, let us look at the target variable, outcome.

df['outcome'].unique()
>>> array(['normal.', 'buffer_overflow.', 'loadmodule.', 'perl.', 'neptune.',
       'smurf.', 'guess_passwd.', 'pod.', 'teardrop.', 'portsweep.',
       'ipsweep.', 'land.', 'ftp_write.', 'back.', 'imap.', 'satan.',
       'phf.', 'nmap.', 'multihop.', 'warezmaster.', 'warezclient.',
       'spy.', 'rootkit.'], dtype=object)

As mentioned before, we will consolidate these into 5 classes: the 4 attack categories listed above, plus the normal (safe) connections.

# Map each specific attack type to one of the four broad categories
df = df.replace(to_replace=["ipsweep.", "portsweep.", "nmap.", "satan."], value="probe")
df = df.replace(to_replace=["ftp_write.", "guess_passwd.", "imap.", "multihop.", "phf.", "spy.", "warezclient.", "warezmaster."], value="r2l")
df = df.replace(to_replace=["buffer_overflow.", "loadmodule.", "perl.", "rootkit."], value="u2r")
df = df.replace(to_replace=["back.", "land.", "neptune.", "pod.", "smurf.", "teardrop."], value="dos")


from collections import Counter
Counter(df['outcome'])
>>> Counter({'normal.': 97278,
         'u2r': 52,
         'dos': 391458,
         'r2l': 1126,
         'probe': 4107})

Thus, we now have only 5 output classes as listed above.
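Because the classes are heavily imbalanced (dos and normal dominate, while u2r has only 52 records), a quick bar chart makes this easier to see. A minimal sketch using the seaborn import from earlier:

# Visualize the class imbalance across the 5 outcome categories
plt.figure(figsize=(8, 4))
sns.countplot(x=df['outcome'])
plt.title('Records per outcome class')
plt.show()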

You can see that 38 attributes are numeric, while three attributes, protocol_type, service, and flag, contain string values that need to be converted. We do this with label encoding.

from sklearn import preprocessing

# Encode each categorical column as integers
label_encoder = preprocessing.LabelEncoder()
df['protocol_type'] = label_encoder.fit_transform(df['protocol_type'])
df['service'] = label_encoder.fit_transform(df['service'])
df['flag'] = label_encoder.fit_transform(df['flag'])
df['outcome'] = label_encoder.fit_transform(df['outcome'])
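One caveat: reusing a single LabelEncoder and calling fit_transform on each column refits it every time, so only the last column’s mapping is retained. If you want to map encoded values back to their original strings later, one alternative (a sketch, not part of the original walkthrough) is to keep one encoder per column:

# Alternative to the cell above: one encoder per column, so each mapping
# can be inverted later with encoders[col].inverse_transform(...)
encoders = {}
for col in ['protocol_type', 'service', 'flag', 'outcome']:
    encoders[col] = preprocessing.LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])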

Since there is such a large number of features, some of them may be redundant. Let us plot a correlation matrix to check this.

corr = df.corr()
plt.figure(figsize=(15,12))
sns.heatmap(corr)
plt.show()

Below is the plot that we will get:

[Heatmap of the feature correlation matrix]

We treat pairs of columns with a correlation of 0.97 or higher as highly correlated.

c = corr.columns
x = []  # columns flagged for removal
print('{:>30} {:>30} {:>30}'.format(*["Feature 1", "Feature 2", "Correlation"]))
for i in c:
    for j in c:
        if (corr[i][j] > 0.97) and (i != j) and (i not in x):
            print('{:>30} {:>30} {:>30}'.format(*[i, j, corr[i][j]]))
            x.append(i)

The output will look like the image below:

[Table of highly correlated feature pairs]
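As an aside, an equivalent and more compact way to list highly correlated pairs (a sketch using only the pandas and numpy imports from earlier) is to keep just the upper triangle of the matrix, so each pair is reported once:

# Upper triangle of the correlation matrix, so each pair appears once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.97])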

Therefore, we now drop the flagged columns, each of which has a correlation of 0.97 or more with another column.

# Drop the flagged columns
for i in x:
    if i in df.columns:
        df.drop(i, axis=1, inplace=True)

This brings us down from 41 to 33 input features. It can also be observed that two columns, ‘is_host_login’ and ‘num_outbound_cmds’, have all values equal to 0. Hence, they are redundant and can be dropped as well.

df.drop(['is_host_login', 'num_outbound_cmds'], axis=1, inplace=True)

[Preview of the DataFrame after dropping the redundant columns]
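Rather than hard-coding the two column names, constant columns can also be detected programmatically; a minimal sketch for double-checking:

# Columns with a single unique value carry no information
constant_cols = [col for col in df.columns if df[col].nunique() <= 1]
print(constant_cols)   # should now be empty, since both were dropped above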

We now split the data into input features and target variable, and then create the train and test sets.

X = df.drop(['outcome'], axis=1)
Y = df['outcome']

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=4)
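Given how rare the u2r class is, it may also be worth stratifying the split so that every class appears in the train and test sets in proportion. A hedged alternative to the split above:

# Stratified split preserves class proportions in train and test,
# which matters for rare classes such as u2r (only 52 records)
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, random_state=4, stratify=Y)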

Evaluating a Model

Now that the data is prepared, it is time to select a machine learning model. Let us try a random forest classifier.

from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

# Train a random forest with 20 trees
rf = RandomForestClassifier(random_state=20, n_estimators=20)
rf.fit(X_train, y_train)

Let’s see the results on the test data.

y_preds = rf.predict(X_test)
print(confusion_matrix(y_test, y_preds))
print("Accuracy:", metrics.accuracy_score(y_test, y_preds))
print(metrics.classification_report(y_test, y_preds))

[Confusion matrix, accuracy, and classification report output]

As can be seen, the model performs very well on this dataset, with an overall accuracy of over 99%. The per-class precision and recall values in the classification report are also quite high. The ‘u2r’ category does not perform as well, because only 52 records belong to it.
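As a follow-up analysis (not part of the walkthrough above), the fitted random forest exposes feature importances, which can guide further feature selection:

# Rank input features by importance in the fitted random forest
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))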

You can experiment further with feature selection, deeper analysis, and different ML algorithms.
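For example, cross-validation gives a more robust estimate than a single train/test split. A minimal sketch comparing the random forest against a plain decision tree (the hyperparameters here are illustrative assumptions):

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# 5-fold cross-validated accuracy for two candidate models
for name, model in [('RandomForest', RandomForestClassifier(n_estimators=20, random_state=20)),
                    ('DecisionTree', DecisionTreeClassifier(random_state=20))]:
    scores = cross_val_score(model, X, Y, cv=5)
    print(name, scores.mean())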
