Catching Crooks on the Hook in Python Using Machine Learning

Crime is rising day by day while the number of law enforcers stays limited, so machine learning models can help by predicting whether a person is likely to be a criminal. In this post, we build a model that predicts whether a person is a criminal based on a set of features.

Criminal prediction using ML in Python

Most of the features are categorical (ordinal), except “ANALWT_C”. The dataset is taken from TechGig. You can get the Python notebook, data dictionary, and dataset from https://github.com/abhi9599fds/Posts_code .

Let’s get started.

  •  Import all needed libraries.
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
  •  Load the CSV file using pandas.
    df = pd.read_csv('train.csv')
    print(df.head(2))
          PERID  IFATHER  NRCH17_2  IRHHSIZ2  ...     ANALWT_C    VESTR  VEREP  Criminal
    0  25095143        4         2         4  ...  3884.805998  40026.0    1.0       0.0
    1  13005143        4         1         3  ...  1627.108106  40015.0    2.0       1.0
    
    [2 rows x 72 columns]
  •  Check whether there are any missing values. For this tutorial, we simply drop all rows that contain missing values.
    print(df.isna().sum())
    PERID       0
    IFATHER     0
    NRCH17_2    0
    IRHHSIZ2    0
    IIHHSIZ2    0
               ..
    AIIND102    1
    ANALWT_C    1
    VESTR       1
    VEREP       1
    Criminal    1
    Length: 72, dtype: int64
    
    #The last few columns contain some missing values.
    df.describe()
                  PERID       IFATHER  ...         VEREP      Criminal
    count  3.999900e+04  39999.000000  ...  39998.000000  39998.000000
    mean   5.444733e+07      3.355684  ...      1.494400      0.069778
    std    2.555308e+07      1.176259  ...      0.500125      0.254777
    min    1.000222e+07     -1.000000  ...     -1.000000      0.000000
    25%    3.218566e+07      4.000000  ...      1.000000      0.000000
    50%    5.420020e+07      4.000000  ...      1.000000      0.000000
    75%    7.612463e+07      4.000000  ...      2.000000      0.000000
    max    9.999956e+07      4.000000  ...      2.000000      1.000000
    
    [8 rows x 72 columns]
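As a tiny self-contained illustration (toy data, not the actual train.csv), this is how the missing-value check and drop behave:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the situation above: NaNs appear only in the tail columns
toy = pd.DataFrame({
    "PERID": [1, 2, 3],
    "ANALWT_C": [3884.8, 1627.1, np.nan],
    "Criminal": [0.0, 1.0, np.nan],
})
print(toy.isna().sum().sum())  # 2 missing cells in total
clean = toy.dropna()           # drop any row containing a NaN
print(len(clean))              # 2 rows remain
```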
  • Perform some EDA on the dataset (the full EDA is shown in the Python notebook).
    def plot_dis(var):
      fig, ax = plt.subplots(nrows=1)
      sns.countplot(x=var, hue='Criminal', data=df, ax=ax)
      plt.show()
    
    for i in df.columns[1:]:
      plot_dis(i)
    
    # drop the rows containing missing values
    df.dropna(inplace=True)

    #see notebook for EDA

  • Check the number of samples in each class.
    df['Criminal'].value_counts()
    0.0    37207
    1.0     2791
    Name: Criminal, dtype: int64
  • Split the dataset into training and test sets.
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix , plot_roc_curve
    from imblearn.over_sampling import SMOTE
    smote = SMOTE()
    
    #stratify for equal no. of classes in train and test set
    x_train,x_test ,y_train,y_test = train_test_split(df.iloc[:,1:-1],df.iloc[:,-1], stratify=df.iloc[:,-1],test_size=0.2 ,random_state = 42)
    
    X_re ,y_re= smote.fit_resample(x_train,y_train)
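The stratify argument is what keeps the class ratio identical in the train and test sets. The idea can be shown with a hand-rolled sketch (purely illustrative, not scikit-learn's implementation):

```python
import random

def stratified_split(indices, labels, test_frac=0.2, seed=42):
    """Split indices so every class keeps the same proportion in both halves."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                      # shuffle within each class
        n_test = round(len(idx) * test_frac)  # per-class test share
        test += idx[:n_test]
        train += idx[n_test:]
    return train, test

labels = [0] * 90 + [1] * 10   # ~10% minority, like the Criminal column
train, test = stratified_split(list(range(100)), labels)
print(len(test), sum(labels[i] for i in test))  # 20 2 -> ratio preserved
```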
  • As we saw above, the dataset is imbalanced: criminal cases are a small minority. To address this we use SMOTE (Synthetic Minority Oversampling Technique), which balances the training data by oversampling: it creates synthetic minority-class instances by interpolating between each minority sample and its nearest minority-class neighbors. We balance only the training data, never the test data.
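The interpolation idea behind SMOTE can be sketched in a few lines of NumPy (an illustrative toy, not imblearn's actual implementation):

```python
import numpy as np

def smote_sketch(X_min, k=2, n_new=4, seed=0):
    """Create synthetic minority samples by interpolating between a random
    minority point and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all points
        d[i] = np.inf                                 # exclude the point itself
        j = rng.choice(np.argsort(d)[:k])             # pick a near neighbor
        gap = rng.random()                            # random spot on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(smote_sketch(X_min).shape)  # (4, 2)
```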
  • Tree-based models handle ordinal categorical features well without one-hot encoding. We have used ExtraTreesClassifier.
    clf = ExtraTreesClassifier()
    clf.fit(X_re,y_re)
    
    clf.score(x_test,y_test)
    Output:
    0.94425
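Why do trees cope with raw ordinal codes? Each split is just a threshold test on the code. A toy decision stump on hypothetical data (purely illustrative) makes this concrete:

```python
def best_stump(xs, ys):
    """Find the threshold t minimizing errors of the rule 'predict 1 if x > t'."""
    best_t, best_err = None, len(ys) + 1
    for t in sorted(set(xs)):
        err = sum((x > t) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

xs = [1, 1, 2, 3, 3, 4]    # ordinal category codes
ys = [0, 0, 0, 1, 1, 1]    # class labels
print(best_stump(xs, ys))  # (2, 0): splitting at code 2 separates perfectly
```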
  • Confusion matrix of the test set
    confusion_matrix(y_test, clf.predict(x_test))
    array([[7232,  210],
           [ 236,  322]])
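Because the classes are so imbalanced, accuracy alone can be misleading; precision and recall for the criminal class can be read straight off this confusion matrix:

```python
# Entries of the confusion matrix above: rows = true class, cols = predicted
tn, fp = 7232, 210
fn, tp = 236, 322

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # how many predicted criminals really are
recall = tp / (tp + fn)      # how many true criminals we caught
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.944 precision=0.605 recall=0.577
```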
  • Roc curve to see the fit
    plot_roc_curve( clf,x_test,y_test)

     

  • With this we get an accuracy of about 94%, which is quite good without any feature-engineering techniques.
  • Please refer to https://github.com/abhi9599fds/Posts_code for the code, dataset, and feature descriptions. The EDA is also there.
