Catching Crooks on the Hook in Python Using Machine Learning
Crime is increasing day by day while the number of law enforcers remains comparatively small, so machine learning models can help by predicting whether a person is likely to be a criminal. In this post, we build a model that predicts whether a person is a criminal based on a set of features.
Criminal prediction using ML in Python
Most of the features are categorical (ordinal), except “ANALWT_C”. The dataset is taken from TechGig. You can get the Python notebook, data dictionary, and dataset from https://github.com/abhi9599fds/Posts_code .
Let’s get started.
- Import all needed libraries.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
- Load the CSV file using pandas.
df = pd.read_csv('train.csv')
print(df.head(2))
      PERID  IFATHER  NRCH17_2  IRHHSIZ2  ...     ANALWT_C    VESTR  VEREP  Criminal
0  25095143        4         2         4  ...  3884.805998  40026.0    1.0       0.0
1  13005143        4         1         3  ...  1627.108106  40015.0    2.0       1.0

[2 rows x 72 columns]
- Check whether there are missing values. For this tutorial, we simply drop all rows with missing values.
PERID       0
IFATHER     0
NRCH17_2    0
IRHHSIZ2    0
IIHHSIZ2    0
           ..
AIIND102    1
ANALWT_C    1
VESTR       1
VEREP       1
Criminal    1
Length: 72, dtype: int64
# In the last columns there are some missing values.

              PERID       IFATHER  ...         VEREP      Criminal
count  3.999900e+04  39999.000000  ...  39998.000000  39998.000000
mean   5.444733e+07      3.355684  ...      1.494400      0.069778
std    2.555308e+07      1.176259  ...      0.500125      0.254777
min    1.000222e+07     -1.000000  ...     -1.000000      0.000000
25%    3.218566e+07      4.000000  ...      1.000000      0.000000
50%    5.420020e+07      4.000000  ...      1.000000      0.000000
75%    7.612463e+07      4.000000  ...      2.000000      0.000000
max    9.999956e+07      4.000000  ...      2.000000      1.000000

[8 rows x 72 columns]
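The missing-value counts above come from pandas' `isnull().sum()`. A minimal sketch on a made-up mini-frame (column names borrowed from the dataset, values invented) shows the check followed by the row drop:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real train.csv; values are made up
toy = pd.DataFrame({
    "PERID": [25095143, 13005143, 40015143],
    "IFATHER": [4, 4, np.nan],
    "Criminal": [0.0, 1.0, np.nan],
})

# Count missing values per column (same call used on the real df)
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Drop every row that has at least one missing value
clean = toy.dropna()
print(len(clean))
```

On the real 40k-row dataset this removes only the handful of rows with missing values, so the loss of data is negligible.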
- Perform some EDA on the dataset (the full EDA is shown in the accompanying Python notebook).
def plot_dis(var):
    fig, ax = plt.subplots(nrows=1)
    sns.countplot(x=var, hue='Criminal', data=df, ax=ax)
    plt.show()

for i in df.columns[1:]:
    plot_dis(i)

df.dropna(inplace=True)
# see notebook for EDA
# for checking no. of classes
df['Criminal'].value_counts()
0.0    37207
1.0     2791
Name: Criminal, dtype: int64
- Split the dataset into training and testing sets.
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, plot_roc_curve
from imblearn.over_sampling import SMOTE

smote = SMOTE()

# stratify for an equal proportion of classes in the train and test sets
x_train, x_test, y_train, y_test = train_test_split(
    df.iloc[:, 1:-1], df.iloc[:, -1],
    stratify=df.iloc[:, -1], test_size=0.2, random_state=42)

X_re, y_re = smote.fit_resample(x_train, y_train)
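The `stratify` argument is what preserves the class ratio across both splits. A toy illustration on made-up labels (90 negatives, 10 positives):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 90 zeros, 10 ones
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 9:1 class ratio in both the train and test splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

print(y_tr.sum(), y_te.sum())  # 8 positives in train, 2 in test
```

Without `stratify`, a random split of a rare class this small could easily leave the test set with almost no positive examples.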
- As we have seen, the dataset is imbalanced: criminal cases are rare. To address this we use SMOTE (Synthetic Minority Oversampling Technique) to balance the dataset, applying it only to the training data, never the test data. In brief, SMOTE oversamples the minority class by creating synthetic instances interpolated between existing minority samples and their nearest minority-class neighbours.
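A toy numpy sketch of SMOTE's core step, placing a synthetic point on the segment between a minority sample and one of its minority-class neighbours (illustrative only, not the imblearn implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two made-up minority-class points in feature space
x = np.array([1.0, 2.0])
neighbor = np.array([3.0, 4.0])

# SMOTE's core step: a synthetic point somewhere on the segment x -> neighbor
gap = rng.random()  # random fraction in [0, 1)
synthetic = x + gap * (neighbor - x)

print(synthetic)  # lies between x and neighbor, component-wise
```

Because the new points interpolate between real minority samples rather than duplicating them, the oversampled class has more variety than simple random oversampling would give.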
- Since most features are categorical, tree-based models are a good fit. We use ExtraTreesClassifier.
clf = ExtraTreesClassifier()
clf.fit(X_re, y_re)
clf.score(x_test, y_test)
- Confusion matrix of the test set
array([[7232,  210],
       [ 236,  322]])
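In scikit-learn's convention, rows are actual classes and columns are predicted classes, so the layout is [[TN, FP], [FN, TP]]. A small sketch on made-up labels shows how the matrix above was produced:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels and predictions for illustration
y_true = [0, 0, 0, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 0]

# Rows = actual class, columns = predicted class:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)  # [[3 1] [1 1]]
```

For the real model this would be `confusion_matrix(y_test, clf.predict(x_test))`.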
- Plot the ROC curve to assess the fit.
- The model reaches an accuracy of about 94% without any feature engineering, which is quite good, although with such imbalanced classes accuracy alone should be read with care.
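Since only about 7% of records are criminal, it is worth deriving per-class metrics from the same confusion matrix; plain arithmetic on the four cells is enough:

```python
# Cell values taken from the confusion matrix reported above
tn, fp, fn, tp = 7232, 210, 236, 322

accuracy = (tn + tp) / (tn + fp + fn + tp)   # fraction of all correct predictions
recall = tp / (tp + fn)                      # fraction of criminals actually caught
precision = tp / (tp + fp)                   # fraction of flagged people who are criminals

print(round(accuracy, 3), round(recall, 3), round(precision, 3))
# 0.944 0.577 0.605
```

So while overall accuracy is 94.4%, the model catches only about 58% of the criminal class, which is the more informative number for this task.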
- Please refer to https://github.com/abhi9599fds/Posts_code for the code, dataset, feature descriptions, and EDA.