Catching Crooks on the Hook in Python Using Machine Learning

Crime is rising day by day while the number of law enforcers stays limited, so machine learning models can help by predicting whether a person is likely to be a criminal. In this post, we build a model that predicts whether a person is a criminal based on a set of features.

Criminal prediction using ML in Python

Most of the features are categorical (ordinal), except “ANALWT_C”. The dataset is taken from TechGig. You can get the Python notebook, data dictionary, and dataset from https://github.com/abhi9599fds/Posts_code .

Let’s get started.

  •  Import all needed libraries.
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    
  •  Load the CSV file using pandas.
    df = pd.read_csv('train.csv')
    print(df.head(2))
          PERID  IFATHER  NRCH17_2  IRHHSIZ2  ...     ANALWT_C    VESTR  VEREP  Criminal
    0  25095143        4         2         4  ...  3884.805998  40026.0    1.0       0.0
    1  13005143        4         1         3  ...  1627.108106  40015.0    2.0       1.0
    
    [2 rows x 72 columns]
  •  Check whether there are any missing values. For this tutorial, we simply drop all rows that contain missing values.
    print(df.isna().sum())
    PERID       0
    IFATHER     0
    NRCH17_2    0
    IRHHSIZ2    0
    IIHHSIZ2    0
               ..
    AIIND102    1
    ANALWT_C    1
    VESTR       1
    VEREP       1
    Criminal    1
    Length: 72, dtype: int64
    
    #The last few columns contain some missing values.
    df.describe()
                  PERID       IFATHER  ...         VEREP      Criminal
    count  3.999900e+04  39999.000000  ...  39998.000000  39998.000000
    mean   5.444733e+07      3.355684  ...      1.494400      0.069778
    std    2.555308e+07      1.176259  ...      0.500125      0.254777
    min    1.000222e+07     -1.000000  ...     -1.000000      0.000000
    25%    3.218566e+07      4.000000  ...      1.000000      0.000000
    50%    5.420020e+07      4.000000  ...      1.000000      0.000000
    75%    7.612463e+07      4.000000  ...      2.000000      0.000000
    max    9.999956e+07      4.000000  ...      2.000000      1.000000
    
    [8 rows x 72 columns]
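As a tiny self-contained illustration (toy data, not the actual train.csv), this is how the missing-value check and drop behave:

```python
import numpy as np
import pandas as pd

# Toy frame mirroring the situation above: NaNs appear only in the tail columns
toy = pd.DataFrame({
    "PERID": [1, 2, 3],
    "ANALWT_C": [3884.8, 1627.1, np.nan],
    "Criminal": [0.0, 1.0, np.nan],
})
print(toy.isna().sum().sum())  # 2 missing cells in total
clean = toy.dropna()           # drop any row containing a NaN
print(len(clean))              # 2 rows remain
```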
  • Perform some EDA on the dataset (the full EDA is shown in the Python notebook).
    def plot_dis(var):
      fig, ax = plt.subplots(nrows=1)
      sns.countplot(x=var, hue='Criminal', data=df, ax=ax)
      plt.show()
    
    for i in df.columns[1:]:
      plot_dis(i)
    
    # drop the rows containing missing values
    df.dropna(inplace=True)

    #see notebook for EDA

  • Check the number of samples in each class.
    df['Criminal'].value_counts()
    0.0    37207
    1.0     2791
    Name: Criminal, dtype: int64
  • Split the dataset into training and test sets.
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix , plot_roc_curve
    from imblearn.over_sampling import SMOTE
    smote = SMOTE()
    
    #stratify for equal no. of classes in train and test set
    x_train,x_test ,y_train,y_test = train_test_split(df.iloc[:,1:-1],df.iloc[:,-1], stratify=df.iloc[:,-1],test_size=0.2 ,random_state = 42)
    
    X_re ,y_re= smote.fit_resample(x_train,y_train)
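The stratify argument is what keeps the class ratio identical in the train and test sets. The idea can be shown with a hand-rolled sketch (purely illustrative, not scikit-learn's implementation):

```python
import random

def stratified_split(indices, labels, test_frac=0.2, seed=42):
    """Split indices so every class keeps the same proportion in both halves."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)                      # shuffle within each class
        n_test = round(len(idx) * test_frac)  # per-class test share
        test += idx[:n_test]
        train += idx[n_test:]
    return train, test

labels = [0] * 90 + [1] * 10   # ~10% minority, like the Criminal column
train, test = stratified_split(list(range(100)), labels)
print(len(test), sum(labels[i] for i in test))  # 20 2 -> ratio preserved
```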
  • As we saw above, the dataset is imbalanced: criminal cases are a small minority. To address this we use SMOTE (Synthetic Minority Oversampling Technique), which balances the training data by oversampling: it creates synthetic minority-class instances by interpolating between each minority sample and its nearest minority-class neighbors. We balance only the training data, never the test data.
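The interpolation idea behind SMOTE can be sketched in a few lines of NumPy (an illustrative toy, not imblearn's actual implementation):

```python
import numpy as np

def smote_sketch(X_min, k=2, n_new=4, seed=0):
    """Create synthetic minority samples by interpolating between a random
    minority point and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all points
        d[i] = np.inf                                 # exclude the point itself
        j = rng.choice(np.argsort(d)[:k])             # pick a near neighbor
        gap = rng.random()                            # random spot on the segment
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(smote_sketch(X_min).shape)  # (4, 2)
```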
  • Tree-based models handle ordinal categorical features well without one-hot encoding. We have used ExtraTreesClassifier.
    clf = ExtraTreesClassifier()
    clf.fit(X_re,y_re)
    
    clf.score(x_test,y_test)
    Output:
    0.94425
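Why do trees cope with raw ordinal codes? Each split is just a threshold test on the code. A toy decision stump on hypothetical data (purely illustrative) makes this concrete:

```python
def best_stump(xs, ys):
    """Find the threshold t minimizing errors of the rule 'predict 1 if x > t'."""
    best_t, best_err = None, len(ys) + 1
    for t in sorted(set(xs)):
        err = sum((x > t) != y for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

xs = [1, 1, 2, 3, 3, 4]    # ordinal category codes
ys = [0, 0, 0, 1, 1, 1]    # class labels
print(best_stump(xs, ys))  # (2, 0): splitting at code 2 separates perfectly
```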
  • Confusion matrix of the test set
    confusion_matrix(y_test, clf.predict(x_test))
    array([[7232,  210],
           [ 236,  322]])
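Because the classes are so imbalanced, accuracy alone can be misleading; precision and recall for the criminal class can be read straight off this confusion matrix:

```python
# Entries of the confusion matrix above: rows = true class, cols = predicted
tn, fp = 7232, 210
fn, tp = 236, 322

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # how many predicted criminals really are
recall = tp / (tp + fn)      # how many true criminals we caught
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
# accuracy=0.944 precision=0.605 recall=0.577
```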
  • Roc curve to see the fit
    plot_roc_curve( clf,x_test,y_test)

     

  • With this we get an accuracy of about 94%, which is quite good without any feature-engineering techniques.
  • Please refer to https://github.com/abhi9599fds/Posts_code for the code, dataset, and feature descriptions. The EDA is also there.
