Loan Prediction Project using Machine Learning in Python

The Loan Prediction dataset is a staple for beginners in data science: it lets you practice supervised learning, more precisely a classification problem. That is why I would like to walk you through an analysis of it.

We have historical data on loan applications and their outcomes. Wherever there is data, there is something interesting for a data scientist, and exploring this dataset turned up several interesting facts about loan approval.

The first part focuses on data analysis and data visualization. The second part covers the algorithm used to tackle the problem.

The purpose of this analysis is to predict loan eligibility.

  • Here I have provided a data set.

To proceed, we need to download the test and train data sets.

test and train dataset.zip

# Importing Library
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder

# Reading the training dataset in a dataframe using Pandas
df = pd.read_csv("train.csv")

# Reading the test dataset in a dataframe using Pandas
test = pd.read_csv("test.csv")

# Display the first 10 rows of the training dataset
df.head(10)

Output:
First 10 rows of the training dataset

# Store total number of observation in training dataset
df_length =len(df)

# Store total number of columns in testing data set
test_col = len(test.columns)

Understanding the various features (columns) of the dataset:

# Summary of numerical variables for training data set

df.describe()

For the categorical variables (e.g. Property_Area, Credit_History, etc.), we can look at the frequency distribution to check whether the values make sense.

# Get the unique values and their frequency of variable Property_Area

df['Property_Area'].value_counts()

Output:

Semiurban    233
Urban        202
Rural        179
Name: Property_Area, dtype: int64
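The same frequency check can be repeated for any categorical column. Below is a minimal sketch, assuming a small made-up frame in place of train.csv; the helper name frequency_table is hypothetical and not part of the original code. It keeps missing values visible and adds each category's share next to its count.

```python
import pandas as pd

# Toy stand-in for the training data; these values are made up for illustration.
toy = pd.DataFrame({
    "Property_Area": ["Urban", "Rural", "Semiurban", "Urban", "Semiurban", "Semiurban"],
    "Credit_History": [1.0, 0.0, 1.0, 1.0, None, 1.0],
})

def frequency_table(frame, column):
    """Return counts and shares for one column, keeping missing values visible."""
    counts = frame[column].value_counts(dropna=False)
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})

area_table = frequency_table(toy, "Property_Area")
credit_table = frequency_table(toy, "Credit_History")
print(area_table)
print(credit_table)
```

On the real data you would call frequency_table(df, 'Property_Area') and so on for each categorical column.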

Understanding the Distribution of Numerical Variables

  • ApplicantIncome
  • LoanAmount
# Box Plot for understanding the distributions and to observe the outliers.

%matplotlib inline

# Histogram of variable ApplicantIncome

df['ApplicantIncome'].hist()

 

# Box Plot for variable ApplicantIncome of training data set

df.boxplot(column='ApplicantIncome')

The above Box Plot confirms the presence of a lot of outliers/extreme values. This can be attributed to the income disparity in the society.
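A box plot flags points that lie more than 1.5 times the interquartile range beyond the quartiles. The sketch below applies that same rule by hand so the outliers can be counted. The income values are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Made-up incomes; the last two are deliberately extreme.
incomes = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 4500, 5000, 20000, 81000])

# Quartiles and interquartile range
q1 = incomes.quantile(0.25)
q3 = incomes.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the usual 1.5 * IQR rule
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the limits are what the box plot draws as individual points
outliers = incomes[(incomes < lower) | (incomes > upper)]
print(outliers.tolist())
```

On the real data, replacing incomes with df['ApplicantIncome'] should give the number of points drawn beyond the whiskers.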

# Box Plot for variable ApplicantIncome by variable Education of training data set

df.boxplot(column='ApplicantIncome', by = 'Education')

We can see that there is no substantial difference between the typical (median) incomes of graduates and non-graduates. However, more graduates have very high incomes, and these appear as outliers.

# Histogram of variable LoanAmount

df['LoanAmount'].hist(bins=50)

# Box Plot for variable LoanAmount of training data set

df.boxplot(column='LoanAmount')

# Box Plot for variable LoanAmount by variable Gender of training data set

df.boxplot(column='LoanAmount', by = 'Gender')

LoanAmount has missing as well as extreme values, while ApplicantIncome has a few extreme values.
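Missing values can be counted directly rather than read off the plots. This is a minimal sketch, assuming a tiny made-up frame in place of the training data. It shows the count and the share of missing entries per column.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; NaN/None mark the missing entries.
toy = pd.DataFrame({
    "ApplicantIncome": [5849, 4583, 3000, 2583],
    "LoanAmount": [np.nan, 128.0, 66.0, np.nan],
    "Gender": ["Male", None, "Male", "Female"],
})

# Number of missing values in each column
missing = toy.isnull().sum()

# Fraction of rows that are missing in each column
missing_share = toy.isnull().mean()

print(missing)
print(missing_share)
```

The same two calls on df list every column of the training set that needs imputation.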

Understanding Distribution of Categorical Variables:

# Loan approval rates in absolute numbers
loan_approval = df['Loan_Status'].value_counts()['Y']
print(loan_approval)

Output:

422

422 loans were approved.

# Credit History and Loan Status
pd.crosstab(df['Credit_History'], df['Loan_Status'], margins=True)

# Function to output row-wise percentages in a cross table
def percentageConvert(ser):
    # The last entry of each row is the 'All' margin, i.e. the row total
    return ser / float(ser.iloc[-1])

# Loan approval rate for customers having Credit_History (1)
# Store the table in its own variable so the training dataframe df is not overwritten
credit_status = pd.crosstab(df["Credit_History"], df["Loan_Status"], margins=True).apply(percentageConvert, axis=1)
loan_approval_with_Credit_1 = credit_status.loc[1.0, 'Y']
print(loan_approval_with_Credit_1*100)

Output:
79.57894736842105

79.58 % of the applicants with Credit_History equal to 1 had their loans approved.

credit_status['Y']

Output:

Credit_History
0.0    0.078652
1.0    0.795789
All    0.682624
Name: Y, dtype: float64
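The row-wise rates above can also be obtained without a helper function, because pd.crosstab accepts normalize="index". The sketch below uses a made-up six-row frame rather than the real data, so its numbers are illustrative only.

```python
import pandas as pd

# Made-up applications: four with a credit history, two without.
toy = pd.DataFrame({
    "Credit_History": [1.0, 1.0, 1.0, 1.0, 0.0, 0.0],
    "Loan_Status": ["Y", "Y", "Y", "N", "N", "Y"],
})

# normalize="index" divides every row by its own total,
# so each row holds the approval and rejection shares for that credit history.
rates = pd.crosstab(toy["Credit_History"], toy["Loan_Status"], normalize="index")

approval_rate_with_history = rates.loc[1.0, "Y"]
print(rates)
print(approval_rate_with_history)
```

Because no temporary margins column is involved, there is nothing to divide out afterwards.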

# Replace missing values of Self_Employed with the most frequent category
df['Self_Employed'] = df['Self_Employed'].fillna('No')

Outliers of LoanAmount and Applicant Income:

# Add both ApplicantIncome and CoapplicantIncome to TotalIncome
df['TotalIncome'] = df['ApplicantIncome'] + df['CoapplicantIncome']

# Looking at the distribution of LoanAmount
df['LoanAmount'].hist(bins=20)

The extreme values are plausible in practice, i.e. some people might apply for high-value loans due to specific needs. So instead of treating them as outliers, let’s try a log transformation to reduce their effect:

# Perform log transformation of LoanAmount to make it closer to normal
df['LoanAmount_log'] = np.log(df['LoanAmount'])

# Looking at the distribution of LoanAmount_log
df['LoanAmount_log'].hist(bins=20)
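To check that the log transformation really pulls in the long right tail, the skewness can be compared before and after. A minimal sketch, assuming made-up loan amounts with one extreme value:

```python
import numpy as np
import pandas as pd

# Made-up loan amounts; 700 plays the role of an extreme but plausible value.
loan_amounts = pd.Series([66, 95, 100, 110, 120, 128, 141, 158, 267, 700], dtype=float)

# Skewness of the raw values: large and positive because of the long right tail
skew_before = loan_amounts.skew()

# Skewness after the log transformation: still positive here, but smaller
skew_after = np.log(loan_amounts).skew()

print(skew_before, skew_after)
```

Note that np.log leaves missing values as NaN and returns -inf for zero, so a column that can contain zeros needs np.log1p or a prior adjustment.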

Data Preparation for Model Building:

  • sklearn requires all inputs to be numeric, so we should convert all our categorical variables into numeric values by encoding the categories. Before that, we will fill in all the missing values in the dataset.

# Impute missing values for Gender
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])

# Impute missing values for Married
df['Married'] = df['Married'].fillna(df['Married'].mode()[0])

# Impute missing values for Dependents
df['Dependents'] = df['Dependents'].fillna(df['Dependents'].mode()[0])

# Impute missing values for Credit_History
df['Credit_History'] = df['Credit_History'].fillna(df['Credit_History'].mode()[0])

# Convert all non-numeric values to numbers
cat = ['Gender','Married','Dependents','Education','Self_Employed','Credit_History','Property_Area']

for var in cat:
    le = preprocessing.LabelEncoder()
    df[var] = le.fit_transform(df[var].astype('str'))

df.dtypes

Output:
Loan_ID               object
Gender                 int64
Married                int64
Dependents             int64
Education              int64
Self_Employed          int64
ApplicantIncome        int64
CoapplicantIncome    float64
LoanAmount           float64
Loan_Amount_Term     float64
Credit_History         int64
Property_Area          int64
Loan_Status           object
dtype:                object
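To make the encoding step above concrete, the sketch below shows what LabelEncoder does to a single column: it sorts the distinct categories and numbers them from zero, and inverse_transform maps the numbers back. The four area values are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up category values
areas = ["Urban", "Rural", "Semiurban", "Urban"]

encoder = LabelEncoder()
codes = encoder.fit_transform(areas)

# classes_ holds the sorted categories; their positions are the codes
mapping = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))

# inverse_transform only works for codes this encoder was fitted on
decoded = encoder.inverse_transform(codes)

print(mapping)
print(list(codes))
print(list(decoded))
```

An encoder can only decode labels it was fitted on, which is why reusing one encoder variable across columns needs care.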

Generic Classification Function:

#Import models from scikit learn module:
from sklearn import metrics
from sklearn.model_selection import KFold

#Generic function for making a classification model and assessing performance:

def classification_model(model, data, predictors, outcome):
    #Fit the model:
    model.fit(data[predictors], data[outcome])

    #Make predictions on training set:
    predictions = model.predict(data[predictors])

    #Print accuracy
    accuracy = metrics.accuracy_score(data[outcome], predictions)
    print("Accuracy : %s" % "{0:.3%}".format(accuracy))

    #Perform k-fold cross-validation with 5 folds
    kf = KFold(n_splits=5)
    scores = []
    for train_idx, test_idx in kf.split(data[predictors]):
        # Filter training data
        train_predictors = data[predictors].iloc[train_idx, :]

        # The target we're using to train the algorithm.
        train_target = data[outcome].iloc[train_idx]

        # Training the algorithm using the predictors and target.
        model.fit(train_predictors, train_target)

        #Record the accuracy from each cross-validation run
        scores.append(model.score(data[predictors].iloc[test_idx, :], data[outcome].iloc[test_idx]))

    print("Cross-Validation Score : %s" % "{0:.3%}".format(np.mean(scores)))

    #Fit the model again so that it can be referred to outside the function:
    model.fit(data[predictors],data[outcome])
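The manual fold loop in the function above can also be expressed with scikit-learn's cross_val_score helper. The sketch below is an alternative under stated assumptions, not the article's own method: the features and labels are synthetic (random 0/1 columns, with the label mostly following the first column) so that the block runs without train.csv.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data: three binary predictors, label follows the first one
# except for roughly 10% of rows that are flipped as noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(100, 3))
y = X[:, 0].copy()
flip = rng.random(100) < 0.1
y[flip] = 1 - y[flip]

# Five unshuffled folds; cross_val_score fits a fresh copy
# of the model on each training split and scores it on the held-out split.
scores = cross_val_score(LogisticRegression(), X, y, cv=KFold(n_splits=5), scoring="accuracy")
mean_score = scores.mean()

print(scores)
print(mean_score)
```

With the real data, X would be train_modified[predictors] and y the encoded Loan_Status column.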

Model Building:

#Combining both train and test dataset

#Create a flag for Train and Test Data set
df['Type']='Train' 
test['Type']='Test'
fullData = pd.concat([df,test],axis=0, sort=True)

#Look at the available missing values in the dataset
fullData.isnull().sum()

Output:

ApplicantIncome        0
CoapplicantIncome      0
Credit_History        29
Dependents            10
Education              0
Gender                11
LoanAmount            27
LoanAmount_log       389
Loan_Amount_Term      20
Loan_ID                0
Loan_Status          367
Married                0
Property_Area          0
Self_Employed         23
Type                   0
dtype:             int64

#Identify categorical and continuous variables
ID_col = ['Loan_ID']
target_col = ["Loan_Status"]
cat_cols = ['Credit_History','Dependents','Gender','Married','Education','Property_Area','Self_Employed']

#Imputing missing values with mean for continuous variables
fullData['LoanAmount'] = fullData['LoanAmount'].fillna(fullData['LoanAmount'].mean())
fullData['LoanAmount_log'] = fullData['LoanAmount_log'].fillna(fullData['LoanAmount_log'].mean())
fullData['Loan_Amount_Term'] = fullData['Loan_Amount_Term'].fillna(fullData['Loan_Amount_Term'].mean())
fullData['ApplicantIncome'] = fullData['ApplicantIncome'].fillna(fullData['ApplicantIncome'].mean())
fullData['CoapplicantIncome'] = fullData['CoapplicantIncome'].fillna(fullData['CoapplicantIncome'].mean())

#Imputing missing values with mode for categorical variables
fullData['Gender'] = fullData['Gender'].fillna(fullData['Gender'].mode()[0])
fullData['Married'] = fullData['Married'].fillna(fullData['Married'].mode()[0])
fullData['Dependents'] = fullData['Dependents'].fillna(fullData['Dependents'].mode()[0])
#(No effect here: Loan_Amount_Term was already filled with its mean above)
fullData['Loan_Amount_Term'] = fullData['Loan_Amount_Term'].fillna(fullData['Loan_Amount_Term'].mode()[0])
fullData['Credit_History'] = fullData['Credit_History'].fillna(fullData['Credit_History'].mode()[0])

#Create a new column as Total Income

fullData['TotalIncome']=fullData['ApplicantIncome'] + fullData['CoapplicantIncome']

fullData['TotalIncome_log'] = np.log(fullData['TotalIncome'])

#Histogram for Total Income
fullData['TotalIncome_log'].hist(bins=20)
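
#Create label encoders for categorical features
for var in cat_cols:
    number = LabelEncoder()
    fullData[var] = number.fit_transform(fullData[var].astype('str'))

#Split back into train and test; copy so that later assignments do not act on a slice of fullData
train_modified = fullData[fullData['Type']=='Train'].copy()
test_modified = fullData[fullData['Type']=='Test'].copy()

#Fit a fresh encoder on the target so that it can decode the predictions later
number = LabelEncoder()
train_modified["Loan_Status"] = number.fit_transform(train_modified["Loan_Status"].astype('str'))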

Logistic Regression Model:

The chances of getting a loan will be higher for:
  • Applicants having a credit history (we observed this in exploration).
  • Applicants with higher applicant and co-applicant incomes.
  • Applicants with a higher education level.
  • Properties in urban areas with high growth prospects.

So let’s make our model with ‘Credit_History’, ‘Education’ & ‘Gender’.

from sklearn.linear_model import LogisticRegression


predictors_Logistic=['Credit_History','Education','Gender']

x_train = train_modified[list(predictors_Logistic)].values
y_train = train_modified["Loan_Status"].values

x_test=test_modified[list(predictors_Logistic)].values
# Create logistic regression object
model = LogisticRegression()

# Train the model using the training sets
model.fit(x_train, y_train)

#Predict Output
predicted= model.predict(x_test)

#Reverse encoding for predicted outcome
predicted = number.inverse_transform(predicted)

#Store it to test dataset
test_modified['Loan_Status']=predicted

outcome_var = 'Loan_Status'

classification_model(model, df,predictors_Logistic,outcome_var)

test_modified.to_csv("Logistic_Prediction.csv",columns=['Loan_ID','Loan_Status'])

Output:

Accuracy : 80.945%
Cross-Validation Score : 80.946%
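To put the accuracy in context, it helps to compare it with a trivial baseline that approves every application. The arithmetic sketch below uses only numbers shown earlier in this article: 422 approved loans, and 233 + 202 + 179 training rows from the Property_Area counts (assuming, as the missing-value table suggests, that Property_Area has no missing entries).

```python
# Counts taken from the outputs shown earlier in the article
approved = 422
total = 233 + 202 + 179  # Semiurban + Urban + Rural rows

# Accuracy of always predicting "approved"
baseline_accuracy = approved / total

# Training accuracy reported for the logistic regression model
model_accuracy = 0.80945

improvement = model_accuracy - baseline_accuracy
print("Baseline : {0:.3%}".format(baseline_accuracy))
print("Gain     : {0:.3%}".format(improvement))
```

The model's gain over this baseline is a more informative figure than the raw accuracy alone.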

NOTE: This Project works best in Jupyter notebook.

8 responses to “Loan Prediction Project using Machine Learning in Python”

  1. Tawfiq says:

    Code is showing error after replacing self_employed value from true to no, Sir.

    • Sanskar Dwivedi says:

      https://drive.google.com/open?id=113KSST6C7PCfKoCDbdK-R-aZX-SypQX7
      Hi Tawfiq, Here is the link through which you can download the working code of the above article
      It will help you.

      • abc says:

        KeyError: ‘Self_Employed’

        During handling of the above exception, another exception occurred:

        KeyError Traceback (most recent call last)
        2 frames
        /usr/local/lib/python3.6/dist-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
        2646 return self._engine.get_loc(key)
        2647 except KeyError:
        -> 2648 return self._engine.get_loc(self._maybe_cast_indexer(key))
        2649 indexer = self.get_indexer([key], method=method, tolerance=tolerance)
        2650 if indexer.ndim > 1 or indexer.size > 1:
        how to proceed???
        please reply fast….

      • arpit says:

        getting error in replace from google drive

  2. aswin adithiya says:

    im getting the following error:
    —————————————————————————
    ValueError Traceback (most recent call last)
    in
    7 predicted=model.predict(x_test)
    8 #Reverse encoding for predicted outcome
    —-> 9 predicted=number.inverse_transform(predicted)
    10
    11 test_modified[‘Loan_Status’]=predicted
    ValueError: y contains previously unseen labels: [‘N’ ‘Y’]

  3. Priya says:

    Hi,
    could you help me getting the train and test data

    Thanks in advance 🙂

  4. Jayashree T says:

    Sir,could you please provide Logistic_Prediction.csv file .

  5. S.Nithiya says:

    Guys, let my comments may be useful for someone who having repeated error in key value, here we are comparing different fields to get understanding of the data in the different forms of boxplot and histogram.
    Up to credit history we are doing with df variable so it stores the last credit history value in df.
    so every time we have to run the first train dataset code save as df to be and handle the remaining process to be followed.
