Logistics Regression in python

Post Views: 1,551

Hi, today we are going to learn about Logistic Regression in Python. It is strongly recommended that you should have knowledge about regression and linear regression. Please watch this post – Fitting dataset into Linear Regression model .

What is Logistic Regression

Basically, Regression divided into 3 different types.

Linear Regression
Logistic Regression
Polynomial Regression

So, Logistic regression is another type of regression. Regression used for predictive analysis. It is used for building a predictive model. Regression creates a relationship (equation) between the dependent variable and independent variable. In logistic regression, the outcome will be in Binary format like 0 or 1, High or Low, True or False, etc. The regression line will be an S Curve or Sigmoid Curve. The function of sigmoid is ( Y/1-Y). So we can say logistic regression is used to get classified output.

Difference between Linear Regression and Logistic Regression

Linear Regression Graph

Linear regression graph

Logistic Regression Graph

Logistic Regression Graph

In Linear Regression: We used continuous data of Y.

In Logistic Regression: We used discrete or binary data of Y.

In Linear Regression: Outcome will be a decimal value.

In Logistic Regression: Outcome will be classified or binary like True Or False, High or Low, etc.

In Linear Regression: Regressor will be a straight line.

In Logistic Regression: Regressor line will be an S curve or Sigmoid curve.

In Linear Regression: Follows the equation: Y= mX+C.

In Logistic Regression: Follows the equation: Y= e^x + e^-x .

In Linear Regression: Example: House price prediction, Temperature prediction etc.

In Logistic Regression: Example: car purchasing prediction, rain prediction, etc.

The basic theoretical part of Logistic Regression is almost covered. Let’s see how to implement in python.

Logistic Regression in Python

We are going to predict if a patient will be a victim of Heart Diseases.

Here we use a dataset from Kaggle.

Dataset Name is: “framingham.csv”

URL: https://www.kaggle.com/dileep070/heart-disease-prediction-using-logistic-regression/downloads/framingham.csv/1

This is a Heart diseases records.

In this data set values are in 2 different types :

Continuous: Real value
Binary: “1”, means “Yes”, “0” means “No”

This dataset’s column details are:

male : male or female
age: Age of the patient
currentSmoker : whether or not the patient is a current smoker
cigsPerDay : the number of cigarettes that the person smoked on average in one day
BPMeds : whether or not the patient was on blood pressure medication
prevalentStroke : whether or not the patient had previously had a stroke
prevalentHyp: whether or not the patient was hypertensive
diabetes: whether or not the patient had diabetes
totChol: total cholesterol level
sysBP : systolic blood pressure
diaBP : diastolic blood pressure
BMI: Body Mass Index
heartRate : heart rate
glucose : glucose level
TenYearCHD : 10-year risk of coronary heart disease

Required Library:

Numpy Library
Pandas Library
Sklearn Library

Let’s go for the code:

import numpy as np
import pandas as pd

data = pd.read_csv("framingham.csv") #importing the dataset
data.sample(5)

Output:

Dataframe output Image:

dataframe output

Explain:

Here we import Pandas and Numpy library and also import the “framingham.csv” dataset and stored into the data variable as a pandas dataframe.

data.drop(['education'],axis=1,inplace=True) # removing the 'education' column
data.shape # checking the shape

Output:

(4238, 15)

Explain:

Here we remove the “education” column. It is unnecessary for the prediction.

And we check the shape of the dataframe.

data.isnull().sum() #checking if any null value present

Output:

male                 0
age                  0
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64

Explain:

Here we check if any null value is present or not. It is strongly recommended not to use any null/Nan value for fitting the dataset. And we found a lot of Null values are present in our data set.

data = data.dropna() # Remove the null values row
data.isnull().sum() # Check if any null value present

Output:

male               0
age                0
currentSmoker      0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
TenYearCHD         0
dtype: int64

Explain:

Here we remove those rows where any null/Nan value was present.

Then we check if any null/Nan value is present or not.

We didn’t find any null value. So we can perform our next task.

data.shape #Check the shape

Output:

(3749, 15)

Explain:

We check the shape of the current dataset. we got 3,749 rows and 15 columns. Which is enough to make a small predictive model.

data.dtypes #checking the data types

Output:

male                 int64
age                  int64
currentSmoker        int64
cigsPerDay         float64
BPMeds             float64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol            float64
sysBP              float64
diaBP              float64
BMI                float64
heartRate          float64
glucose            float64
TenYearCHD           int64
dtype: object

Explain:

Here we check which column has which data type. It is necessary to make all column to numeric for fitting any model. Here all are in Numeric data type, which is good for us.

data['cigsPerDay'] = data['cigsPerDay'].astype(dtype='int64')
data['BPMeds'] = data['BPMeds'].astype(dtype='int64')
data['totChol'] = data['totChol'].astype(dtype='int64')
data['heartRate'] = data['heartRate'].astype(dtype='int64')
data['glucose'] = data['glucose'].astype(dtype='int64')

data.dtypes #checking the data types

Output:

male                 int64
age                  int64
currentSmoker        int64
cigsPerDay           int64
BPMeds               int64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol              int64
sysBP              float64
diaBP              float64
BMI                float64
heartRate            int64
glucose              int64
TenYearCHD           int64
dtype: object

Explain:

We changed many columns’ data type as the integer for our prediction. It is not mandatory.

X = data.iloc[:,0:-1] # All columns except last one as X
y = data.iloc[:,-1] # Only last column as y

Explain:

We make an X variable and put all columns, except the last one. And we make y variable and put only last column.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.30,random_state=1) #splitting the data as train and test
X_train.shape
X_test.shape

Output:

(2624, 14)
(1125, 14)

Explain:

Here we splitting the X and y into X_train, X_test and y_train,y_test. Into 70:30 ratio. And we check the shape of them.

from sklearn.linear_model import LogisticRegression
l_reg = LogisticRegression() # Making a logistic regression model
l_reg.fit(X_train,y_train) # Fitting the data

Explain:

We make a l_reg logistic regression model. And we fit the X_train & y_train data.

y_pred = l_reg.predict(X_test) # Predict the X_test data
from sklearn import metrics
metrics.accuracy_score(y_test,y_pred) # calculate the accuracy

Output:

0.8497777777777777

Explain:

Here we predict the X_test data and store into the y_pred variable. Then we check the accuracy score.

We got accuracy score as 0.8497777777777777 means almost 85% accurate prediction which is pretty good. Thank you.

The whole program is available here: Logistics regression( Download from here )

You can also like to read:

One response to “Logistics Regression in python”

Daniel says:

April 10, 2020 at 10:27 pm

Sorry Purnendu Das, but actually the performance of the model is mostly bad. The accuracy score wont help this time due to the fact that the dataset is actually unbalanaced. It is necessary to do a ROC and AUC score to really understand what is going on.

Reply

Logistics Regression in python

What is Logistic Regression

Difference between Linear Regression and Logistic Regression

Logistic Regression in Python

One response to “Logistics Regression in python”

Leave a Reply Cancel reply

Related Posts