# Logistics Regression in python

Hi, today we are going to learn about **Logistic Regression in Python**. It is strongly recommended that you should have knowledge about regression and linear regression. Please watch this post – Fitting dataset into Linear Regression model .

## What is Logistic Regression

Basically, Regression divided into 3 different types.

**Linear Regression****Logistic Regression****Polynomial Regression**

So, Logistic regression is another type of regression. Regression used for predictive analysis. It is used for building a predictive model. Regression creates a relationship (equation) between the dependent variable and independent variable. In logistic regression, the outcome will be in Binary format like 0 or 1, High or Low, True or False, etc. The regression line will be an **S Curve** or **Sigmoid Curve**. The function of sigmoid is **( Y/1-Y). **So we can say logistic regression is used to get classified output.

### Difference between Linear Regression and Logistic Regression

**Linear Regression Graph**

**Logistic Regression Graph**

**In Linear Regression**: We used continuous data of Y.

**In Logistic Regression**: We used discrete or binary data of Y.

**In Linear Regression**: Outcome will be a decimal value.

**In Logistic Regression**: Outcome will be classified or binary like True Or False, High or Low, etc.

**In Linear Regression**: Regressor will be a straight line.

**In Logistic Regression**: Regressor line will be an S curve or Sigmoid curve.

**In Linear Regression**: Follows the equation: Y= mX+C.

**In Logistic Regression**: Follows the equation: Y= e^x + e^-x .

**In Linear Regression**: Example: House price prediction, Temperature prediction etc.

**In Logistic Regression**: Example: car purchasing prediction, rain prediction, etc.

The basic theoretical part of Logistic Regression is almost covered. Let’s see how to implement in python.

## Logistic Regression in Python

We are going to predict if a patient will be a victim of **Heart Diseases**.

Here we use a dataset from Kaggle.

Dataset Name is: **“framingham.csv”**

This is a Heart diseases records.

In this data set values are in 2 different types :

- Continuous: Real value
- Binary: “1”, means “Yes”, “0” means “No”

**This dataset’s column details are:**

**male**: male or female**age**: Age of the patient**currentSmoker**: whether or not the patient is a current smoker**cigsPerDay**: the number of cigarettes that the person smoked on average in one day**BPMeds**: whether or not the patient was on blood pressure medication**prevalentStroke**: whether or not the patient had previously had a stroke**prevalentHyp**: whether or not the patient was hypertensive**diabetes**: whether or not the patient had diabetes**totChol**: total cholesterol level**sysBP**: systolic blood pressure**diaBP**: diastolic blood pressure**BMI**: Body Mass Index**heartRate**: heart rate**glucos**e : glucose level**TenYearCHD**: 10-year risk of coronary heart disease

Required Library:

**Numpy Library****Pandas Library****Sklearn Library**

Let’s go for the code:

import numpy as np import pandas as pd data = pd.read_csv("framingham.csv") #importing the dataset data.sample(5)

**Output:**

Dataframe output Image:

**Explain:**

Here we import **Pandas** and **Numpy** library and also import the **“framingham.csv”** dataset and stored into the **data** variable as a pandas dataframe.

data.drop(['education'],axis=1,inplace=True) # removing the 'education' column data.shape # checking the shape

**Output:**

(4238, 15)

**Explain:**

Here we remove the **“education”** column. It is unnecessary for the prediction.

And we check the shape of the dataframe.

data.isnull().sum() #checking if any null value present

Output:

data.dtypes #checking the data types

**Output:**

**Explain:**

Here we check which column has which data type. It is necessary to make all column to numeric for fitting any model. Here all are in Numeric data type, which is good for us.

data['cigsPerDay'] = data['cigsPerDay'].astype(dtype='int64') data['BPMeds'] = data['BPMeds'].astype(dtype='int64') data['totChol'] = data['totChol'].astype(dtype='int64') data['heartRate'] = data['heartRate'].astype(dtype='int64') data['glucose'] = data['glucose'].astype(dtype='int64') data.dtypes #checking the data types

**Output:**

male int64 age int64 currentSmoker int64 cigsPerDay int64 BPMeds int64 prevalentStroke int64 prevalentHyp int64 diabetes int64 totChol int64 sysBP float64 diaBP float64 BMI float64 heartRate int64 glucose int64 TenYearCHD int64 dtype: object

**Explain:**

We changed many columns’ data type as the integer for our prediction. It is not mandatory.

X = data.iloc[:,0:-1] # All columns except last one as X y = data.iloc[:,-1] # Only last column as y

**Explain:**

We make an** X** variable and put all columns, except the last one. And we make** y** variable and put only last column.

from sklearn.model_selection import train_test_split X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=.30,random_state=1) #splitting the data as train and test X_train.shape X_test.shape

**Output:**

(2624, 14) (1125, 14)

**Explain:**

Here we splitting the** X** and **y** into **X_train, X_test** and **y_train,y_test**. Into 70:30 ratio. And we check the shape of them.

from sklearn.linear_model import LogisticRegression l_reg = LogisticRegression() # Making a logistic regression model l_reg.fit(X_train,y_train) # Fitting the data

**Explain:**

We make a **l_reg** logistic regression model. And we fit the **X_train** & **y_train** data.

y_pred = l_reg.predict(X_test) # Predict the X_test data from sklearn import metrics metrics.accuracy_score(y_test,y_pred) # calculate the accuracy

**Output:**

**Explain: **

Here we predict the **X_test** data and store into the** y_pred** variable. Then we check the accuracy score.

We got accuracy score as 0.8497777777777777 means almost 85% accurate prediction which is pretty good. Thank you.

The whole program is available here: Logistics regression( Download from here )

You can also like to read:

- Fitting dataset into Linear Regression model
- A brief understanding on supervised learning – Machine Learning

Sorry Purnendu Das, but actually the performance of the model is mostly bad. The accuracy score wont help this time due to the fact that the dataset is actually unbalanaced. It is necessary to do a ROC and AUC score to really understand what is going on.