Logistic Regression in Python
Hi, today we are going to learn about Logistic Regression in Python. It is strongly recommended that you first have some knowledge of regression in general and of linear regression in particular. If you need a refresher, please read this post: Fitting dataset into Linear Regression model.
What is Logistic Regression
Broadly, regression can be divided into three types:
- Linear Regression
- Logistic Regression
- Polynomial Regression
Logistic regression is another type of regression. Regression is used for predictive analysis, that is, for building a predictive model: it creates a relationship (an equation) between the dependent variable and the independent variables. In logistic regression the outcome is binary, such as 0 or 1, High or Low, True or False. The regression curve is an S curve, also called a sigmoid curve. The sigmoid function is Y = 1 / (1 + e^-x). The ratio Y / (1 - Y) is the odds, and the logarithm of the odds is linear in x. The sigmoid output is a probability between 0 and 1, which is turned into a class by applying a threshold (typically 0.5). So we can say logistic regression is used to get a classified output.
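To make the S curve concrete, here is a minimal sketch of the sigmoid function in plain Python. The function name `sigmoid` and the 0.5 cut-off are illustrative choices for this example, not something fixed by the article.

```python
import math

def sigmoid(z):
    """Map any real number z to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Large negative inputs approach 0, large positive inputs approach 1,
# and an input of 0 lands exactly in the middle.
for z in (-6, -2, 0, 2, 6):
    p = sigmoid(z)
    label = 1 if p >= 0.5 else 0  # assumed 0.5 threshold for the class label
    print(f"z={z:>3}  probability={p:.3f}  class={label}")
```

Plotting these probabilities against z would trace the S-shaped curve described above.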
Difference between Linear Regression and Logistic Regression
Linear Regression Graph
Logistic Regression Graph
In Linear Regression: Y is continuous data.
In Logistic Regression: Y is discrete or binary data.
In Linear Regression: The outcome is a continuous (decimal) value.
In Logistic Regression: The outcome is a class or binary value, like True or False, High or Low, etc.
In Linear Regression: The regression line is a straight line.
In Logistic Regression: The regression curve is an S curve or sigmoid curve.
In Linear Regression: Follows the equation: Y = mX + C.
In Logistic Regression: Follows the equation: Y = 1 / (1 + e^-(mX + C)).
In Linear Regression: Examples: house price prediction, temperature prediction, etc.
In Logistic Regression: Examples: car purchase prediction, rain prediction, etc.
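The two models are closely related. As a sketch, using the same m, X and C as the linear equation and introducing z as a helper symbol (an assumption made only for this illustration), logistic regression passes the linear expression through the sigmoid:

```latex
z = mX + C, \qquad
Y = \frac{1}{1 + e^{-z}}, \qquad
\ln\!\left(\frac{Y}{1 - Y}\right) = z
```

The last form shows that the logarithm of the odds Y / (1 - Y) is a straight-line function of X, which is why logistic regression still counts as a kind of regression.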
That covers most of the basic theory of Logistic Regression. Let’s see how to implement it in Python.
Logistic Regression in Python
We are going to predict whether a patient is at risk of developing heart disease.
Here we use a dataset from Kaggle.
The dataset file is named “framingham.csv”.
It contains heart disease records.
The values in this dataset are of two types:
- Continuous: a real value
- Binary: “1” means “Yes”, “0” means “No”
This dataset’s column details are:
- male: sex of the patient (1 means male, 0 means female)
- age: Age of the patient
- currentSmoker : whether or not the patient is a current smoker
- cigsPerDay : the number of cigarettes that the person smoked on average in one day
- BPMeds : whether or not the patient was on blood pressure medication
- prevalentStroke : whether or not the patient had previously had a stroke
- prevalentHyp: whether or not the patient was hypertensive
- diabetes: whether or not the patient had diabetes
- totChol: total cholesterol level
- sysBP : systolic blood pressure
- diaBP : diastolic blood pressure
- BMI: Body Mass Index
- heartRate : heart rate
- glucose : glucose level
- TenYearCHD : 10-year risk of coronary heart disease
The following libraries are required:
- NumPy
- Pandas
- Scikit-learn (sklearn)
Let’s go for the code:
import numpy as np
import pandas as pd

data = pd.read_csv("framingham.csv")  # Import the dataset
data.sample(5)
Dataframe output Image:
Here we import the Pandas and NumPy libraries, load the “framingham.csv” dataset, and store it in the data variable as a pandas DataFrame. data.sample(5) displays five random rows.
data.drop(['education'], axis=1, inplace=True)  # Remove the 'education' column
data.shape  # Check the shape
Here we remove the “education” column, which we treat as unnecessary for this prediction.
Then we check the shape of the DataFrame.
data.isnull().sum()  # Check whether any null values are present

male                 0
age                  0
currentSmoker        0
cigsPerDay          29
BPMeds              53
prevalentStroke      0
prevalentHyp         0
diabetes             0
totChol             50
sysBP                0
diaBP                0
BMI                 19
heartRate            1
glucose            388
TenYearCHD           0
dtype: int64
Here we check whether any null values are present. It is strongly recommended not to fit a model on data that contains null/NaN values. We found that many null values are present in our dataset.
data = data.dropna()  # Remove the rows that contain null values
data.isnull().sum()   # Check whether any null values remain

male               0
age                0
currentSmoker      0
cigsPerDay         0
BPMeds             0
prevalentStroke    0
prevalentHyp       0
diabetes           0
totChol            0
sysBP              0
diaBP              0
BMI                0
heartRate          0
glucose            0
TenYearCHD         0
dtype: int64
Here we remove every row that contains a null/NaN value.
Then we check again whether any null/NaN values are present.
We don’t find any, so we can move on to the next task.
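The same check-then-drop pattern can be tried on a tiny table first. The sketch below uses a hypothetical three-column DataFrame named `toy` (made-up values, not rows from framingham.csv) so the effect of each call is easy to see.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature table with a few missing values.
toy = pd.DataFrame({
    "age": [39, 46, 48, 61],
    "cigsPerDay": [0.0, np.nan, 20.0, 30.0],
    "glucose": [77.0, 76.0, np.nan, 103.0],
})

null_counts = toy.isnull().sum()  # missing values per column
cleaned = toy.dropna()            # keep only rows with no missing value

print(null_counts)
print(cleaned.shape)
```

Dropping rows is the simplest option. Filling the gaps instead (for example with `fillna` and a column median) is a common alternative when losing rows is a concern, though the article does not use it.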
data.shape #Check the shape
We check the shape of the current dataset. We get 3,749 rows and 15 columns, which is enough to build a small predictive model.
data.dtypes  # Check the data types

male                 int64
age                  int64
currentSmoker        int64
cigsPerDay         float64
BPMeds             float64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol            float64
sysBP              float64
diaBP              float64
BMI                float64
heartRate          float64
glucose            float64
TenYearCHD           int64
dtype: object
Here we check the data type of each column. All columns must be numeric before fitting a model. Here every column already has a numeric data type, which is good for us.
data['cigsPerDay'] = data['cigsPerDay'].astype(dtype='int64')
data['BPMeds'] = data['BPMeds'].astype(dtype='int64')
data['totChol'] = data['totChol'].astype(dtype='int64')
data['heartRate'] = data['heartRate'].astype(dtype='int64')
data['glucose'] = data['glucose'].astype(dtype='int64')
data.dtypes  # Check the data types

male                 int64
age                  int64
currentSmoker        int64
cigsPerDay           int64
BPMeds               int64
prevalentStroke      int64
prevalentHyp         int64
diabetes             int64
totChol              int64
sysBP              float64
diaBP              float64
BMI                float64
heartRate            int64
glucose              int64
TenYearCHD           int64
dtype: object
We converted several float columns (cigsPerDay, BPMeds, totChol, heartRate and glucose) to the integer data type. This step is not mandatory.
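As a side note, several columns can be converted in one call by passing a dictionary to `astype`. This is a small sketch on a hypothetical two-column DataFrame named `toy`, not the article's own code.

```python
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({"cigsPerDay": [0.0, 20.0], "BMI": [26.97, 28.73]})

# Convert only the listed column; BMI keeps its decimals.
toy = toy.astype({"cigsPerDay": "int64"})
print(toy.dtypes)
```

Note that converting a float column to an integer type discards any fractional part, so it only makes sense for columns that hold whole numbers.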
X = data.iloc[:, 0:-1]  # All columns except the last one as X
y = data.iloc[:, -1]    # Only the last column as y
We create an X variable that holds all columns except the last one, and a y variable that holds only the last column (TenYearCHD).
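Positional slicing with `iloc` depends on the target being the last column. The sketch below, on a hypothetical three-column DataFrame named `toy`, shows the positional split next to an equivalent split by column name, which keeps working if the column order changes.

```python
import pandas as pd

# Hypothetical values for illustration only.
toy = pd.DataFrame({
    "age": [39, 46, 48],
    "sysBP": [106.0, 121.0, 127.5],
    "TenYearCHD": [0, 0, 1],
})

X = toy.iloc[:, :-1]  # every column except the last
y = toy.iloc[:, -1]   # only the last column

# Equivalent selection by name instead of position.
X_by_name = toy.drop(columns="TenYearCHD")
y_by_name = toy["TenYearCHD"]

print(list(X.columns))
print(y.name)
```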
from sklearn.model_selection import train_test_split

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=1)
print(X_train.shape, X_test.shape)

(2624, 14) (1125, 14)
Here we split X and y into X_train, X_test and y_train, y_test in a 70:30 ratio, and then check their shapes.
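When one class is much rarer than the other, a purely random split can leave the test set with few positive cases. The sketch below uses made-up arrays and the optional `stratify` argument of `train_test_split` to keep the class proportions similar in both parts. Stratifying is an addition for illustration and is not used in the article's code.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features, imbalanced labels (7 vs 3).
X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

print(X_train.shape, X_test.shape)
print(y_train.sum(), y_test.sum())  # positives in each part
```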
from sklearn.linear_model import LogisticRegression

l_reg = LogisticRegression()  # Create a logistic regression model
l_reg.fit(X_train, y_train)   # Fit the training data
We create a logistic regression model named l_reg and fit it on the X_train and y_train data.
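With features on very different scales, the default solver may warn that it did not converge. One common remedy, sketched below on synthetic data, is to standardize the features in a pipeline and raise `max_iter`. The data, the variable names and the value 1000 are assumptions made for this example and do not come from the article.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic one-feature data: larger values are more likely to be class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1)) * 50 + 200
y = (X[:, 0] + rng.normal(scale=20, size=200) > 200).astype(int)

# Scale the feature, then fit the logistic regression.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# predict_proba returns one column per class; column 1 is P(class = 1).
proba = model.predict_proba([[120.0], [280.0]])[:, 1]
print(proba)
```

`predict_proba` exposes the sigmoid output directly, while `predict` applies the 0.5 threshold to it.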
from sklearn import metrics

y_pred = l_reg.predict(X_test)          # Predict the labels for X_test
metrics.accuracy_score(y_test, y_pred)  # Calculate the accuracy
Here we predict the labels for the X_test data and store them in the y_pred variable. Then we check the accuracy score.
We get an accuracy score of 0.8497777777777777, meaning almost 85% of the predictions on the test set are correct, which is a reasonable result for a first model. Thank you.
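Accuracy alone can be hard to interpret when one class is much more common than the other. The sketch below uses made-up labels (not the article's results) to show that a model which always predicts the majority class can still score high accuracy, and how the confusion matrix and recall reveal this.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 85 negatives and 15 positives.
y_true = [0] * 85 + [1] * 15
# A "model" that always predicts the majority class.
y_always_zero = [0] * 100

acc = accuracy_score(y_true, y_always_zero)  # share of correct predictions
rec = recall_score(y_true, y_always_zero)    # share of positives that were found
cm = confusion_matrix(y_true, y_always_zero) # rows: true class, columns: predicted

print(acc, rec)
print(cm)
```

Running the same two extra metrics on y_test and y_pred is a quick way to see how many actual positive cases the fitted model detects.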
The whole program is available here: Logistic regression (download from here)
You may also like to read:
- Fitting dataset into Linear Regression model
- A brief understanding on supervised learning – Machine Learning