Prediction Intervals in Python using Machine learning

Today we’ll learn how to get prediction intervals in Python using machine learning. Every prediction a machine learning model makes carries some error, so a single predicted number rarely matches the true value exactly. A prediction interval is the range within which the true value is expected to fall with a given probability. The library we’ll use here is Scikit-Learn.

So let’s get coding!

Getting dataset for prediction intervals in Python

First, we import pandas and read the .csv file of the dataset. Get the dataset I used from this link. We then look at the first few rows using the df.head() method. Our aim with this dataset is to predict an employee’s salary from their years of experience.

import pandas as pd
df=pd.read_csv("Salary.csv")
df.head()

Output:

   YearsExperience  Salary
0              1.1   39343
1              1.3   46205
2              1.5   37731
3              2.0   43525
4              2.2   39891

 

Splitting the dataset

We now split the dataset into training and testing data using the train_test_split() function from the sklearn.model_selection module. With test_size=0.2, 20% of the rows are held out for testing.

from sklearn.model_selection import train_test_split
Xtrain,Xtest,ytrain,ytest=train_test_split(df["YearsExperience"],df["Salary"],test_size=0.2)
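
train_test_split() shuffles the rows randomly, so each run yields a different split and slightly different intervals. Below is a minimal sketch of a reproducible split, assuming a small made-up DataFrame in place of Salary.csv and an arbitrary seed of 42 (neither comes from the dataset above):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for Salary.csv; the values are made up for illustration.
demo = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [40000, 45000, 52000, 58000, 66000, 71000, 80000, 86000, 95000, 101000],
})

def split(seed):
    # random_state fixes the shuffle, so the same seed gives the same rows.
    return train_test_split(
        demo["YearsExperience"], demo["Salary"], test_size=0.2, random_state=seed
    )

Xtrain_a, Xtest_a, ytrain_a, ytest_a = split(42)
Xtrain_b, Xtest_b, ytrain_b, ytest_b = split(42)

print(len(Xtrain_a), len(Xtest_a))            # sizes of the two parts
print(Xtest_a.index.equals(Xtest_b.index))    # same test rows on both calls
```

Fixing the seed is optional, but it makes the numbers in a tutorial like this one repeatable.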

Building the model

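Now we need to train our model. We will use the GradientBoostingRegressor class from the sklearn.ensemble module. To know more about GradientBoostingRegressor, visit its documentation. With loss="quantile", the model predicts a conditional quantile of the target instead of its mean, and alpha selects which quantile. We define two models: m1 for the lower quantile (alpha=0.1, the 10th percentile) and m2 for the upper quantile (alpha=0.6, the 60th percentile). If both models were perfectly calibrated, about 50% of the actual values would fall between the two predictions.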

from sklearn.ensemble import GradientBoostingRegressor
m1 = GradientBoostingRegressor(loss="quantile",alpha=0.1)
m2 = GradientBoostingRegressor(loss="quantile",alpha=0.6)
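
The quantile loss selected by loss="quantile" is often called the pinball loss. For a quantile level alpha, it charges alpha per unit when the prediction is too low and (1 - alpha) per unit when it is too high, which pushes the fitted curve toward the alpha-quantile. Here is a minimal NumPy sketch of this standard formula (the function name pinball_loss is only illustrative, and this is not Scikit-Learn's internal code):

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss for a quantile level alpha in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs alpha per unit,
    # over-prediction (diff < 0) costs (1 - alpha) per unit.
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1) * diff)))

# With alpha=0.9, predicting too low is penalised nine times more than too high.
low = pinball_loss([10.0], [8.0], alpha=0.9)    # under-prediction by 2
high = pinball_loss([10.0], [12.0], alpha=0.9)  # over-prediction by 2
print(low, high)
```

Because the penalty is lopsided, a model trained with a small alpha settles near the bottom of the data and one trained with a large alpha settles near the top.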

We then fit these two models on the training data. Scikit-Learn expects the features as a 2D array of shape (n_samples, n_features), so we convert the pandas Series Xtrain to a NumPy array and reshape it into a single column. The target ytrain should stay one-dimensional; passing it as a column vector can make Scikit-Learn emit a DataConversionWarning.

import numpy as np
Xtrain_2d=np.reshape(np.array(Xtrain),(-1,1))   #Features as a single column
m1.fit(Xtrain_2d,np.array(ytrain))              #Target stays 1D
m2.fit(Xtrain_2d,np.array(ytrain))
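
The same two-model pattern can also target a symmetric interval. The following is a minimal sketch, assuming synthetic data and the 5th and 95th percentiles for a nominal 90% interval (both are assumptions for illustration, not the settings used in this article):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic data (an assumption): salary grows linearly with experience plus noise.
rng = np.random.default_rng(0)
years = rng.uniform(1, 10, size=200)
salary = 30000 + 9000 * years + rng.normal(0, 5000, size=200)

X = years.reshape(-1, 1)   # features must be 2D: (n_samples, n_features)

# One model per quantile; 0.05 and 0.95 bound a nominal 90% interval.
lower_model = GradientBoostingRegressor(loss="quantile", alpha=0.05, random_state=0)
upper_model = GradientBoostingRegressor(loss="quantile", alpha=0.95, random_state=0)
lower_model.fit(X, salary)
upper_model.fit(X, salary)

lower = lower_model.predict(X)
upper = upper_model.predict(X)

# Share of training points inside the interval (optimistic, since it is in-sample).
inside = np.mean((salary >= lower) & (salary <= upper))
print(f"in-sample coverage: {inside:.2f}")
```

The two models are trained independently, so nothing forces the lower prediction to stay below the upper one at every point.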

Testing the model

At this point, we have finished training our models. Now, let’s test them. We create a new DataFrame “pred” that holds the actual salaries, i.e. our target values, and then add the lower and upper quantile predictions for the test set. The column names below say “quartile”, but the values are the predicted 0.1 and 0.6 quantiles.

pred=pd.DataFrame(ytest)      #Actual value
pred["lower quartile"]=m1.predict(np.reshape(np.array(Xtest),(-1,1)))
pred["upper quartile"]=m2.predict(np.reshape(np.array(Xtest),(-1,1)))
pred

Output:

    Salary  lower quartile  upper quartile
11   55794    56920.534822    58796.804179
23  113812    99888.378505   101340.774522
25  105582    99888.378505   109418.091037
15   67938    66028.628587    66030.115014
18   81363    91775.156479    93940.000830
29  121872    99888.378505   122537.665812
5    56642    54619.305749    59532.025317
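
A quick way to judge an interval is its coverage, the share of actual values that fall between the two bounds. This minimal sketch rebuilds the table above by hand (the name pred_demo is only illustrative) and counts the covered rows:

```python
import pandas as pd

# Values copied from the output table above.
pred_demo = pd.DataFrame(
    {
        "Salary": [55794, 113812, 105582, 67938, 81363, 121872, 56642],
        "lower quartile": [56920.534822, 99888.378505, 99888.378505, 66028.628587,
                           91775.156479, 99888.378505, 54619.305749],
        "upper quartile": [58796.804179, 101340.774522, 109418.091037, 66030.115014,
                           93940.000830, 122537.665812, 59532.025317],
    },
    index=[11, 23, 25, 15, 18, 29, 5],
)

# True where the actual salary lies between the two predicted bounds.
inside = pred_demo["Salary"].between(
    pred_demo["lower quartile"], pred_demo["upper quartile"]
)
covered = int(inside.sum())
print(f"{covered} of {len(pred_demo)} salaries fall inside the interval")
```

Here 3 of the 7 test salaries (about 43%) fall inside the interval, which is near the nominal 50% for the 0.1 and 0.6 quantiles, although seven rows are too few to judge calibration.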

Visualizing: prediction intervals in Python

To better understand the predictions we got, we’ll plot them against years of experience.

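import matplotlib.pyplot as plt
plt.plot(Xtest,pred["Salary"],'o',color='red',label='Actual salary')
plt.plot(Xtest,pred["lower quartile"],'o',color='blue',label='Lower quantile')
plt.plot(Xtest,pred["upper quartile"],'o',color='green',label='Upper quantile')
plt.xlabel("YearsExperience")
plt.ylabel("Salary")
plt.legend()
plt.show()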

Output:

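[Figure: Prediction Intervals in Python using Machine learning. Scatter plot of the test set against years of experience, with actual salaries in red, lower quantile predictions in blue, and upper quantile predictions in green.]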
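
The interval can also be drawn as a shaded band between the two bounds instead of separate points. This is a minimal sketch, assuming a hypothetical helper named plot_interval and made-up numbers for the example call:

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_interval(x, actual, lower, upper):
    """Draw actual values as points and the interval as a shaded band."""
    x, actual = np.asarray(x), np.asarray(actual)
    lower, upper = np.asarray(lower), np.asarray(upper)
    order = np.argsort(x)   # fill_between needs x sorted to draw a clean band
    fig, ax = plt.subplots()
    ax.fill_between(x[order], lower[order], upper[order],
                    alpha=0.3, label="Prediction interval")
    ax.plot(x[order], actual[order], "o", color="red", label="Actual salary")
    ax.set_xlabel("YearsExperience")
    ax.set_ylabel("Salary")
    ax.legend()
    return ax

# Hypothetical values, only to show the call.
ax = plot_interval(
    x=[5.0, 1.5, 3.0],
    actual=[66000, 41000, 52000],
    lower=[60000, 36000, 47000],
    upper=[72000, 45000, 58000],
)
```

With the variables from this article, the call would be plot_interval(Xtest, pred["Salary"], pred["lower quartile"], pred["upper quartile"]), followed by plt.show().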

Also, check out other machine learning programs:

KNN Classification using Scikit-Learn in Python

Predicting insurance using Scikit-Learn in Python

Predicting next number in a sequence with Scikit-Learn in Python

Image Classification using Keras in TensorFlow Backend
