Prediction Intervals in Python using Machine learning

Today we’ll learn how to get prediction intervals in Python using machine learning. Every prediction a machine learning model makes carries some error, so a single predicted number rarely matches the actual value exactly. A prediction interval is a range that is expected to contain the actual value with a chosen probability. Here, we’ll build prediction intervals with the scikit-learn library by training models that predict quantiles of the target.

So let’s get coding!

Getting dataset for prediction intervals in Python

First, we import pandas and read the .csv file of the dataset. Get the dataset I used from this link. We then look at the first few rows of the dataset using the df.head() method. With this dataset, our aim is to predict an employee’s salary from their years of experience.

import pandas as pd
df=pd.read_csv("Salary.csv")
df.head()

Output:

	YearsExperience	Salary
0	1.1	39343
1	1.3	46205
2	1.5	37731
3	2.0	43525
4	2.2	39891

Splitting the dataset

We now have to split the dataset into training and testing data. We use the train_test_split() function from the sklearn.model_selection module, keeping 20% of the rows for testing.

from sklearn.model_selection import train_test_split
Xtrain,Xtest,ytrain,ytest=train_test_split(df["YearsExperience"],df["Salary"],test_size=0.2)
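Because train_test_split() shuffles the rows, each run of the line above can produce a different split. The sketch below shows one way to make the split repeatable by passing random_state. It uses a small synthetic table as a stand-in for Salary.csv (the values, the name demo, and the seed 42 are assumptions made for illustration, not part of the original dataset).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for Salary.csv (hypothetical values, 30 rows).
rng = np.random.default_rng(0)
years = np.round(np.linspace(1.0, 10.5, 30), 1)
salary = 30000 + 9000 * years + rng.normal(0, 5000, size=30)
demo = pd.DataFrame({"YearsExperience": years, "Salary": salary})

# random_state fixes the shuffle, so the same rows land in the test set on every run.
# Selecting with double brackets keeps the features as a 2D DataFrame.
X_train, X_test, y_train, y_test = train_test_split(
    demo[["YearsExperience"]], demo["Salary"], test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Selecting the feature column with double brackets returns a one-column DataFrame rather than a Series, which is already two-dimensional and can be passed to a scikit-learn estimator without reshaping.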

Building the model

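Now, we need to train our model. We will use the GradientBoostingRegressor class from the sklearn.ensemble module. To know more about GradientBoostingRegressor, visit its documentation. With loss="quantile", the model predicts the quantile given by alpha instead of the mean. We define two models, one for the lower quantile (alpha=0.1) and one for the upper quantile (alpha=0.6), so the range between them nominally covers 50% of the outcomes. For a wider, symmetric 80% interval you could use alpha=0.1 and alpha=0.9 instead.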

from sklearn.ensemble import GradientBoostingRegressor
m1 = GradientBoostingRegressor(loss="quantile",alpha=0.1)
m2 = GradientBoostingRegressor(loss="quantile",alpha=0.6)
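To see why the alpha parameter steers each model toward a different quantile, it helps to look at the quantile (pinball) loss that loss="quantile" minimizes. The function below is a minimal NumPy sketch of that loss, written for illustration; the name pinball_loss and the salary values are assumptions, not code from scikit-learn.

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile (pinball) loss for a target quantile alpha in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-predictions (diff > 0) cost alpha per unit of error,
    # over-predictions (diff < 0) cost (1 - alpha) per unit of error.
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1) * diff)))

# Hypothetical salaries, used only to illustrate the asymmetry.
y = [40000, 50000, 60000]
too_low = [35000, 45000, 55000]   # every prediction is 5000 below the target
too_high = [45000, 55000, 65000]  # every prediction is 5000 above the target

low_loss = pinball_loss(y, too_low, alpha=0.1)    # 0.1 * 5000 per sample
high_loss = pinball_loss(y, too_high, alpha=0.1)  # 0.9 * 5000 per sample
print(low_loss, high_loss)
```

With alpha=0.1, predicting too high is penalized nine times more than predicting too low, so the fitted model settles near the bottom of the salary distribution. A larger alpha reverses the balance and pushes the predictions upward.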

We then fit these two models on the training data. scikit-learn expects the features as a 2D array with one column per feature, so we first convert Xtrain, which is a pandas Series, to a NumPy array and reshape it into a single column. The target stays one-dimensional; passing it as a column vector makes scikit-learn emit a DataConversionWarning.

import numpy as np
Xtrain_2d = np.reshape(np.array(Xtrain), (-1, 1))   # shape (n_samples, 1)
m1.fit(Xtrain_2d, np.array(ytrain))                 # lower quantile model
m2.fit(Xtrain_2d, np.array(ytrain))                 # upper quantile model

Testing the model

At this point, we have completed the training of our models. Now, let’s test them. We create a new DataFrame “pred” which holds the actual salary, i.e. our target values. Then we add the lower and upper quantile values that we predicted. Note that the columns are labelled “quartile” in the code, but they hold the predictions for the 0.1 and 0.6 quantiles.

pred=pd.DataFrame(ytest)      #Actual value
pred["lower quartile"]=m1.predict(np.reshape(np.array(Xtest),(-1,1)))
pred["upper quartile"]=m2.predict(np.reshape(np.array(Xtest),(-1,1)))

pred

Output:

	Salary	lower quartile	upper quartile
11	55794	56920.534822	58796.804179
23	113812	99888.378505	101340.774522
25	105582	99888.378505	109418.091037
15	67938	66028.628587	66030.115014
18	81363	91775.156479	93940.000830
29	121872	99888.378505	122537.665812
5	56642	54619.305749	59532.025317
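A quick way to judge an interval is to count how often the actual value falls inside it. The sketch below does that for the seven test rows in the table above; the variable names are illustrative, and with so few rows the result is only a rough indication.

```python
# Values copied from the output table above: (actual, lower, upper).
rows = [
    (55794, 56920.534822, 58796.804179),
    (113812, 99888.378505, 101340.774522),
    (105582, 99888.378505, 109418.091037),
    (67938, 66028.628587, 66030.115014),
    (81363, 91775.156479, 93940.000830),
    (121872, 99888.378505, 122537.665812),
    (56642, 54619.305749, 59532.025317),
]

# True where the actual salary lies between the two predicted quantiles.
inside = [lower <= actual <= upper for actual, lower, upper in rows]
coverage = sum(inside) / len(rows)

# Average width of the predicted range, in the same units as Salary.
widths = [upper - lower for _, lower, upper in rows]
mean_width = sum(widths) / len(widths)

print(f"covered {sum(inside)} of {len(rows)} rows, coverage = {coverage:.2f}")
print(f"mean interval width = {mean_width:.0f}")
```

Here 3 of the 7 actual salaries fall inside their predicted range. Several intervals are very narrow, which is plausible for a small dataset and for quantiles as close together as 0.1 and 0.6; a wider pair of quantiles would normally capture more of the actual values.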

Visualizing: prediction intervals in Python

To make the predicted values easier to interpret, we’ll plot them on a graph together with the actual salaries.

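import matplotlib.pyplot as plt
plt.plot(Xtest,pred["Salary"],'o',color='red',label='actual salary')
plt.plot(Xtest,pred["lower quartile"],'o',color='blue',label='lower quantile (alpha=0.1)')
plt.plot(Xtest,pred["upper quartile"],'o',color='green',label='upper quantile (alpha=0.6)')
plt.xlabel("YearsExperience")
plt.ylabel("Salary")
plt.legend()
plt.show()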

Output:

[Figure: scatter plot of the actual salary (red) and the predicted lower (blue) and upper (green) quantile values against years of experience]