Prediction Intervals in Python using Machine learning

Today we'll learn how to get prediction intervals in Python using machine learning. Every prediction a machine learning model makes carries some error. A prediction interval is the range within which the true value is expected to fall with a given probability. The library we'll use here is scikit-learn.

So let’s get coding!

Getting dataset for prediction intervals in Python

First, we import pandas and read the .csv file of the dataset. Get the dataset I used from this link. We then look at the first few rows using the df.head() method. With this dataset, our aim is to predict an employee's salary from their years of experience.

import pandas as pd
df=pd.read_csv("Salary.csv")
df.head()

Output:

   YearsExperience  Salary
0              1.1   39343
1              1.3   46205
2              1.5   37731
3              2.0   43525
4              2.2   39891

 

Splitting the dataset

We now have to split the dataset into training and testing data. We make use of the train_test_split() method of the sklearn.model_selection module.

from sklearn.model_selection import train_test_split
Xtrain,Xtest,ytrain,ytest=train_test_split(df["YearsExperience"],df["Salary"],test_size=0.2)
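
Because train_test_split() shuffles the rows randomly, every run gives a different split and therefore different numbers. A minimal sketch of a reproducible split, assuming a small made-up DataFrame in place of Salary.csv and an arbitrary random_state of 42 (neither comes from the original code):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for Salary.csv (values invented for illustration)
toy = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "Salary": [40000, 45000, 50000, 55000, 60000, 65000, 70000, 75000, 80000, 85000],
})

def split(frame, seed=42):
    # Fixing random_state makes the shuffle, and so the split, repeatable
    return train_test_split(
        frame["YearsExperience"], frame["Salary"], test_size=0.2, random_state=seed
    )

X_tr, X_te, y_tr, y_te = split(toy)
print(len(X_tr), len(X_te))
```

Calling split(toy) again with the same seed returns the same rows in the test set, which makes results easier to compare between runs.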

Building the model

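Now we need to train our model. We will use the GradientBoostingRegressor class from the sklearn.ensemble module. To know more about GradientBoostingRegressor, visit its documentation. With loss="quantile", the alpha parameter sets which quantile the model predicts. We define two models: m1 for the lower quantile (alpha=0.1) and m2 for the upper quantile (alpha=0.6). Together they bound an interval that nominally contains about 50% of the salaries; a wider, symmetric interval would use a pair such as 0.1 and 0.9.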

from sklearn.ensemble import GradientBoostingRegressor
m1 = GradientBoostingRegressor(loss="quantile",alpha=0.1)
m2 = GradientBoostingRegressor(loss="quantile",alpha=0.6)
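
With loss="quantile", the regressor is trained on the quantile (pinball) loss, which penalizes under-predictions with weight alpha and over-predictions with weight 1 - alpha. A minimal NumPy sketch of that loss follows; the function name pinball_loss and the sample values are illustrative assumptions, not part of the code above:

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile (pinball) loss for a quantile level alpha in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs alpha * diff,
    # over-prediction (diff < 0) costs (1 - alpha) * |diff|
    return float(np.mean(np.maximum(alpha * diff, (alpha - 1) * diff)))

# Made-up target values for illustration
y = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# For a low alpha, a low constant prediction scores better than a high one
low_guess = pinball_loss(y, np.full_like(y, 10.0), alpha=0.1)
high_guess = pinball_loss(y, np.full_like(y, 50.0), alpha=0.1)
print(low_guess, high_guess)
```

This asymmetry is what pushes the alpha=0.1 model towards the lower end of the salaries and a model with a larger alpha towards the upper end. Recent scikit-learn versions also provide sklearn.metrics.mean_pinball_loss for the same purpose.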

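We then fit these two models on the training data. Xtrain is a pandas Series, while scikit-learn expects the features as a 2D array of shape (n_samples, n_features), so we convert it to a NumPy array and reshape it into a single column. The target ytrain can stay one-dimensional; scikit-learn flattens a column-vector target itself and may warn about it.

import numpy as np
# Features must be 2D: one row per sample, one column per feature
Xtrain_2d = np.reshape(np.array(Xtrain), (-1, 1))
# The target is passed as a 1D Series
m1.fit(Xtrain_2d, ytrain)
m2.fit(Xtrain_2d, ytrain)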
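
The reshaping step can be written in a few equivalent ways. A short sketch, assuming a made-up Series named years (not part of the dataset code above), showing three routes to the (n_samples, 1) shape that scikit-learn expects for features:

```python
import numpy as np
import pandas as pd

# Hypothetical feature column (values invented for illustration)
years = pd.Series([1.1, 1.3, 1.5, 2.0, 2.2], name="YearsExperience")

# Option 1: np.reshape with -1 rows ("as many as needed") and 1 column
X_a = np.reshape(np.array(years), (-1, 1))

# Option 2: the same thing with pandas/NumPy methods
X_b = years.to_numpy().reshape(-1, 1)

# Option 3: keep a one-column DataFrame, which is already 2D
X_c = years.to_frame()

print(X_a.shape, X_b.shape, X_c.shape)
```

All three can be passed to fit() and predict(); which one to use is mostly a matter of taste.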

Testing the model

At this point, we have finished training our models. Now, let's test them. We create a new DataFrame, pred, that holds the actual salaries, i.e. our target values. Then we add the lower and upper quantile predictions for the test set. (The columns are labelled "quartile" in the code, but they hold the 0.1 and 0.6 quantile predictions.)

pred=pd.DataFrame(ytest)      #Actual value
pred["lower quartile"]=m1.predict(np.reshape(np.array(Xtest),(-1,1)))
pred["upper quartile"]=m2.predict(np.reshape(np.array(Xtest),(-1,1)))
pred

Output:

    Salary  lower quartile  upper quartile
11   55794    56920.534822    58796.804179
23  113812    99888.378505   101340.774522
25  105582    99888.378505   109418.091037
15   67938    66028.628587    66030.115014
18   81363    91775.156479    93940.000830
29  121872    99888.378505   122537.665812
5    56642    54619.305749    59532.025317
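
One quick check on an interval is its empirical coverage: the share of actual values that fall between the two predicted bounds. A minimal sketch using the seven rows from the output above (the variable names inside and coverage are my own):

```python
import pandas as pd

# The seven test rows from the output above
pred = pd.DataFrame(
    {
        "Salary": [55794, 113812, 105582, 67938, 81363, 121872, 56642],
        "lower quartile": [56920.534822, 99888.378505, 99888.378505, 66028.628587,
                           91775.156479, 99888.378505, 54619.305749],
        "upper quartile": [58796.804179, 101340.774522, 109418.091037, 66030.115014,
                           93940.000830, 122537.665812, 59532.025317],
    },
    index=[11, 23, 25, 15, 18, 29, 5],
)

# True where the actual salary lies inside [lower, upper]
inside = pred["Salary"].between(pred["lower quartile"], pred["upper quartile"])
coverage = inside.mean()
print(f"{inside.sum()} of {len(pred)} rows covered ({coverage:.0%})")
```

Here three of the seven salaries fall inside their interval. Seven rows are far too few to judge the models, but the same check on a larger test set shows whether the intervals are too narrow or too wide.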

Visualizing: prediction intervals in Python

To make the predicted values easier to interpret, we'll plot them on a graph.

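import matplotlib.pyplot as plt
plt.plot(Xtest, pred["Salary"], 'o', color='red', label='actual salary')
plt.plot(Xtest, pred["lower quartile"], 'o', color='blue', label='lower quantile')
plt.plot(Xtest, pred["upper quartile"], 'o', color='green', label='upper quantile')
plt.xlabel("YearsExperience")
plt.ylabel("Salary")
plt.legend()
plt.show()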

Output:

[Figure: actual salaries (red) with the lower (blue) and upper (green) quantile predictions, plotted against years of experience]
