Prediction Intervals in Python using Machine Learning
Today we’ll learn how to get prediction intervals in Python using machine learning. Every machine learning model’s predictions come with some error, and a prediction interval gives the approximate range in which the true value is likely to lie. The library we’ll use to build prediction intervals here is Scikit-Learn.
So let’s get coding!
Getting the dataset for prediction intervals in Python
First, we import pandas and read the .csv file of the dataset. You can get the dataset I used from this link. Then we take a look at the first few rows using the df.head() method. In this dataset, our aim is to predict an employee’s salary from their years of experience.
import pandas as pd

# Load the salary dataset and preview the first few rows
df = pd.read_csv("Salary.csv")
df.head()
Output:
| | YearsExperience | Salary |
|---|---|---|
| 0 | 1.1 | 39343 |
| 1 | 1.3 | 46205 |
| 2 | 1.5 | 37731 |
| 3 | 2.0 | 43525 |
| 4 | 2.2 | 39891 |
Splitting the dataset
We now have to split the dataset into training and testing data. We make use of the train_test_split() function from the sklearn.model_selection module.
from sklearn.model_selection import train_test_split

# Hold out 20% of the rows for testing
Xtrain, Xtest, ytrain, ytest = train_test_split(df["YearsExperience"], df["Salary"], test_size=0.2)
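Note that train_test_split() shuffles the rows randomly, so the exact test set will differ between runs. If you want a reproducible split, you can pass a fixed random_state (an optional tweak, not required for the rest of this post):

# Optional: fix the random seed so the same rows land in the test set every run
Xtrain, Xtest, ytrain, ytest = train_test_split(
    df["YearsExperience"], df["Salary"], test_size=0.2, random_state=42
)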
Building the model
Now, we need to train our models. We will use the GradientBoostingRegressor() class from the sklearn.ensemble module. To know more about GradientBoostingRegressor(), visit its documentation. We’re defining two models, one for the lower quantile and one for the upper quantile. With loss="quantile", the alpha parameter sets the quantile each model predicts, so m1 will predict the 10th percentile of salary and m2 the 60th percentile.
from sklearn.ensemble import GradientBoostingRegressor

# Lower-quantile model (10th percentile) and upper-quantile model (60th percentile)
m1 = GradientBoostingRegressor(loss="quantile", alpha=0.1)
m2 = GradientBoostingRegressor(loss="quantile", alpha=0.6)
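The quantiles used here (0.1 and 0.6) are just the ones chosen for this example. If you wanted a wider, roughly symmetric 90% interval instead, you could pick alpha values of 0.05 and 0.95. This is only a sketch of the variation; we don’t use these models in the rest of the post:

# Hypothetical alternative: models for a roughly symmetric 90% prediction interval
lower_90 = GradientBoostingRegressor(loss="quantile", alpha=0.05)
upper_90 = GradientBoostingRegressor(loss="quantile", alpha=0.95)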
We then fit these two models with the training data. Scikit-Learn expects the features as a 2D array, so we first convert the pandas columns to NumPy arrays and reshape the years of experience into a single-column 2D array. The target salaries can stay as a 1D array.
import numpy as np

# Reshape the features to a 2D array of shape (n_samples, 1) before fitting
Xtrain_2d = np.reshape(np.array(Xtrain), (-1, 1))
m1.fit(Xtrain_2d, np.array(ytrain))
m2.fit(Xtrain_2d, np.array(ytrain))
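With both models fitted, we can already query them for a quick sanity check. For example, here is a rough sketch of how you could get the predicted salary range for a hypothetical employee with 5 years of experience (the value 5.0 is purely illustrative and not part of the walkthrough):

# Hypothetical example: predicted salary range for 5 years of experience
years = np.array([[5.0]])    # 2D array with a single sample
low = m1.predict(years)[0]   # 10th-percentile prediction
high = m2.predict(years)[0]  # 60th-percentile prediction
print(f"Predicted salary range: {low:.0f} to {high:.0f}")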
Testing the model
At this point, we have completed the training of our models. Now, let’s test them. We create a new DataFrame pred that holds the actual salaries, i.e. our target values. Then we add the lower quantile and upper quantile values that we predicted for the test set.
# DataFrame with the actual salaries and the predicted lower/upper quantiles
pred = pd.DataFrame(ytest)
pred["lower quantile"] = m1.predict(np.reshape(np.array(Xtest), (-1, 1)))
pred["upper quantile"] = m2.predict(np.reshape(np.array(Xtest), (-1, 1)))
pred
Output:
| | Salary | lower quantile | upper quantile |
|---|---|---|---|
| 11 | 55794 | 56920.534822 | 58796.804179 |
| 23 | 113812 | 99888.378505 | 101340.774522 |
| 25 | 105582 | 99888.378505 | 109418.091037 |
| 15 | 67938 | 66028.628587 | 66030.115014 |
| 18 | 81363 | 91775.156479 | 93940.000830 |
| 29 | 121872 | 99888.378505 | 122537.665812 |
| 5 | 56642 | 54619.305749 | 59532.025317 |
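Looking at the table, some actual salaries fall inside their predicted range and some don’t (row 18, for example, falls below its lower quantile). One simple way to judge the interval is to count how many test points land between the lower and upper predictions. Here is a minimal sketch of that check, assuming the pred DataFrame built above:

# Fraction of test points whose actual salary falls inside the predicted interval
inside = (pred["Salary"] >= pred["lower quantile"]) & (pred["Salary"] <= pred["upper quantile"])
print("Coverage:", inside.mean())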
Visualizing prediction intervals in Python
To better understand the prediction values we got, we’ll plot them on a graph.
import matplotlib.pyplot as plt

# Actual salaries in red, lower quantile predictions in blue, upper quantile in green
plt.plot(Xtest, pred["Salary"], 'o', color='red')
plt.plot(Xtest, pred["lower quantile"], 'o', color='blue')
plt.plot(Xtest, pred["upper quantile"], 'o', color='green')
plt.show()
Output: (a scatter plot of the test data, with actual salaries in red between the lower quantile predictions in blue and the upper quantile predictions in green)
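The raw scatter plot works, but it is easier to read with axis labels and a legend. Here is an optional version of the same plot with those additions (the label text is just illustrative):

# Optional: label the axes and add a legend so the three point sets are easy to tell apart
plt.plot(Xtest, pred["Salary"], 'o', color='red', label='Actual salary')
plt.plot(Xtest, pred["lower quantile"], 'o', color='blue', label='Lower quantile')
plt.plot(Xtest, pred["upper quantile"], 'o', color='green', label='Upper quantile')
plt.xlabel("Years of experience")
plt.ylabel("Salary")
plt.legend()
plt.show()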
Also, check out other machine learning programs:
KNN Classification using Scikit-Learn in Python
Predicting insurance using Scikit-Learn in Python
Predicting next number in a sequence with Scikit-Learn in Python