Linear Regression from scratch in Python
In this tutorial, we will implement linear regression from scratch in Python, without using a ready-made regression routine from a machine learning library. pandas, NumPy, and Matplotlib are used only to load the data, take a square root, and draw the plot.
Linear regression models the relationship between an independent input variable and a dependent output variable, and it is appropriate when the output varies approximately linearly with the input. In this tutorial, we consider a simple case where the input is the number of hours studied and the output is the marks obtained, and we use linear regression to predict the output for new input data.
Mathematical Background
For the simplest form of a linear regression model with one dependent and one independent variable we can write,
y = a*x+b
where,
- y -> dependent variable
- x -> independent variable
- a -> slope (regression coefficient)
- b -> intercept (regression constant)
So, we can find the values of a and b from the given data using the formulas,
a = sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))**2)
b = mean(y) - a*mean(x)
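These formulas come from the least squares method, which chooses a and b so that the sum of squared differences between the observed and predicted values of y is as small as possible. The complete derivation can be found here. Now, we are set for a step-by-step implementation of the linear regression algorithm in Python using the above formulas.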
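To make the two formulas concrete, here is a small worked example. The four data points are invented for illustration (they are not from the tutorial's dataset) and lie exactly on the line y = 2*x + 1, so the formulas should recover a = 2 and b = 1.

```python
# Toy data invented for illustration; it lies exactly on y = 2*x + 1.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

mean_x = sum(x) / len(x)   # 2.5
mean_y = sum(y) / len(y)   # 6.0

# Numerator: sum of the products of the deviations from the means.
numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
# Denominator: sum of the squared deviations of x.
denominator = sum((xi - mean_x) ** 2 for xi in x)

a = numerator / denominator   # slope
b = mean_y - a * mean_x       # intercept
print(a, b)  # 2.0 1.0
```

Note that the square is applied to each deviation before summing. Squaring the sum of the deviations instead would always give zero, because deviations from the mean cancel out.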
1. Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
2. Importing the dataset
Let’s import the dataset and split it into training and test sets. The first 20 rows are used for training and the remaining rows for testing.
data = pd.read_csv("data.csv")

# converting columns of the data frame to lists
x = data["Hours"].values.tolist()
y = data["Marks"].values.tolist()

# splitting them into train and test data
X_train = x[:20]
Y_train = y[:20]
X_test = x[20:]
Y_test = y[20:]

data.head()
Output:
| | Hours | Marks |
|---|---|---|
| 0 | 2.5 | 21 |
| 1 | 5.1 | 47 |
| 2 | 3.2 | 27 |
| 3 | 8.5 | 75 |
| 4 | 3.5 | 30 |
The dataset can be found here: data
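If data.csv is not at hand, the column-to-list step can still be tried out on the five preview rows shown above. The sketch below uses only the standard library; `SAMPLE_CSV` and `load_columns` are hypothetical names introduced for this illustration, and the real file is assumed to have the same two columns, Hours and Marks.

```python
import csv
import io

# The five rows shown by data.head() above, written out as CSV text.
# Assumption: the real data.csv uses the same two column names.
SAMPLE_CSV = """Hours,Marks
2.5,21
5.1,47
3.2,27
8.5,75
3.5,30
"""

def load_columns(text):
    """Return (hours, marks) as two lists of floats (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(text))
    hours, marks = [], []
    for row in reader:
        hours.append(float(row["Hours"]))
        marks.append(float(row["Marks"]))
    return hours, marks

x_sample, y_sample = load_columns(SAMPLE_CSV)
print(x_sample)  # [2.5, 5.1, 3.2, 8.5, 3.5]
print(y_sample)  # [21.0, 47.0, 27.0, 75.0, 30.0]
```

With only five rows, slicing at index 20 would leave the test lists empty, so this sample is useful for checking the loading step rather than for reproducing the results in this tutorial.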
3. Creating the model
Let’s code out the linear regression algorithm with reference to the above equations.
# Returns the mean of a list of numbers
def mean(x):
    return sum(x)/len(x)

# Finds the coefficients a, b for the given training data
def Coefficients(X_train,Y_train):
    '''
    input: X_train, Y_train lists
    output: a, b floats (slope and intercept)
    '''
    N = len(X_train)
    # the means appear in every term, so compute them once
    x_mean = mean(X_train)
    y_mean = mean(Y_train)
    nr = 0  # numerator
    dr = 0  # denominator
    for i in range(N):
        nr += (X_train[i]-x_mean)*(Y_train[i]-y_mean)
        dr += (X_train[i]-x_mean)**2
    a = nr/dr
    b = y_mean-a*x_mean
    return a,b

# Predicts the output for the test input
def predict(X_train,Y_train,X_test):
    '''
    input: X_train, Y_train, X_test lists
    output: Y_pred list
    '''
    a,b = Coefficients(X_train,Y_train)
    Y_pred = []
    for val in X_test:
        y_pred = (a*val)+b
        Y_pred.append(y_pred)
    return Y_pred

# Predicting output for test input
Y_pred = predict(X_train,Y_train,X_test)
out = {'Y_test':Y_test,'Y_pred':Y_pred}
predict_out = pd.DataFrame(out)
# Displaying output as a DataFrame for easier comparison between Y_test and Y_pred
predict_out
Output:
| Y_test | Y_pred |
|---|---|
| 30 | 28.736325 |
| 54 | 48.729136 |
| 35 | 39.208750 |
| 76 | 68.721947 |
| 86 | 77.290295 |
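One way to gain confidence in the coefficient formulas is to compare them with the algebraically equivalent "raw sums" form, a = (N*sum(x*y) - sum(x)*sum(y)) / (N*sum(x**2) - sum(x)**2). The sketch below does this on the five preview rows of the dataset; `coefficients_centered` and `coefficients_raw_sums` are hypothetical names used only here, and the fit on five rows is not expected to match the fit on the 20 training rows.

```python
def coefficients_centered(x, y):
    """Slope and intercept via deviations from the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    nr = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    dr = sum((xi - mean_x) ** 2 for xi in x)
    a = nr / dr
    return a, mean_y - a * mean_x

def coefficients_raw_sums(x, y):
    """Same line, computed from the algebraically equivalent raw-sums form."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    return a, (sy - a * sx) / n

# The five preview rows of the dataset shown earlier.
hours = [2.5, 5.1, 3.2, 8.5, 3.5]
marks = [21, 47, 27, 75, 30]

a1, b1 = coefficients_centered(hours, marks)
a2, b2 = coefficients_raw_sums(hours, marks)
# Both forms should agree up to floating-point rounding.
print(abs(a1 - a2) < 1e-9, abs(b1 - b2) < 1e-9)  # True True
```

The centered form is generally the safer choice numerically, since the raw-sums form subtracts two large, nearly equal numbers when the x values are far from zero.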
4. Predicting for new input data
Suppose we want to know how many marks a student would get after studying for 9 hours. We can use our model to predict this. Let’s see the implementation below.
# predicting when input is 9 hrs
print(predict(X_train,Y_train,[9]))
Output:
[88.71475789309682]
From the above, the model predicts that a student who studies for 9 hours is likely to get about 88.7 marks.
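The predict function refits the line on every call. When many single predictions are needed, one possible alternative is to fit once and reuse the coefficients. The sketch below shows that idea with a closure; `fit_line`, `make_predictor`, and `predict_marks` are hypothetical names, and the training data is a toy set lying on y = 2*x + 1 rather than the tutorial's dataset.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for paired lists x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    nr = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    dr = sum((xi - mean_x) ** 2 for xi in x)
    a = nr / dr
    return a, mean_y - a * mean_x

def make_predictor(x_train, y_train):
    """Fit once and return a function that predicts y for a single x."""
    a, b = fit_line(x_train, y_train)

    def predict_one(value):
        return a * value + b

    return predict_one

# Toy training data invented for illustration (exactly y = 2*x + 1).
predict_marks = make_predictor([1, 2, 3, 4], [3, 5, 7, 9])
print(predict_marks(5))  # 11.0
```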
5. Calculating the RMSE
Now, we will calculate the root mean square error (RMSE) for our model using the formula below, where N is the number of test samples.
RMSE = sqrt(sum((Y_test-Y_pred)**2)/N)
# Function to calculate the RMSE
def RMSE(Y_test,Y_pred):
    n = len(Y_test)
    squared_error = 0
    for i in range(n):
        squared_error += (Y_test[i]-Y_pred[i])**2
    return np.sqrt(squared_error/n)

print('RMSE : ',RMSE(Y_test,Y_pred))
Output:
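As a rough cross-check that does not need data.csv, the error can be recomputed from the five rows of the Y_test / Y_pred comparison table shown earlier. This is a sketch using only the standard library; the predictions in that table are rounded to six decimals, so the result is approximate, and `rmse` is a name introduced here for illustration.

```python
import math

# Values copied from the Y_test / Y_pred comparison table earlier in the tutorial.
y_test = [30, 54, 35, 76, 86]
y_pred = [28.736325, 48.729136, 39.208750, 68.721947, 77.290295]

def rmse(actual, predicted):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse(y_test, y_pred), 2))  # 5.93
```

An error of roughly 6 marks on a scale where the test values range from 30 to 86 gives a sense of how far a typical prediction is from the actual result.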
6. Plotting the regression line
Now, we will plot the regression line for the test input and compare it with the actual output graphically.
# plotting Y_test vs X_test
plt.scatter(X_test,Y_test,color='r', label='Actual')
# plotting regression line
plt.plot(X_test,Y_pred,color='b', label='Predicted')
plt.grid()
plt.xlabel("Hours")
plt.ylabel("Marks")
plt.title("Linear Regression")
plt.legend()
plt.show()
Output plot:
In this tutorial, we implemented linear regression from scratch, without using a ready-made regression routine, and calculated its root mean square error.