Linear Regression from scratch in Python

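In this tutorial, we will implement linear regression from scratch in Python, without using a machine learning library for the model itself. pandas, NumPy and Matplotlib are used only for loading the data, taking a square root and plotting.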

Linear regression finds the relationship between an independent input variable and a dependent output variable. It is used when the output varies linearly with the input. In this tutorial, we will consider a simple case where the input is the hours of study and the output is the marks obtained, and we will predict the output for new input data.

Mathematical Background

For the simplest form of a linear regression model with one dependent and one independent variable we can write,

y = a*x+b

where,

  • y -> dependent variable
  • x -> independent variable
  • a -> slope (regression coefficient)
  • b -> intercept (regression constant)

The least squares method gives the values of a and b from the given data:

a = sum((x-mean(x))*(y-mean(y)))/sum((x-mean(x))**2)

b = mean(y) - a*mean(x)

The complete derivation can be found here. Now, we are set for a step-by-step implementation of the linear regression algorithm in Python using the above formulas.
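Before moving on, the two formulas can be checked by hand on a tiny data set. The sketch below uses toy points that lie exactly on y = 2*x + 1 (they are not taken from data.csv), and the helper name fit_line is only illustrative.

```python
# Toy check of the closed-form least-squares formulas.
# The data and the name fit_line are illustrative assumptions.
def mean(values):
    return sum(values) / len(values)

def fit_line(xs, ys):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    x_bar, y_bar = mean(xs), mean(ys)
    nr = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # numerator
    dr = sum((x - x_bar) ** 2 for x in xs)                       # denominator
    a = nr / dr
    b = y_bar - a * x_bar
    return a, b

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # exactly y = 2*x + 1
a, b = fit_line(xs, ys)
print(a, b)  # → 2.0 1.0
```

Here mean(x) = 2.5 and mean(y) = 6, the numerator is 10 and the denominator is 5, so a = 2 and b = 6 - 2*2.5 = 1, which recovers the line the points were drawn from.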

1. Importing Libraries

import pandas as pd
import numpy as np  
import matplotlib.pyplot as plt

2. Importing the dataset

Let’s import the data set and split it into training and test data.

data = pd.read_csv("data.csv")
# converting column of data frame to list 
x = data["Hours"].values.tolist()  
y = data["Marks"].values.tolist()
# splitting them to test and train data
X_train = x[:20]
Y_train = y[:20]
X_test = x[20:]
Y_test = y[20:]
data.head()

Output:

   Hours  Marks
0    2.5     21
1    5.1     47
2    3.2     27
3    8.5     75
4    3.5     30

 

The dataset can be found here: data
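The split above simply takes the first 20 rows for training, which works when the rows are in no particular order. If the file happened to be sorted (for example by hours), shuffling before splitting would be one option. The following is only a sketch of that idea, not the split used in this tutorial. The function name shuffled_split and the fixed seed are assumptions, and the sample values are the five rows shown by data.head().

```python
import random

def shuffled_split(xs, ys, n_train, seed=0):
    """Shuffle (x, y) pairs together, then split into train and test lists.

    Hypothetical helper: the name and the seed default are illustrative.
    """
    pairs = list(zip(xs, ys))
    random.Random(seed).shuffle(pairs)  # fixed seed keeps the split repeatable
    train, test = pairs[:n_train], pairs[n_train:]
    x_train = [x for x, _ in train]
    y_train = [y for _, y in train]
    x_test = [x for x, _ in test]
    y_test = [y for _, y in test]
    return x_train, y_train, x_test, y_test

# First five rows of the dataset, as printed by data.head()
hours = [2.5, 5.1, 3.2, 8.5, 3.5]
marks = [21, 47, 27, 75, 30]
x_train, y_train, x_test, y_test = shuffled_split(hours, marks, n_train=3)
print(len(x_train), len(x_test))  # → 3 2
```

Each hours value stays paired with its marks value because the pairs are shuffled together rather than the two lists separately.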

3. Creating the model

Let’s code out the linear regression algorithm with reference to the above equations.

# to find mean of the list input
def mean(x):
    return sum(x)/len(x)

# To find the coefficients a,b for the given train data
def Coefficients(X_train,Y_train):
    '''
    input: X_train, Y_train lists
    output : a,b floats (slope and intercept)
    '''
    # compute the means once instead of on every loop iteration
    x_mean = mean(X_train)
    y_mean = mean(Y_train)
    nr = 0
    dr = 0
    for i in range(len(X_train)):
        nr += (X_train[i]-x_mean)*(Y_train[i]-y_mean)
        dr += (X_train[i]-x_mean)**2
    a = nr/dr
    b = y_mean-a*x_mean
    return a,b
    
# Predicts the output for the test input    
def predict(X_train,Y_train,X_test):
    '''
    input: X_train, Y_train, X_test lists
    output : Y_pred list
    '''
    a,b = Coefficients(X_train,Y_train)
    Y_pred = []
    for val in X_test:
        y_pred = (a*val)+b
        Y_pred.append(y_pred)
    return Y_pred

# Predicting output for test input
Y_pred = predict(X_train,Y_train,X_test)
out = {'Y_test':Y_test,'Y_pred':Y_pred}
predict_out = pd.DataFrame(out)

# Displaying output as DataFrame for better comparison between Y_test and Y_pred
predict_out

Output:

Y_test  Y_pred
30      28.736325
54      48.729136
35      39.208750
76      68.721947
86      77.290295

4. Predicting for new input data

Suppose we want to know how many marks a student would get after studying for 9 hours. We can use our model to predict this. Let’s see the implementation below.

# predicting when input is 9 hrs
print(predict(X_train,Y_train,[9]))

Output:

[88.71475789309682]

From the above, we can observe that the student is likely to get about 88.7 marks after studying for 9 hours.
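Note that predict recomputes the coefficients from the training data on every call. When many predictions are needed, one possible design is to fit once and reuse the coefficients. The sketch below assumes a and b have already been computed. The name make_predictor and the values 2.0 and 1.0 are purely illustrative, not the coefficients fitted from data.csv.

```python
def make_predictor(a, b):
    """Return a function that evaluates the line y = a*x + b.

    Hypothetical helper: it only stores already-fitted coefficients.
    """
    def predict_one(x):
        return a * x + b
    return predict_one

# Illustrative coefficients, not the values fitted from data.csv
line = make_predictor(2.0, 1.0)
print([line(h) for h in [1, 4.5, 9]])  # → [3.0, 10.0, 19.0]
```

With the tutorial's functions, the same pattern would call Coefficients(X_train,Y_train) once and pass the resulting a and b to the predictor.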

5. Calculating the RMSE

Now, we will calculate the root mean square error (RMSE) for our model using the formula below, where N is the number of test samples.

RMSE = sqrt(sum((Y_test-Y_pred)**2)/N)

# Function to calculate the RMSE
def RMSE(Y_test,Y_pred):
    n = len(Y_test)
    squared_error = 0
    # accumulate the squared differences between actual and predicted values
    for i in range(n):
        squared_error += (Y_test[i]-Y_pred[i])**2
    return np.sqrt(squared_error/n)

print('RMSE Error : ',RMSE(Y_test,Y_pred))

Output:

RMSE Error :  5.931635159442718
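As a worked check, the error can be recomputed from the five Y_test and Y_pred values printed in step 3. The sketch below assumes those five rows are the whole test set and uses math.sqrt from the standard library. The function name rmse is illustrative. Because the printed predictions are rounded to six decimals, the result is only expected to agree with the output above to a few decimal places.

```python
import math

# Values copied from the Y_test / Y_pred table printed in step 3
y_test = [30, 54, 35, 76, 86]
y_pred = [28.736325, 48.729136, 39.208750, 68.721947, 77.290295]

def rmse(actual, predicted):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))

print(round(rmse(y_test, y_pred), 3))  # → 5.932
```

The squared errors sum to roughly 175.92, so the mean is about 35.18 and its square root is about 5.93, in line with the value reported above.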

6. Plotting the regression line

Now, we will plot the regression line for the test input and compare it graphically with the actual output.

# plotting Y_test vs X_test
plt.scatter(X_test,Y_test,color='r', label='Actual')
# plotting regression line
plt.plot(X_test,Y_pred,color='b', label='Predicted')
plt.grid()
plt.xlabel("Hours")
plt.ylabel("Marks")
plt.title("Linear Regression")
plt.legend()
plt.show()

Output plot:

[Figure: actual test marks (red points) and the predicted regression line (blue) plotted against hours of study]

Finally, in this tutorial, we implemented linear regression from scratch, without using a machine learning library for the model, and calculated its root mean square error.
