K Fold Cross Validation without using sklearn in Python
In this tutorial, we will learn how to perform K fold cross validation without using sklearn in Python.
I will guide you through the cross-validation technique, which is widely used in machine learning. We will learn why this technique is needed and study the very famous K Fold Cross Validation method. Although ready-made functions are available in the scikit-learn library, we will implement the method without them, which will help us better understand how it works.
What is the Cross Validation Technique?
Consider a dataset on which you want to apply a Machine Learning model. The first thing you do after cleaning the dataset is split it into a training set and a test set, usually in a ratio of 70:30 or 80:20. Here comes the very first problem: the split is random, so every time you run the code your training and test sets differ, and the model gives a different accuracy. This is often controlled by fixing the random state during splitting, but different random states give different accuracies, and you still want to validate your model. After all, your model's performance should not depend on the randomness of the split. This is where the Cross Validation Technique comes in.
The Cross Validation Technique assesses the accuracy of a Machine Learning model by training and evaluating it on different subsets of the dataset for a certain number of iterations. The final result is the average of the results of all the iterations.
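To see the problem concretely, here is a minimal sketch (my own illustration, using sklearn's train_test_split along with the same iris dataset and Logistic Regression model that appear later in this tutorial) showing how the accuracy shifts when only the random state changes:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# The same model scores differently depending only on how the data was split
for state in (0, 1, 2):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=state)
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    print(state, model.score(X_test, y_test))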
K Fold Cross Validation
In this method, the dataset is divided into k subsets. The model is trained and evaluated k times, each time using a different subset as the test set while the remaining subsets form the training set. This ensures that every sample in the dataset is used for both training and testing.
Suppose the length of your dataset is n. You divide it into k subsets, each of size n/k = m. In the first iteration, the first m samples are taken as the test set; in the second iteration, the next m samples; and the process continues for k iterations. We call the number of iterations the number of folds. In each fold, one subset of size m serves as the test dataset, while the remaining k-1 subsets together form the training dataset.
For example: consider an array of the numbers 1 to 9 in ascending order, and apply k fold cross validation. Here n will be 9, and let k = 3, so m = 3.
Your train and test data should look like this:
Train: [4, 5, 6, 7, 8, 9]    Test: [1, 2, 3]
Train: [7, 8, 9, 1, 2, 3]    Test: [4, 5, 6]
Train: [1, 2, 3, 4, 5, 6]    Test: [7, 8, 9]
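A few lines of NumPy reproduce these splits (a quick toy illustration; data, m, and the loop variables exist only for this example):

import numpy as np

data = np.arange(1, 10)  # the numbers 1 to 9
k = 3
m = len(data) // k       # size of each fold: 3

for i in range(k):
    test = data[i * m:(i + 1) * m]
    train = np.concatenate([data[:i * m], data[(i + 1) * m:]])
    print(train, test)

Note that the second training set prints as [1 2 3 7 8 9]: the same samples as above, just concatenated in index order.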
Now that you know what to do, let's implement our learning in Python.
Step 1: Importing Libraries
I will use the sklearn library to load the iris dataset and import the Logistic Regression model.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
Step 2: Loading the Dataset
Load the iris dataset from sklearn.
iris = datasets.load_iris()
X, y = iris.data, iris.target
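To confirm the load, you can check the shapes; iris has 150 samples with 4 features each:

print(X.shape, y.shape)  # (150, 4) (150,)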
Step 3: Setting ‘n’ and ‘k’ values
I am taking k = 5 and defining accuracies as a list to store the model's accuracy on the different folds. I am also defining indices and shuffling it so that the dataset is shuffled according to the indices array.
n = len(X)
k = 5
nfold = n // k  # number of samples in each fold

accuracies = []

# Shuffle the row indices so that the folds are drawn randomly
indices = np.arange(n)
np.random.shuffle(indices)
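With the iris dataset, n = 150, so nfold = 150 // 5 = 30 samples per fold, and since 150 divides evenly by 5, no samples are left out.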
Step 4: Loading the Model
Load the Logistic Regression model and define the calcaccu function, which calculates the accuracy of the model by returning the fraction of classes it predicted correctly.
model = LogisticRegression()

# Fraction of predictions that match the true labels
def calcaccu(y_true, y_pred):
    return np.mean(y_true == y_pred)
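A quick sanity check of calcaccu on a toy input (values chosen only for illustration):

print(calcaccu(np.array([0, 1, 2, 2]), np.array([0, 1, 1, 2])))  # 0.75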
Step 5: K fold Cross Validation without sklearn
Now write in code what was described earlier: divide the dataset into training and test sets for k iterations, each test subset having nfold samples.
for i in range(k):
    # Indices of the i-th test fold and of the remaining training samples
    testind = indices[i * nfold:(i + 1) * nfold]
    trainind = np.concatenate([indices[:i * nfold], indices[(i + 1) * nfold:]])

    X_train, y_train = X[trainind], y[trainind]
    X_test, y_test = X[testind], y[testind]

    # Train on the other k-1 folds, evaluate on the held-out fold
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = calcaccu(y_test, predictions)
    accuracies.append(accuracy)

avg_accuracy = np.mean(accuracies)
print(avg_accuracy)
Output:
0.9600000000000001
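One caveat worth noting: the loop above silently drops the last n % k samples whenever n is not divisible by k (with iris, 150 divides evenly by 5, so nothing is lost here). A minimal variant, reusing X, y, k, indices, model, and calcaccu from the steps above, uses np.array_split, which spreads any remainder across the folds:

folds = np.array_split(indices, k)  # tolerates uneven splits, e.g. sizes 4, 3, 3 for n = 10, k = 3
accuracies = []
for i in range(k):
    testind = folds[i]
    trainind = np.concatenate([folds[j] for j in range(k) if j != i])
    model.fit(X[trainind], y[trainind])
    accuracies.append(calcaccu(y[testind], model.predict(X[testind])))
print(np.mean(accuracies))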