K Fold Cross Validation without using sklearn in Python
In this tutorial, we will learn how to perform K fold cross validation without using sklearn in Python.
I will guide you through the cross-validation technique, which is widely used in machine learning. We will learn why this technique is needed and study the very famous K Fold Cross Validation method. Although ready-made functions are available in the scikit-learn library, we will implement the method without them, which will help us better understand how it works.
What is the Cross Validation Technique?
Consider a dataset on which you want to apply a Machine Learning model. The first thing you do after cleaning the dataset is split it into a training set and a test set, usually in a ratio of 70:30 or 80:20. Here comes the very first problem: the split is random, so every time you run the code your training and test sets differ, and the model gives a different accuracy. This is often controlled by fixing the random state during splitting, but different random states give different accuracies, and you still want to validate your model. After all, your model's performance should not depend on the randomness of the split. This is where the Cross Validation Technique comes in.
The Cross Validation Technique assesses the accuracy of a Machine Learning model by training and evaluating it on different subsets of the dataset for a certain number of iterations. The final result is the average of the results of all the iterations.
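To see the problem concretely, here is a minimal sketch (my own illustration, using sklearn's train_test_split along with the same iris dataset and Logistic Regression model that appear later in this tutorial) showing how the accuracy shifts when only the random state changes:

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# The same model scores differently depending only on how the data was split
for state in (0, 1, 2):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=state)
    model = LogisticRegression(max_iter=200)
    model.fit(X_train, y_train)
    print(state, model.score(X_test, y_test))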
K Fold Cross Validation
In this method, the dataset is divided into k subsets. The model is trained and evaluated k times, each time using a different subset as the test set while the remaining subsets form the training set. This ensures that every sample in the dataset is used for both training and testing.
Suppose the length of your dataset is n. You divide it into k subsets, each of size n/k = m. In the first iteration, the first m samples are taken as the test set; in the second iteration, the next m samples; and the process continues for k iterations. We call the number of iterations the number of folds. In each fold, one subset of size m serves as the test dataset, while the remaining k-1 subsets together form the training dataset.
For example: consider an array of the numbers 1 to 9 in ascending order, and apply k fold cross validation. Here n will be 9, and let k = 3, so m = 3.
Your train and test data should look like this:
Train: [4, 5, 6, 7, 8, 9]    Test: [1, 2, 3]
Train: [7, 8, 9, 1, 2, 3]    Test: [4, 5, 6]
Train: [1, 2, 3, 4, 5, 6]    Test: [7, 8, 9]
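A few lines of NumPy reproduce these splits (a quick toy illustration; data, m, and the loop variables exist only for this example):

import numpy as np

data = np.arange(1, 10)  # the numbers 1 to 9
k = 3
m = len(data) // k       # size of each fold: 3

for i in range(k):
    test = data[i * m:(i + 1) * m]
    train = np.concatenate([data[:i * m], data[(i + 1) * m:]])
    print(train, test)

Note that the second training set prints as [1 2 3 7 8 9]: the same samples as above, just concatenated in index order.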
Now that you know what to do, let's implement our learning in Python.
Step 1: Importing Libraries
I will use the sklearn library to load the iris dataset and import the Logistic Regression model.
import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
Step 2: Loading the Dataset
Load the iris dataset from sklearn.
iris = datasets.load_iris()
X, y = iris.data, iris.target
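To confirm the load, you can check the shapes; iris has 150 samples with 4 features each:

print(X.shape, y.shape)  # (150, 4) (150,)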
Step 3: Setting ‘n’ and ‘k’ values
I am taking k = 5 and defining accuracies as a list to store the model's accuracy on the different folds. I am also defining indices and shuffling it so that the dataset is shuffled according to the indices array.
n = len(X)
k = 5
nfold = n // k  # number of samples in each fold

accuracies = []

# Shuffle the row indices so that the folds are drawn randomly
indices = np.arange(n)
np.random.shuffle(indices)
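With the iris dataset, n = 150, so nfold = 150 // 5 = 30 samples per fold, and since 150 divides evenly by 5, no samples are left out.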
Step 4: Loading the Model
Load the Logistic Regression model and define the calcaccu function, which calculates the accuracy of the model by returning the fraction of classes it predicted correctly.
model = LogisticRegression()

# Fraction of predictions that match the true labels
def calcaccu(y_true, y_pred):
    return np.mean(y_true == y_pred)
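A quick sanity check of calcaccu on a toy input (values chosen only for illustration):

print(calcaccu(np.array([0, 1, 2, 2]), np.array([0, 1, 1, 2])))  # 0.75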
Step 5: K fold Cross Validation without sklearn
Now write in code what was described earlier: divide the dataset into training and test sets for k iterations, each test subset having nfold samples.
for i in range(k):
    # Indices of the i-th test fold and of the remaining training samples
    testind = indices[i * nfold:(i + 1) * nfold]
    trainind = np.concatenate([indices[:i * nfold], indices[(i + 1) * nfold:]])

    X_train, y_train = X[trainind], y[trainind]
    X_test, y_test = X[testind], y[testind]

    # Train on the other k-1 folds, evaluate on the held-out fold
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    accuracy = calcaccu(y_test, predictions)
    accuracies.append(accuracy)

avg_accuracy = np.mean(accuracies)
print(avg_accuracy)
Output:
0.9600000000000001
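One caveat worth noting: the loop above silently drops the last n % k samples whenever n is not divisible by k (with iris, 150 divides evenly by 5, so nothing is lost here). A minimal variant, reusing X, y, k, indices, model, and calcaccu from the steps above, uses np.array_split, which spreads any remainder across the folds:

folds = np.array_split(indices, k)  # tolerates uneven splits, e.g. sizes 4, 3, 3 for n = 10, k = 3
accuracies = []
for i in range(k):
    testind = folds[i]
    trainind = np.concatenate([folds[j] for j in range(k) if j != i])
    model.fit(X[trainind], y[trainind])
    accuracies.append(calcaccu(y[testind], model.predict(X[testind])))
print(np.mean(accuracies))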