k-fold Cross-Validation in Machine Learning

Performance estimation is crucial for any model. Cross-validation is one such estimation strategy: it gives a more reliable measure of how well a model generalizes than a single train/test split. In this tutorial, you will learn how to train and evaluate a model using k-fold cross-validation.

k-fold cross-validation:

Steps involved:

  1. Loading packages
  2. Understanding the data
  3. User input (value for k)
  4. k-fold cross-validation
  5. Training the model
  6. Accuracy estimation

Working:

  • In this method, the dataset is divided into k equal, mutually exclusive folds (D1, D2, ..., Dk).
  • A series of k runs is carried out: in the ith run, fold Di serves as the test data and the remaining k - 1 folds as the train data.
  • Accuracy is calculated for each run, and the overall accuracy is the average across all k runs (see the sketch below).
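
To make the decomposition concrete, here is a minimal, self-contained sketch (not part of the tutorial's pipeline) that partitions 8 toy sample indices into k = 4 folds with numpy and prints the train/test split for each run:

import numpy as np

indices = np.arange(8)               # toy dataset of 8 samples
folds = np.array_split(indices, 4)   # 4 equal, mutually exclusive folds
for i, test_idx in enumerate(folds):
    # run i: fold i is the test set, the remaining folds form the train set
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"Run {i + 1}: train={train_idx}, test={test_idx}")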

Loading packages:

import pandas as pd
from sklearn.model_selection import KFold 
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer

Loading the dataset: 
Here we use the breast cancer dataset, which can be loaded directly from sklearn.

cancer_data = load_breast_cancer(as_frame=True)
df = cancer_data.frame
X = df.iloc[:, :-1]  # all columns except the last are features
y = df.iloc[:, -1]   # the last column ('target') is the label
print(df.columns)
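
Optionally, before splitting, you can check the shape and class balance of the data (a quick sanity check, not required for the rest of the tutorial):

print(X.shape)           # (569, 30): 569 samples, 30 features
print(y.value_counts())  # how many samples fall in each target class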

User Input:
Here, the user needs to enter the value of k:

print("Enter the value of k")
k = int(input())
Enter the value of k
4

Let’s assume k to be 4

k-fold cross-validation:

kfold_val = KFold(n_splits=k)  # shuffle is False by default, so no random_state is needed

This divides the dataset into k (i.e., 4) equal, mutually exclusive folds.
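
To see what KFold actually produces, you can print the index arrays it yields before training (a quick sanity check; with 569 rows and k = 4, the first fold gets one extra row since 569 is not divisible by 4):

for fold, (train_index, test_index) in enumerate(kfold_val.split(X), start=1):
    print(f"Fold {fold}: {len(train_index)} train rows, {len(test_index)} test rows")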

Training and estimation: 

To classify the data, we use LogisticRegression, as shown:

lr = LogisticRegression()  # note: the default solver may warn about convergence on this data; raising max_iter silences it
accuracy_scores = []
for train_index, test_index in kfold_val.split(X):
    # the current fold is the test data; the remaining folds are the train data
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    lr.fit(X_train, y_train)
    pred = lr.predict(X_test)

    accuracy = accuracy_score(y_test, pred)
    accuracy_scores.append(accuracy)
print("Accuracy score for each fold:")
print(accuracy_scores)
Accuracy score for each fold:
[0.916083916083916, 0.9436619718309859, 0.9647887323943662, 0.9295774647887324]

We now have an accuracy for each fold. The final accuracy is the average of these values, as shown:

print("Overall accuracy:")
print(sum(accuracy_scores) / k)
Overall accuracy: 0.9385280212745001
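
As a side note, scikit-learn can perform this whole loop in a single call with cross_val_score; a minimal sketch that should produce comparable numbers (it clones the estimator internally and defaults to accuracy scoring for classifiers):

from sklearn.model_selection import cross_val_score

scores = cross_val_score(lr, X, y, cv=kfold_val)
print(scores)         # per-fold accuracies
print(scores.mean())  # overall accuracy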

In this way, we achieved an overall accuracy of about 94% using k-fold cross-validation. I hope you found this helpful. Thank you!
