k-fold Cross-Validation in Machine Learning
Performance estimation is crucial for any model. Cross-validation is one such estimation strategy: it gives a more reliable measure of a model's accuracy than a single train/test split. In this tutorial, you will learn how to train and evaluate a model using k-fold cross-validation.
k-fold cross-validation:
- Loading packages
- Understanding the data
- User input (value for k)
- k-fold cross-validation
- Training the model
- Accuracy estimation
- In this method, the dataset is divided into k equal, mutually exclusive folds (D1, D2, ..., Dk).
- A series of k runs is carried out: in the ith iteration, Di is used as test data and the remaining folds as train data.
- Accuracy is calculated for each iteration, and the overall accuracy is the average of these k values.
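The splitting scheme above can be sketched on a toy dataset of 8 samples; the names below (data, folds, train, test) are illustrative and not part of the tutorial's code:

```python
# Illustrative sketch: split 8 samples into k = 4 equal, mutually exclusive folds
data = list(range(8))          # toy "dataset" of 8 samples
k = 4
fold_size = len(data) // k
folds = [data[i * fold_size:(i + 1) * fold_size] for i in range(k)]
print(folds)                   # [[0, 1], [2, 3], [4, 5], [6, 7]]

# Run i uses fold i as test data and all remaining folds as train data
for i in range(k):
    test = folds[i]
    train = [x for j, f in enumerate(folds) if j != i for x in f]
    print("run", i, "train:", train, "test:", test)
```

Each sample appears in the test set exactly once across the k runs, which is what makes the folds mutually exclusive.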
import pandas as pd
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.datasets import load_breast_cancer
Loading the dataset:
Here we are considering the breast cancer dataset, which can be loaded directly from sklearn.
cancer_data = load_breast_cancer(as_frame=True)
df = cancer_data.frame
X = df.iloc[:, :-1]   # all columns except the last are features
y = df.iloc[:, -1]    # the last column ('target') is the label
print(df.columns)
Here, the user needs to enter the value of k:
print("Enter the value of k")
k = int(input())
Enter the value of k
4

Let's assume k to be 4.
kfold_val = KFold(n_splits=k, random_state=None)
This divides the dataset into k (i.e. 4) equal, mutually exclusive folds. With the default shuffle=False, the folds are consecutive blocks of rows.
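To see what KFold actually produces, we can inspect the row-index arrays it yields on a small toy matrix; the names X_toy and test_blocks are illustrative only:

```python
# Illustrative only: KFold.split yields (train_index, test_index) pairs.
# With 8 rows and n_splits=4, each test fold is a consecutive block of 2 rows.
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(16).reshape(8, 2)   # toy feature matrix with 8 rows
test_blocks = []
for train_index, test_index in KFold(n_splits=4).split(X_toy):
    print("train:", train_index, "test:", test_index)
    test_blocks.append([int(i) for i in test_index])
print(test_blocks)   # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Note that KFold returns index arrays, not the data itself; the rows are selected later with .iloc.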
Training and estimation:
To classify the data, we are using LogisticRegression as shown:
lr = LogisticRegression()
accuracy_scores = []
for train_index, test_index in kfold_val.split(X):
    # select the train/test rows for this fold by position
    X_train, X_test = X.iloc[train_index, :], X.iloc[test_index, :]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]
    lr.fit(X_train, y_train)
    pred = lr.predict(X_test)
    accuracy = accuracy_score(y_test, pred)
    accuracy_scores.append(accuracy)
print("Accuracy score for each fold:")
print(accuracy_scores)
Accuracy score for each fold:
[0.916083916083916, 0.9436619718309859, 0.9647887323943662, 0.9295774647887324]
We got the accuracy for each fold. The final accuracy is the average of the above values, as shown:
print("Overall accuracy:")
print(sum(accuracy_scores) / k)
Overall accuracy: 0.9385280212745001
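As an aside, scikit-learn can compute the same per-fold scores in a single call with cross_val_score; this is an equivalent shortcut rather than the tutorial's main flow. Here max_iter is raised (an adjustment on our part, not in the original code) so LogisticRegression converges without warnings, so the exact scores may differ slightly from those above:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

cancer_data = load_breast_cancer(as_frame=True)
df = cancer_data.frame
X = df.iloc[:, :-1]
y = df.iloc[:, -1]

# cv=KFold(n_splits=4) reproduces the same unshuffled 4-fold split as above
scores = cross_val_score(LogisticRegression(max_iter=5000), X, y, cv=KFold(n_splits=4))
print("Per-fold scores:", scores)
print("Overall accuracy:", scores.mean())
```

cross_val_score handles the fitting, predicting, and scoring loop internally, which makes it harder to introduce indexing bugs.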
In this way, we estimated an overall accuracy of about 94% using k-fold cross-validation. I hope this was helpful. Thank you!