# K-Nearest Neighbor Algorithm in Python

In this tutorial, we will learn about one of the supervised machine learning techniques, the K-Nearest Neighbor (KNN) algorithm, by implementing it from scratch in Python. It is mainly used for classification.

## K-Nearest Neighbor Algorithm

First, let’s see how the algorithm works:

**Step-1:** Select a value for K.

**Step-2:** Calculate the Euclidean distance from the point to be classified to every training sample.

**Step-3:** Sort the Euclidean distances, consider the K nearest neighbors, and take the count of each class label among these K neighbors.

**Step-4:** Classify the point with the class label whose count is maximum.

**Step-5:** Finally, our model is ready.

**NOTE:** As K increases, the classification boundary becomes smoother and the training error can increase.
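The steps above can be sketched on a toy 2-D dataset before we work with the real data (the points and labels below are made up purely for illustration):

```python
import numpy as np
from collections import Counter

# Toy 2-D training set (made-up points, for illustration only)
pts = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5]])
labels = [0, 0, 1, 1]

def knn_predict(point, pts, labels, k):
    # Step-2: Euclidean distance to every training sample
    distances = np.sqrt(np.sum((pts - point) ** 2, axis=1))
    # Step-3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step-4: majority vote over their class labels
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]

print(knn_predict(np.array([1.2, 1.5]), pts, labels, k=3))  # → 0
```

The query point (1.2, 1.5) sits near the two class-0 points, so with k = 3 the vote is 2-to-1 in favor of class 0.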

**Steps involved:**

- Importing packages
- Loading the dataset
- Data Preprocessing
- Finding the optimal value of K by using techniques like cross-validation
- Performing k-nn with the optimal k
- Accuracy estimation

**Importing packages:**

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import KFold
```

**Importing train data:** Train data: Knn-train.dat

It is 192-dimensional data, where each row is one data element. In each row the last entry, i.e., the 193rd entry, is the class label, which is one of 0, 1, 2, ..., 9.
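Assuming each line of the file is whitespace-separated feature values followed by the class label, a single line can be parsed like this (the sample line below is made up, with only three features for brevity):

```python
import numpy as np

# A made-up line in the same shape as the file: features, then a label
line = "0.5 1.25 3.0 7"
parts = line.split()
features = np.asarray(parts[:-1], dtype=np.float64)  # all but the last token
label = int(parts[-1])                               # last token is the class
print(features, label)
```

Splitting on whitespace and taking the last token is more robust than slicing a fixed number of characters from the end of the line, as the loaders below do.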

```python
with open('PATH OF THE FILE', 'rb') as f:
    data = f.read()
s = str(data, 'utf-8')
data = [[np.asarray(temp[:-3].split(), dtype=np.float64, order='C'), int(temp[-2])]
        for temp in s.split('\n')[:-1]]
df = pd.DataFrame(data, columns=['data', 'class'])
x_train = df['data']
y_train = df['class']
```

**Importing test data:** Test data: Knn-test.dat

```python
with open('PATH OF THE FILE', 'rb') as f:
    data = f.read()
s = str(data, 'utf-8')
data = [[np.asarray(i[:-2].split(), dtype=np.float64, order='C'), int(i[-1])]
        for i in s.split('\n')[:-1]]
tdf = pd.DataFrame(data, columns=['data', 'class'])
x_test = tdf['data']
y_test = tdf['class']
```

**Function for euclidean distance:**

```python
def fn_for_euclidean_distance(train, test):
    difference = test - train
    distance = np.sqrt(np.sum(difference ** 2))
    return distance
```
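As a quick sanity check (the two points below are made up to form a 3-4-5 right triangle), the same distance can also be computed with NumPy's built-in `np.linalg.norm`:

```python
import numpy as np

def fn_for_euclidean_distance(train, test):
    difference = test - train
    return np.sqrt(np.sum(difference ** 2))

a = np.array([0.0, 3.0])
b = np.array([4.0, 0.0])
print(fn_for_euclidean_distance(a, b))  # → 5.0 (3-4-5 triangle)
print(np.linalg.norm(b - a))            # same result, built in
```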

**KNN from scratch:** For each test point x to be classified, find the K nearest samples in the training data and classify x according to the majority vote of their class labels.

```python
def fn_for_knn(x_test, x_train, y_train, possible_k, sample, method=''):
    count = 1
    test_length = len(x_test)
    # Implementing to find best K for KNN algorithm
    if sample == 'train':
        k = temp = 1
        y_pred = [[] for _ in range(possible_k)]
    # Implementing to test using best K
    elif sample == 'test':
        k = temp = possible_k
        y_pred = []
    for test in x_test:
        distances = []
        for train in x_train:
            distance = fn_for_euclidean_distance(train, test)
            distances.append(distance)
        data = {'Distances': distances, 'class': y_train}
        df_distance = pd.DataFrame(data)
        df_distance = df_distance.sort_values(by=['Distances'], axis=0)
        k = temp
        # For Modified (distance-weighted) KNN
        if method == "modified":
            while k <= possible_k:
                df_k_distance = np.array(df_distance[:k])
                weighted_k = fn_for_finding_weights(k, df_k_distance)
                votes = np.zeros(10, dtype=np.float32)
                for j in range(k):
                    votes[np.int64(df_k_distance[j][1])] += weighted_k[j]
                prediction = np.argmax(votes)
                if sample == 'train':
                    y_pred[k - 1].append(prediction)
                elif sample == 'test':
                    y_pred.append(prediction)
                k += 1
        else:
            while k <= possible_k:
                df_k_distance = df_distance[:k]
                counter = Counter(y_train[df_k_distance.index])
                prediction = counter.most_common()[0][0]
                if sample == 'train':
                    y_pred[k - 1].append(prediction)
                elif sample == 'test':
                    y_pred.append(prediction)
                k += 1
        count += 1
    return y_pred
```
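The voting step inside the loop relies on `Counter.most_common`; on a toy set of neighbor labels (made up for illustration) it behaves like this:

```python
from collections import Counter

# Hypothetical class labels of the k = 5 nearest neighbors
neighbor_labels = [2, 7, 2, 2, 7]
counter = Counter(neighbor_labels)
prediction = counter.most_common()[0][0]  # label with the highest count
print(prediction)  # → 2
```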

**Cross-validation:** For detailed info regarding cross-validation, please refer to https://www.codespeedy.com/k-fold-cross-validation-in-machine-learning/

```python
def fn_for_crossValidation(x_train, y_train, possible_k, method=''):
    y_pred = []
    num_folds = 5
    kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)
    accuracies = [0] * possible_k
    errors = [0] * possible_k
    temp = 1
    for train_index, test_index in kf.split(x_train):
        x_val = x_train[test_index]
        y_val = y_train[test_index]
        x_t = x_train[train_index]
        y_t = y_train[train_index]
        y_pred = fn_for_knn(x_val, x_t, y_t, possible_k, 'train', method)
        print("Cross-Validation batch", temp, "completed")
        for k in range(possible_k):
            accuracies[k] += accuracy_score(y_val, y_pred[k]) / num_folds * 100
            errors[k] += mean_squared_error(y_val, y_pred[k]) / num_folds * 100
        temp += 1
    return accuracies, errors

possible_k = 25
print("Cross-Validation STARTED\n")
accuracies, errors = fn_for_crossValidation(x_train, y_train, possible_k)
print("\nCross-Validation COMPLETED\n")
for temp in range(possible_k):
    print('Validation Accuracy for k =', temp + 1, "is:", accuracies[temp])
best_k = np.argmax(accuracies) + 1
print("Best value for k:", best_k)
print("Maximum Validation Accuracy (Validation Accuracy for Best k):", max(accuracies))
```
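To see what `KFold` hands to the loop above, here is a minimal run on ten dummy samples (the data is just the indices 0–9):

```python
import numpy as np
from sklearn.model_selection import KFold

x = np.arange(10)  # ten dummy samples
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_index, test_index in kf.split(x):
    # Each of the 5 folds holds out 2 of the 10 samples for validation
    print("train:", train_index, "val:", test_index)
```

Every sample appears in exactly one validation fold, so averaging the per-fold scores (as `fn_for_crossValidation` does) uses the whole training set for evaluation.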

**Validation accuracy vs k:** Let’s plot the cross-validation accuracy for the different values of k:

```python
possible_val_k = [i for i in range(1, 26)]
plt.plot(possible_val_k, accuracies, color='black', linestyle='dotted',
         linewidth=2.5, marker='*', markerfacecolor='green', markersize=9)
plt.xlabel('K values', labelpad=15)
plt.ylabel('Validation Accuracy', labelpad=20)
plt.title('Validation Accuracy for different Values of K')
plt.show()
```

From the plot, we can see that the optimal value of k is 4, as it gives the highest validation accuracy.

**Accuracy for the best value of k:** As the optimal value of k is 4, the accuracy for k = 4 is given by,

```python
print("Best value for k:", best_k)
print("Maximum Validation Accuracy (Validation Accuracy for Best k):", max(accuracies))
```

```
Best value for k: 4
Maximum Validation Accuracy (Validation Accuracy for Best k): 92.56371814092952
```

**Accuracy for test data:** The accuracy on the test data is,

```python
y_hat_test = fn_for_knn(x_test, x_train, y_train, best_k, 'test')
print("\nTest Accuracy for k as", best_k, "is", accuracy_score(y_test, y_hat_test) * 100)
```

```
Test Accuracy for k as 4 is 97.4512743628186
```

Hence, by selecting k through cross-validation, we achieved a test accuracy of about 97%.

I hope it might be helpful for you. Thank you!
