# K-Nearest Neighbor Algorithm in Python

In this tutorial, we will implement one of the supervised machine learning techniques, the K-Nearest Neighbor (KNN) algorithm, from scratch in Python. It is mainly used for classification.

## K-Nearest Neighbor Algorithm

First, let’s see how the algorithm works:

• Step-1: Choose a value for K.
• Step-2: Calculate the Euclidean distance from the query point to every training point.
• Step-3: Sort the distances, take the K nearest neighbors, and count the class labels among those K neighbors.
• Step-4: Assign the query point the class label with the maximum count (majority vote).
• Step-5: Finally, our model is ready.
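The steps above can be sketched in a few lines of NumPy before we build the full version below. The function name `knn_predict` and the toy data are illustrative, not part of the tutorial's dataset:

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, query, k):
    # Step-2: Euclidean distance from the query to every training point
    distances = np.sqrt(((x_train - query) ** 2).sum(axis=1))
    # Step-3: indices of the k nearest neighbors
    nearest = np.argsort(distances)[:k]
    # Step-3/4: majority vote over their class labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data: two small clusters with labels 0 and 1
x_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = [0, 0, 1, 1]
print(knn_predict(x_train, y_train, np.array([0.95, 0.9]), k=3))  # the 3 nearest are mostly class 1
```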

NOTE:
As K increases, the classification boundary becomes smoother, but the training error can increase.

Steps involved:

• Importing packages
• Data Preprocessing
• Finding the optimal value of K by using techniques like cross-validation
• Performing k-nn with the optimal k
• Accuracy estimation

Importing packages:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import KFold
```

Importing train data:
Train data: Knn-train.dat
It is 192-dimensional data where each row is one data element. In each row, the last entry (i.e., the 193rd) is the class label, which is one of 0, 1, 2, ..., 9.

```
with open('PATH OF THE FILE', 'rb') as f:
    s = str(f.read(), 'utf-8')
data = [[np.asarray(row[:-3].split(), dtype=np.float64), int(row[-2])]
        for row in s.split('\n')[:-1]]
df = pd.DataFrame(data, columns=['data', 'class'])
x_train = df['data']
y_train = df['class']
```

Importing test data:
Test data: Knn-test.dat

```
with open('PATH OF THE FILE', 'rb') as f:
    s = str(f.read(), 'utf-8')
data = [[np.asarray(row[:-2].split(), dtype=np.float64), int(row[-1])]
        for row in s.split('\n')[:-1]]
tdf = pd.DataFrame(data, columns=['data', 'class'])
x_test = tdf['data']
y_test = tdf['class']
```

Function for euclidean distance:

```
def fn_for_euclidean_distance(train, test):
    difference = test - train
    distance = np.sqrt(np.sum(difference ** 2))
    return distance
```

KNN from scratch:
For each test point x to be classified, find the K nearest samples in the training data and classify x according to the majority vote of their class labels.

```
def fn_for_knn(x_test, x_train, y_train, possible_k, sample, method=''):

    # While searching for the best K, collect predictions for every k in 1..possible_k
    if sample == 'train':
        k = temp = 1
        y_pred = [[] for _ in range(possible_k)]
    # While testing, predict only with the chosen K
    elif sample == 'test':
        k = temp = possible_k
        y_pred = []

    for test in x_test:
        distances = []
        for train in x_train:
            distances.append(fn_for_euclidean_distance(train, test))

        data = {
            'Distances': distances,
            'class': list(y_train)
        }
        df_distance = pd.DataFrame(data)
        df_distance = df_distance.sort_values(by=['Distances'], axis=0)

        k = temp

        # Modified KNN: weight each neighbor's vote by the inverse of its distance,
        # so closer neighbors count more
        if method == 'modified':
            while k <= possible_k:
                df_k_distance = df_distance[:k]
                weights = 1.0 / (np.array(df_k_distance['Distances']) + 1e-9)
                votes = {}
                for label, weight in zip(df_k_distance['class'], weights):
                    votes[label] = votes.get(label, 0.0) + weight
                prediction = max(votes, key=votes.get)

                if sample == 'train':
                    y_pred[k - 1].append(prediction)
                elif sample == 'test':
                    y_pred.append(prediction)
                k += 1

        # Plain KNN: unweighted majority vote among the k nearest neighbors
        else:
            while k <= possible_k:
                df_k_distance = df_distance[:k]
                counter = Counter(df_k_distance['class'])
                prediction = counter.most_common()[0][0]

                if sample == 'train':
                    y_pred[k - 1].append(prediction)
                elif sample == 'test':
                    y_pred.append(prediction)
                k += 1

    return y_pred
```
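The inner loop over `x_train` can be slow in pure Python. As a side note, NumPy broadcasting computes all distances for one test point in a single call; `all_distances` below is a hypothetical helper sketch, not part of the tutorial's code, assuming the training rows can be stacked into a 2-D array:

```python
import numpy as np

def all_distances(x_train, test):
    # Stack the per-row arrays into an (n, d) matrix, then take row-wise L2 norms.
    # Broadcasting: (n, d) - (d,) -> (n, d), so no explicit Python loop is needed.
    X = np.vstack(x_train)
    return np.linalg.norm(X - test, axis=1)

X_demo = [np.array([0.0, 0.0]), np.array([3.0, 4.0])]
print(all_distances(X_demo, np.array([0.0, 0.0])))  # [0. 5.]
```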

Cross-validation:
For detailed info regarding cross-validation please refer to https://www.codespeedy.com/k-fold-cross-validation-in-machine-learning/
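Before the full routine, it may help to see what `KFold` actually yields: for each fold, one array of training indices and one of validation indices. A minimal sketch on a 10-element dummy dataset (the data here is illustrative only):

```python
import numpy as np
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)
data = np.arange(10)
for train_index, test_index in kf.split(data):
    # With 10 samples and 5 folds: 8 training indices, 2 validation indices per fold
    print(train_index, test_index)
```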

```
def fn_for_crossValidation(x_train, y_train, possible_k, method=''):
    num_folds = 5
    kf = KFold(n_splits=num_folds, shuffle=True, random_state=42)

    accuracies = [0] * possible_k
    errors = [0] * possible_k
    fold = 1
    for train_index, test_index in kf.split(x_train):
        x_val = x_train.iloc[test_index]
        y_val = y_train.iloc[test_index]
        x_t = x_train.iloc[train_index]
        y_t = y_train.iloc[train_index]

        y_pred = fn_for_knn(x_val, x_t, y_t, possible_k, 'train', method)
        print("Cross-Validation batch", fold, "completed")

        # Average accuracy and error over the folds, for each candidate k
        for k in range(possible_k):
            accuracies[k] += accuracy_score(y_val, y_pred[k]) / num_folds * 100
            errors[k] += mean_squared_error(y_val, y_pred[k]) / num_folds * 100
        fold += 1
    return accuracies, errors

possible_k = 25
print("Cross-Validation STARTED\n")
accuracies, errors = fn_for_crossValidation(x_train, y_train, possible_k)
print("\nCross-Validation COMPLETED\n")
for k in range(possible_k):
    print('Validation Accuracy for k =', k + 1, "is:", accuracies[k])

best_k = np.argmax(accuracies) + 1
print("Best value for k:", best_k)
print("Maximum Validation Accuracy (Validation Accuracy for best k):", max(accuracies))
```

Validation accuracy vs k:
For different values of k, let’s perform cross-validation as shown,

```
possible_val_k = list(range(1, possible_k + 1))
plt.plot(possible_val_k, accuracies, color='black', linestyle='dotted', linewidth=2.5,
         marker='*', markerfacecolor='green', markersize=9)
plt.xlabel('K')
plt.ylabel('Validation Accuracy (%)')
plt.title('Validation Accuracy for different Values of K')
plt.show()
```

From the above plot, we can see that the optimal value of k is 4, as it gives the highest validation accuracy.

Accuracy for the best value of k:
As the optimal value of k is 4, the accuracy for k = 4 is given by,

```print("Best value for k:",best_k)
print("Maximum Validation Accuracy(Validation Accuracy for Best k):",max(accuracies))```
```Best value for k: 4
Maximum Validation Accuracy(Validation Accuracy for Best k): 92.56371814092952```

Accuracy for test data:
The accuracy for test data is,

```y_hat_test=fn_for_knn(x_test,x_train,y_train,best_k,'test')
print("\nTest Accuracy for k as",best_k,"is",accuracy_score(y_test, y_hat_test)*100)```
`Test Accuracy for k as 4 is 97.4512743628186`

Hence, by selecting k through cross-validation, we achieved a test accuracy of about 97%.
I hope it might be helpful for you. Thank you!