KNN Classification using Scikit-Learn in Python

Today we’ll learn KNN Classification using Scikit-learn in Python.
KNN stands for K Nearest Neighbors. The KNN algorithm can be used for both classification and regression problems. It assumes that similar data points lie in close proximity to each other, so points of the same class tend to cluster together.

Thus, when an unknown input is encountered, the classes of the known inputs in its proximity are checked, and the class with the highest count is assigned to the unknown input.

The algorithm first calculates the distances between the unknown point and all the known points. It then takes the k closest points, where the value of k is chosen by us. The classes of these k points then determine the class of our unknown point.
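
Before we use scikit-learn, it may help to see the idea as a few lines of plain Python. The following is only an illustrative sketch of the procedure described above (Euclidean distance, majority vote); the names knn_predict, X_train, y_train, and x_new are our own, not part of any library:

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # distance from the unknown point to every known point (Euclidean)
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # indices of the k closest known points
    nearest = np.argsort(distances)[:k]
    # majority vote among their labels decides the class
    return Counter(y_train[nearest]).most_common(1)[0][0]

Scikit-learn wraps this same idea (plus faster search structures and more options) in a ready-made class, which is what we will use below.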

So let’s start coding!

Importing Libraries:

The first thing we import from sklearn is the loader for the dataset we are going to work with. I chose the wine dataset because it is great for a beginner. You can also look at the other datasets provided by sklearn or import your own dataset.

The next import is train_test_split, which splits our dataset into a training set and a testing set.
Following this, we import the KNN classifier class itself.
Lastly, we import accuracy_score to check the accuracy of our KNN model.

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

Loading the dataset:

Now that we have finished importing our libraries, we load our dataset. A built-in dataset can be loaded by calling “load_<dataset_name>()”, which returns a Bunch object. In this case, our Bunch object is “wine”.

wine = load_wine()

We can now check a sample of the data and its shape using wine.data and wine.data.shape respectively.

print(wine.data)
print(wine.data.shape)

Output:

[[1.423e+01 1.710e+00 2.430e+00 ... 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 ... 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 ... 1.030e+00 3.170e+00 1.185e+03]
 ...
 [1.327e+01 4.280e+00 2.260e+00 ... 5.900e-01 1.560e+00 8.350e+02]
 [1.317e+01 2.590e+00 2.370e+00 ... 6.000e-01 1.620e+00 8.400e+02]
 [1.413e+01 4.100e+00 2.740e+00 ... 6.100e-01 1.600e+00 5.600e+02]]
(178, 13)

Now we know that our data consists of 178 entries and 13 columns. The columns are called features; they decide which class the corresponding input belongs to. The class here is called the target. So, we can now check the targets, target names, and feature names.

print(wine.target)
print(wine.target_names)
print(wine.feature_names)

Output:

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
['class_0' 'class_1' 'class_2']
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']

We notice that all the data inputs are divided into three classes: class 0, class 1, and class 2.
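
If you find the raw NumPy output hard to read, note that newer versions of scikit-learn (0.23 and later) can also return the data as a pandas DataFrame. A minimal sketch, assuming pandas is installed:

# optional: view the same data as a pandas DataFrame
# (requires scikit-learn >= 0.23 and pandas)
wine_frame = load_wine(as_frame=True)
print(wine_frame.frame.head())  # 13 feature columns plus a 'target' column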

Splitting the data into a training set and a testing set:

Now it is time for us to split our data into a training set and a testing set. This step is optional: you could train the model on the whole dataset, but then there would be no unseen data left on which to check the model's accuracy.
So, we put the data in the X variable and the targets in the y variable. We then split the data and targets into a training set and a testing set. The test_size parameter determines the fraction of the data that is held out for testing (here 0.2, i.e. 20%). Now we can check the shapes of the training set and the testing set.

X = wine.data
y = wine.target
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, test_size=0.2)
print(Xtrain.shape)
print(Xtest.shape)

Output:

(142, 13)
(36, 13)
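
Note that train_test_split shuffles the data randomly, so the exact rows in your split (and the accuracy you measure later) can change from run to run. If you want a reproducible split, you can pass the optional random_state parameter; stratify=y additionally keeps the class proportions similar in both sets. A small sketch (the seed value 42 is arbitrary):

# reproducible, class-balanced split (the seed value is arbitrary)
Xtrain, Xtest, ytrain, ytest = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)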

Applying the KNN Algorithm:

Now that we have split the data, we are ready to train the model. Since we are using the KNN algorithm, we first create a KNeighborsClassifier object. For more information on this class, visit its documentation.

Then we use the fit() method to train the model on the training data. Next, we test the model on the testing data: we call the predict() method and store the predicted targets in the yprediction variable. Finally, we get the accuracy of our prediction by comparing the predicted targets with the testing targets.

 

We have taken k=7. You can experiment with different values of k and check at what value of k you get the best accuracy.

k = 7
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(Xtrain, ytrain)
yprediction = knn.predict(Xtest)
print("accuracy= ", accuracy_score(ytest, yprediction))

Output:

accuracy=  0.8055555555555556

We have got an accuracy of about 0.806, which is pretty good! (Your exact number may differ slightly, since the train/test split is random.)

If you want to see how the accuracy varies with k for this dataset, look at the plot below:

[Plot: k values vs. accuracy for the wine dataset]
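
If you would like to generate such a plot yourself, here is one possible sketch, assuming matplotlib is installed and reusing the split and imports from above:

import matplotlib.pyplot as plt

# try a range of k values and record the test accuracy for each
ks = range(1, 31)
accuracies = []
for k in ks:
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(Xtrain, ytrain)
    accuracies.append(accuracy_score(ytest, model.predict(Xtest)))

plt.plot(ks, accuracies, marker="o")
plt.xlabel("k (number of neighbors)")
plt.ylabel("accuracy on the test set")
plt.title("k values vs. accuracy for the wine dataset")
plt.show()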

Predicting the target/class using a random user input:

Now we can give our model an unknown input and check its target class. We have used the random combination [3,4,1,3,100,1,4,0.3,2,12,1,1,400] and got the target as ‘Class 1’ wine.

x_user = [[3, 4, 1, 3, 100, 1, 4, 0.3, 2, 12, 1, 1, 400]]
y_user = knn.predict(x_user)
print("Class: ", wine.target_names[y_user])

Output:

Class:  ['class_1']

Try some inputs of your own and check out their targets. Now that you know how to train a KNN Classifier, you can run this program on different datasets too.
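
For example, only the loading step needs to change to try another built-in dataset such as iris; a minimal sketch, with everything after the loading step staying the same:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data    # 150 samples, 4 features
y = iris.target  # 3 classes of iris flowers
# ...then split, fit, predict, and score exactly as above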

 
