K-Nearest Neighbor Algorithm

Introduction

K-Nearest Neighbor (KNN) is a simple algorithm that stores all the available cases and classifies new data points based on a similarity measure.

For example, an apple is similar to a banana or a grape, since all of them are fruits; an apple is not similar to a monkey or an elephant. The point is to measure similarity based on the features that matter. A real-world example: when we look at a product on Amazon, it shows similar products for convenience, and that kind of similarity lookup can be implemented internally with a KNN-style algorithm.

KNN is extremely easy to implement in its most basic form, and yet performs quite complex classification tasks. It is a lazy learning algorithm since it doesn't have a specialized training phase. Rather, it uses all of the data for training while classifying a new data point or instance.

Because k-NN is a memory-based approach, the classifier immediately adapts as we collect new training data. The trade-off is that the computational cost of classifying a new sample grows linearly with the number of samples in the training dataset in the worst case.

The training phase of the algorithm consists only of storing the feature vectors and class labels of the training samples. In the testing phase, a test point is classified by assigning it the label that is most frequent among the k training samples nearest to that query point; this is where the higher computational cost comes in.
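
As a concrete illustration of this store-then-vote behavior, here is a minimal sketch using scikit-learn's KNeighborsClassifier (the toy points and labels below are made up for demonstration):

```python
from sklearn.neighbors import KNeighborsClassifier

# Toy training data: two features per sample, two classes (0 and 1).
X_train = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0], [1.2, 0.9], [5.5, 7.5]]
y_train = [0, 0, 1, 1, 0, 1]

# "Training" is just storing the feature vectors and labels.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)

# All the real work happens at prediction time: the 3 nearest stored
# samples are found and their majority label is returned.
print(knn.predict([[1.1, 1.5], [5.8, 8.2]]))  # -> [0 1]
```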

How does the KNN algorithm work?

Here Class A and Class B are two different products. Based on the value of k, we need to determine whether the star (the new query point) belongs to Class A or Class B.
KNN algorithm at k = 3

With k = 3, the nearest neighbors of the star are 2 orange points and 1 blue point. The majority is orange, so the star belongs to the orange group, i.e. Class B.
We can find the distance between two points using the Euclidean distance. Both the Euclidean and Manhattan distances are used for continuous variables, whereas the Hamming distance is used for categorical variables.
Euclidean distance: d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
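
As a quick sketch, the three distance measures mentioned above can be computed directly; the example vectors here are invented for illustration:

```python
import numpy as np

def euclidean(p, q):
    # Straight-line distance: sqrt of the sum of squared differences.
    return np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan(p, q):
    # City-block distance: sum of absolute differences.
    return np.sum(np.abs(np.asarray(p) - np.asarray(q)))

def hamming(p, q):
    # For categorical features: count of positions that differ.
    return sum(a != b for a, b in zip(p, q))

print(euclidean([1, 2], [4, 6]))            # 5.0
print(manhattan([1, 2], [4, 6]))            # 7
print(hamming(["red", "S"], ["red", "M"]))  # 1
```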

KNN algorithm at k = 6

At k = 6, the neighborhood contains 4 blue points and 2 orange points. Here the star belongs to Class A.
If the value of k is very large, we may include points from other classes in the neighborhood.
If the value of k is too small, the algorithm is very sensitive to noise.
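
Putting the pieces together, a minimal from-scratch sketch of the majority-vote procedure described above might look like this (the function and variable names are my own choices, not from a particular library):

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    X_train = np.asarray(X_train, dtype=float)
    # Euclidean distance from the query to every stored training point.
    distances = np.sqrt(np.sum((X_train - np.asarray(query, dtype=float)) ** 2, axis=1))
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Tiny invented dataset: "B" plays the role of the orange class.
X = [[1, 1], [2, 1], [1, 2], [6, 6], [7, 6], [6, 7]]
y = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(X, y, query=[6.5, 6.2], k=3))  # -> "B"
```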

Pros and Cons of KNN

In this section we'll present some of the pros and cons of using the KNN algorithm.

Pros

  • It is extremely easy to implement
  • As said earlier, it is a lazy learning algorithm and therefore requires no training prior to making real-time predictions. This makes the KNN algorithm much faster than algorithms that require training, e.g. SVM, linear regression, etc.
  • Since the algorithm requires no training before making predictions, new data can be added seamlessly.
  • There are only two parameters required to implement KNN, i.e. the value of k and the distance function (e.g. Euclidean or Manhattan).
  • The decision boundary becomes smoother as the value of k increases.
  • The training time is the same for any value of k in the kNN algorithm: 1-NN ~ 2-NN ~ 3-NN.
  • We can also use k-NN for regression problems. In this case the prediction can be based on the mean or the median of the k most similar instances (see the sketch after this list).
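
As a brief illustration of the regression case, here is a sketch using scikit-learn's KNeighborsRegressor, which by default averages the targets of the k nearest neighbors (the data below is invented):

```python
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D regression data: target is roughly 2 * feature.
X_train = [[1.0], [2.0], [3.0], [4.0], [5.0]]
y_train = [2.1, 3.9, 6.2, 8.0, 9.8]

# With k=2, the prediction is the mean of the 2 nearest targets.
reg = KNeighborsRegressor(n_neighbors=2)
reg.fit(X_train, y_train)

# Query at 3.5: the nearest neighbors are 3.0 and 4.0,
# so the prediction is (6.2 + 8.0) / 2 = 7.1.
print(reg.predict([[3.5]]))  # -> [7.1]
```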

Cons

  • The KNN algorithm doesn't work well with high-dimensional data, because with a large number of dimensions distances between points become less informative and more expensive to compute.
  • The KNN algorithm has a high prediction cost for large datasets, because the distance between the new point and every existing point must be calculated.
  • The KNN algorithm doesn't work well with categorical features, since it is difficult to define a meaningful distance between categorical values.
  • We have to ensure that the value of k is neither too high nor too low.
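
Since the last point (choosing k) is easy to get wrong by hand, a common approach is to try several values of k with cross-validation. Here is a minimal sketch using scikit-learn; the dataset and the candidate range of k are just examples:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Score each candidate k with 5-fold cross-validation and keep the best.
scores = {}
for k in range(1, 16):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores[k] = cross_val_score(knn, X, y, cv=5).mean()

best_k = max(scores, key=scores.get)
print(best_k, scores[best_k])
```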