Debaiudh Das, 5th Yr. Int. MSc., School of
Physical Sciences, NISER | debaiudh.das@niser.ac.in
[Instructor: Dr. Subhankar Mishra, Asst. Prof., School of Computer
Sciences (CSS), NISER]
At the two extremes of a classification problem, one may either possess "the complete statistical knowledge of the underlying joint distribution of the observation and the true category, or he may have no knowledge of the underlying distribution except that which can be inferred from samples." The former condition is ideally suited to the standard Bayes analysis, which yields a decision rule with minimum error, while the latter must be analysed with non-parametric statistical methods. The nearest neighbour approach is one such non-parametric method; its asymptotic error rate is at most twice the Bayes error, which makes it remarkably effective in spite of being non-parametric.
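For reference, the bound alluded to here is the Cover–Hart result: writing $R^*$ for the Bayes risk, $R$ for the asymptotic nearest neighbour risk, and $M$ for the number of classes,
\[
R^* \;\le\; R \;\le\; R^*\left(2 - \frac{M}{M-1}\,R^*\right) \;\le\; 2R^*.
\]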
K-nearest neighbour is a simple algorithm that stores all the available cases and classifies new data based on a similarity measure; in other words, it classifies a data point according to how its neighbours are classified.
The nearest neighbour algorithm is a supervised classification algorithm that is non-parametric and instance-based.
Intuition: "I will do whatever my nearest neighbour does," or more precisely, the label of an example should be similar to the labels of nearby points.
Problem: We are given such a dataset and must find the label of the circled data point using this intuition.
The iterative and laborious solution: We can grow a small region around the point of interest and compute the distances to the neighbours encountered, enlarging the region iteratively.
This procedure is computationally intensive and time-consuming, so we look for a better way to solve the problem.
The second approach: We compute the distance from the circled point to every data point, find the minimum, and take the label of that closest point. Here we use the L2 norm, but in general any norm can be used, depending on the classification problem at hand. (Note that this approach requires storing all the data points.)
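A minimal sketch of this brute-force nearest neighbour rule, using the L2 norm from the text; the NumPy arrays and toy data below are hypothetical, chosen only for illustration:

```python
import numpy as np

def nearest_neighbour_predict(X_train, y_train, x_query):
    """Predict the label of x_query as the label of its closest training point."""
    # L2 distance from the query point to every stored point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # The prediction is the label of the single closest point.
    return y_train[np.argmin(distances)]

# Toy data (hypothetical): two small clusters with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(nearest_neighbour_predict(X_train, y_train, np.array([0.8, 0.9])))  # -> 1
```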
Advantages: the method is simple, requires no training phase, and (as discussed below) can produce nonlinear decision boundaries.
Disadvantages: all training points must be stored, every prediction requires computing the distance to every stored point, and the decision rests on a single neighbour, so one noisy or mislabelled point can flip the result.
Hence we move on to the k-nearest neighbour approach: instead of taking just one neighbour, we take a vote among the k nearest neighbours, as sketched below.
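A minimal sketch of the voting rule, again on hypothetical NumPy data; ties are broken by whichever label Counter returns first:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by a majority vote among its k nearest neighbours."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [0.5, 0.4]])
y_train = np.array([0, 0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([0.6, 0.5]), k=3))  # -> 0
```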
k-NN can be sped up using different computational methods for finding the nearest neighbours, each with its own complexity: for example, scikit-learn's KD tree and ball tree structures answer neighbour queries over the dataset in roughly O[N log(N)] time, while the direct (brute-force) approach requires O[N^2] time.
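For illustration, a short sketch using scikit-learn's KDTree; the toy data and query are assumptions:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 2))           # 1000 random points in 2D

tree = KDTree(X)                    # built once up front
dist, ind = tree.query(X[:5], k=3)  # 3 nearest neighbours of the first 5 points
print(ind)  # note: each point's nearest neighbour is itself (distance 0)
```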
There are no parameters in this algorithm. The hyperparameters are: k, the number of neighbours considered, and the distance metric (the norm, e.g. the L2 norm used above).
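Since k and the metric are the only knobs, they are typically chosen by cross-validation; a sketch using scikit-learn's GridSearchCV on its bundled iris data (the grid values are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Search over the two hyperparameters named above: k and the distance metric.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9],
                "metric": ["euclidean", "manhattan"]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)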
The nonlinearity of kNN is intuitively clear from the example given above. The decision boundaries of kNN (the double lines in the figure) are locally linear segments, but in general they have a complex shape that is not equivalent to a line in 2D or a hyperplane in higher dimensions. Hence the decision boundary can consist of multiple, nonlinear pieces.
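A sketch that makes this visible: fit a 5-NN classifier on scikit-learn's two-moons toy data and evaluate it on a dense grid, so the locally linear but globally nonlinear regions appear in the contour plot (dataset and parameter values are illustrative assumptions, not from the text):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Evaluate the classifier on a dense grid to reveal the decision boundary.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - .5, X[:, 0].max() + .5, 300),
                     np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)      # piecewise-linear, globally nonlinear regions
plt.scatter(X[:, 0], X[:, 1], c=y, s=15)
plt.show()
```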