CS460/660- Machine Learning

K-Nearest Neighbour | 13th October 2020

Debaiudh Das, 5th Yr. Int. MSc., School of Physical Sciences, NISER | debaiudh.das@niser.ac.in
[Instructor: Dr. Subhankar Mishra, Asst. Prof., School of Computer Sciences (CSS), NISER]


K-Nearest Neighbour



Jump to:
    Introduction
    A Problem
    Pros and Cons
    Hyperparameters
    Computational Analysis

Introduction

At the two extremes of a classification problem, one may either possess "the complete statistical knowledge of the underlying joint distribution of the observation and the true category", or "no knowledge of the underlying distribution except that which can be inferred from samples" [1]. The former situation is ideally suited to the standard Bayes analysis, which yields a decision rule with minimum probability of error, while the latter is handled with non-parametric statistical methods. The nearest neighbour rule is one such non-parametric method: its asymptotic error is less than twice the Bayes error, which makes it a very attractive method despite assuming nothing about the underlying distribution.
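For a problem with M classes and Bayes error R*, the Cover-Hart result cited in [1] bounds the asymptotic error R of the single-nearest-neighbour rule as

    R* <= R <= R*(2 - M R*/(M - 1)) <= 2 R*,

which is the "less than twice the Bayes error" guarantee referred to above.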

k-Nearest Neighbour is a simple algorithm that stores all the available cases and classifies a new case based on a similarity measure. In other words, it classifies a data point according to how its neighbours are classified.


Fig.1 Example of k-NN classification. The test sample (green dot) should be classified either to blue squares or to red triangles. If k = 3 (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle).
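The scenario in the figure can be reproduced in a few lines. In the sketch below, the coordinates, the query point and the use of scikit-learn's KNeighborsClassifier are illustrative choices of ours rather than part of the original example:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    # Hypothetical coordinates mimicking Fig.1: class 0 = "blue squares", class 1 = "red triangles".
    X = np.array([[3.0, 3.5], [2.5, 1.5], [2.0, 3.0], [5.0, 2.5], [4.5, 4.0],   # squares
                  [3.2, 2.2], [3.8, 2.8], [6.0, 5.0]])                          # triangles
    y = np.array([0, 0, 0, 0, 0, 1, 1, 1])
    query = np.array([[3.5, 2.5]])   # the green test point

    for k in (3, 5):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
        print(k, clf.predict(query))   # k = 3 picks the triangles, k = 5 picks the squares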


Nearest Neighbour algorithm is a supervised classification algorithm which is non-parametric and instance based.

Intuition: "I will do whatever my nearest neighbour does", or more precisely, the label of an example should be similar to the labels of nearby points.


A Problem

Problem: We are given the dataset shown below and have to find the label of the circled data point using this intuition.


Fig.2 The Classification Problem.

The iterative and laborious solution: we can generate a small area around the point of interest and compute the distances to the neighbours encountered, iteratively enlarging the area until a neighbour is found.


Fig.3 The iterative approach.

This procedure is computationally intensive and time consuming, so we look for a better way to solve the problem.

The second approach: we compute the distance from the circled data point to every point in the dataset, find the minimum, and assign the label of that closest point. Here we use the L2 norm, but in general any norm can be used, depending on the classification problem at hand. (Note that this approach requires storing all of the data points.)
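A minimal NumPy sketch of this second approach; the names X, y and query are ours, purely for illustration:

    import numpy as np

    def nearest_neighbour_label(X, y, query):
        # Distance from the query to every stored training point (all data must be kept in memory);
        # the label of the closest point is returned.
        distances = np.linalg.norm(X - query, axis=1)
        return y[np.argmin(distances)]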


Pros and Cons

Advantages:
    Simple and intuitive, with no explicit training phase: "training" amounts to storing the data.
    Non-parametric: it makes no assumption about the underlying data distribution.
    New labelled data can be added at any time without retraining.

Disadvantages:
    Every prediction requires storing and scanning all of the training points, which is slow and memory-hungry for large datasets.
    Sensitive to the choice of distance metric and to feature scaling.
    Sensitive to noisy points and outliers, as the figure below illustrates: a single nearest neighbour that happens to be an outlier can flip the predicted label.


Fig.4 The same problem with outlier points (circled).

Hence we move on to the k-Nearest Neighbour approach: instead of relying on just one neighbour, we take a vote among the k nearest neighbours to tackle this problem, as sketched below.
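Extending the earlier sketch from a single neighbour to a majority vote among the k nearest (again, the names are illustrative):

    import numpy as np
    from collections import Counter

    def knn_label(X, y, query, k=3):
        # Majority vote among the k training points closest to the query (L2 norm).
        distances = np.linalg.norm(X - query, axis=1)
        nearest = np.argsort(distances)[:k]              # indices of the k smallest distances
        return Counter(y[nearest]).most_common(1)[0][0]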

There are ways to speed up k-NN by using different data structures to find the nearest neighbours, each with its own complexity. For example, scikit-learn's KD Tree and Ball Tree methods find nearest neighbours in roughly O[N log(N)] time, while the brute-force approach requires O[N^2] time.
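A small sketch of the scikit-learn structures mentioned above; the dataset size, dimensionality and query points are arbitrary choices:

    import numpy as np
    from sklearn.neighbors import KDTree, BallTree

    rng = np.random.default_rng(0)
    X = rng.random((10000, 3))            # 10,000 points in 3 dimensions, arbitrary for illustration

    tree = KDTree(X)                      # built once; subsequent queries avoid the full O[N^2] scan
    dist, ind = tree.query(X[:5], k=3)    # 3 nearest neighbours of the first 5 points

    ball = BallTree(X)                    # Ball Tree is used the same way and copes better in high dimensions
    dist_b, ind_b = ball.query(X[:5], k=3)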


Hyperparameters and Non-Linearity

There are no learned parameters in this algorithm; the model simply memorises the training data. The hyperparameters are:
    k, the number of neighbours that vote on the label;
    the distance metric, i.e. the norm used to measure closeness (the L2 norm above, though any suitable norm can be used);
    optionally, the weighting of the neighbours' votes (uniform versus distance-weighted).
A sketch of how these map onto scikit-learn is given below.
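As a sketch, these hyperparameters map directly onto the constructor arguments of scikit-learn's KNeighborsClassifier; the particular values below are arbitrary:

    from sklearn.neighbors import KNeighborsClassifier

    clf = KNeighborsClassifier(
        n_neighbors=5,         # k: the number of neighbours that vote
        metric="minkowski",    # distance metric; with p=2 this is the L2 (Euclidean) norm
        p=2,
        weights="distance",    # "uniform" counts every neighbour equally, "distance" favours closer points
    )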


Fig.5 Classification problem highlighting the multiplicity and non-linearity of the decision boundary of the k-NN algorithm.

The non-linearity of k-NN is intuitively clear from the example above. The decision boundaries of k-NN (the double lines in the figure) are locally linear segments, but in general they form a complex shape that is not equivalent to a line in 2D or a hyperplane in higher dimensions. Hence the decision boundary can consist of multiple pieces and is, in general, non-linear.


Computational Analysis

Example code implementation of k-NN:

Python code implementing the k-NN algorithm on the iris dataset, with a comparison of the error rate against the value of k.
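A minimal sketch in that spirit, using scikit-learn; the train/test split ratio, the random seed and the range of k values are our own choices:

    import numpy as np
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    # Load the iris dataset and hold out a test set.
    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

    # Fit a k-NN classifier for each k and record the error rate on the held-out set.
    error_rates = []
    for k in range(1, 26):
        clf = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
        error = np.mean(clf.predict(X_test) != y_test)
        error_rates.append(error)
        print(f"k = {k:2d}   error rate = {error:.3f}")

    # The k with the lowest held-out error is a reasonable choice for the hyperparameter.
    best_k = int(np.argmin(error_rates)) + 1
    print("best k:", best_k)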


References:

  1. T. Cover and P. Hart, "Nearest neighbor pattern classification," in IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21-27, January 1967, doi: 10.1109/TIT.1967.1053964.
  2. Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, "Introduction to Information Retrieval", Cambridge University Press. 2008.
  3. Pros and Cons of KNN algorithm.
  4. KNN implementation in Python.
  5. Wikipedia