Debaiudh Das, 5th Yr. Int. MSc., School of
Physical Sciences, NISER | debaiudh.das@niser.ac.in
[Instructor: Dr. Subhankar Mishra, Asst. Prof., School of Computer
Sciences (CSS), NISER]
At the two extremes of a classification problem, one may either possess "the complete statistical knowledge of the underlying joint distribution of the observation and the true category, or he may have no knowledge of the underlying distribution except that which can be inferred from samples." The former condition is ideally suited to the standard Bayes analysis, which yields a decision rule with minimum error, while the latter must be analysed with non-parametric statistical methods. The nearest neighbour approach is one such non-parametric method; its asymptotic error rate is at most twice the Bayes error, which makes it remarkably effective in spite of being non-parametric.
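For reference, the bound alluded to here is the Cover–Hart result: writing $R^*$ for the Bayes risk, $R$ for the asymptotic nearest neighbour risk, and $M$ for the number of classes,
\[
R^* \;\le\; R \;\le\; R^*\left(2 - \frac{M}{M-1}\,R^*\right) \;\le\; 2R^*.
\]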
K-nearest neighbour is a simple algorithm that stores all the available cases and classifies new data based on a similarity measure; in other words, it classifies a data point according to how its neighbours are classified.
The nearest neighbour algorithm is a supervised classification algorithm that is non-parametric and instance-based.
Intuition: "I will do whatever my nearest neighbour does," or more precisely, the label of an example should be similar to the labels of nearby points.
Problem: We are given such a dataset and must find the label of the circled data point using this intuition.
The iterative and laborious solution: We can grow a small region around the point of interest and compute the distances to the neighbours encountered, enlarging the region iteratively.
This procedure is computationally intensive and time-consuming, so we look for a better way to solve the problem.
The second approach: We compute the distance from the circled point to every data point, find the minimum, and take the label of that closest point. Here we use the L2 norm, but in general any norm can be used, depending on the classification problem at hand. (Note that this approach requires storing all the data points.)
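A minimal sketch of this brute-force nearest neighbour rule, using the L2 norm from the text; the NumPy arrays and toy data below are hypothetical, chosen only for illustration:

```python
import numpy as np

def nearest_neighbour_predict(X_train, y_train, x_query):
    """Predict the label of x_query as the label of its closest training point."""
    # L2 distance from the query point to every stored point.
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # The prediction is the label of the single closest point.
    return y_train[np.argmin(distances)]

# Toy data (hypothetical): two small clusters with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(nearest_neighbour_predict(X_train, y_train, np.array([0.8, 0.9])))  # -> 1
```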
Advantages: the method is simple, requires no training phase, and (as discussed below) can produce nonlinear decision boundaries.
Disadvantages: all training points must be stored, every prediction requires computing the distance to every stored point, and the decision rests on a single neighbour, so one noisy or mislabelled point can flip the result.
Hence we move on to the k-nearest neighbour approach: instead of taking just one neighbour, we take a vote among the k nearest neighbours, as sketched below.
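A minimal sketch of the voting rule, again on hypothetical NumPy data; ties are broken by whichever label Counter returns first:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by a majority vote among its k nearest neighbours."""
    distances = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(distances)[:k]
    # Majority vote over the neighbours' labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1], [0.5, 0.4]])
y_train = np.array([0, 0, 1, 1, 0])
print(knn_predict(X_train, y_train, np.array([0.6, 0.5]), k=3))  # -> 0
```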
k-NN can be sped up using different computational methods for finding the nearest neighbours, each with its own complexity: for example, scikit-learn's KD tree and ball tree structures answer neighbour queries over the dataset in roughly O[N log(N)] time, while the direct (brute-force) approach requires O[N^2] time.
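For illustration, a short sketch using scikit-learn's KDTree; the toy data and query are assumptions:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.random((1000, 2))           # 1000 random points in 2D

tree = KDTree(X)                    # built once up front
dist, ind = tree.query(X[:5], k=3)  # 3 nearest neighbours of the first 5 points
print(ind)  # note: each point's nearest neighbour is itself (distance 0)
```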
There are no parameters in this algorithm. The hyperparameters are: k, the number of neighbours considered, and the distance metric (the norm, e.g. the L2 norm used above).
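Since k and the metric are the only knobs, they are typically chosen by cross-validation; a sketch using scikit-learn's GridSearchCV on its bundled iris data (the grid values are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Search over the two hyperparameters named above: k and the distance metric.
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [1, 3, 5, 7, 9],
                "metric": ["euclidean", "manhattan"]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)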
The nonlinearity of kNN is intuitively clear from the example given above. The decision boundaries of kNN (the double lines in the figure) are locally linear segments, but in general they have a complex shape that is not equivalent to a line in 2D or a hyperplane in higher dimensions. Hence the decision boundary can consist of multiple, nonlinear pieces.
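A sketch that makes this visible: fit a 5-NN classifier on scikit-learn's two-moons toy data and evaluate it on a dense grid, so the locally linear but globally nonlinear regions appear in the contour plot (dataset and parameter values are illustrative assumptions, not from the text):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)

# Evaluate the classifier on a dense grid to reveal the decision boundary.
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - .5, X[:, 0].max() + .5, 300),
                     np.linspace(X[:, 1].min() - .5, X[:, 1].max() + .5, 300))
Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)      # piecewise-linear, globally nonlinear regions
plt.scatter(X[:, 0], X[:, 1], c=y, s=15)
plt.show()
```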