K-nearest neighbors algorithm


K-nearest neighbors algorithm (K-NN) is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until function evaluation. It is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether K-NN is used for classification or regression:

  • In K-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
  • In K-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors.
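
As a toy illustration of the two output types, the following plain-Python sketch (with made-up neighbor labels and values) takes a plurality vote for classification and an average for regression:

    from collections import Counter
    from statistics import mean

    # Hypothetical labels/values of an object's k = 3 nearest neighbors.
    neighbor_classes = ["spam", "ham", "spam"]
    neighbor_values = [4.0, 5.5, 6.5]

    # Classification: assign the class most common among the k neighbors.
    predicted_class = Counter(neighbor_classes).most_common(1)[0][0]   # "spam"

    # Regression: average the neighbors' values.
    predicted_value = mean(neighbor_values)                            # 5.333...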

Because K-NN is a lazy learner, it builds no explicit model during training; all computation is deferred until a query must be classified. Despite being among the simplest of all machine learning algorithms, it can achieve competitive accuracy in many real-world scenarios.

Algorithm

The K-NN working algorithm is as follows:

  1. Load the data.
  2. Initialize K to your chosen number of neighbors.
  3. For each example in the data:
     1. Calculate the distance between the query example and the current example from the data.
     2. Add the distance and the index of the example to an ordered collection.
  4. Sort the ordered collection of distances and indices from smallest to largest (ascending) by distance.
  5. Pick the first K entries from the sorted collection.
  6. Get the labels of the selected K entries.
  7. If regression, return the mean of the K labels.
  8. If classification, return the mode of the K labels.
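
The following is a minimal Python sketch of these steps. It assumes Euclidean distance and small in-memory lists; the function name knn_predict and the sample data are illustrative, not taken from any particular library:

    import math
    from collections import Counter
    from statistics import mean

    def knn_predict(training_data, query, k, task="classification"):
        """Predict a class label (classification) or value (regression) for `query`.

        training_data: list of (feature_vector, label) pairs.
        """
        # Steps 3a-3b: distance from the query to every training example,
        # stored together with the example's index.
        distances = [(math.dist(query, features), index)
                     for index, (features, _label) in enumerate(training_data)]

        # Step 4: sort by distance in ascending order.
        distances.sort(key=lambda pair: pair[0])

        # Steps 5-6: take the first K entries and look up their labels.
        k_labels = [training_data[index][1] for _dist, index in distances[:k]]

        # Steps 7-8: mean for regression, mode (plurality vote) for classification.
        if task == "regression":
            return mean(k_labels)
        return Counter(k_labels).most_common(1)[0][0]

    # Hypothetical two-dimensional training data: (feature vector, class label).
    data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((4.1, 3.9), "B")]
    print(knn_predict(data, (1.1, 1.0), k=3))   # -> "A"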

Choosing the right value of K

Choosing the right value of K is a critical step in the K-NN algorithm. A small value of K lets noise in the training data strongly influence the result, while a very large value smooths over local structure and makes each prediction more expensive to compute. In practice, K is usually tuned with methods such as cross-validation; a common rule of thumb is to start with K near the square root of n, where n is the total number of training points.
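
As one way to do this (a sketch assuming scikit-learn and its bundled iris dataset), candidate values of K can be scored by cross-validation and the best-scoring one kept:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)

    # The sqrt(n) rule of thumb gives a starting point for the search range.
    heuristic_k = round(len(X) ** 0.5)   # about 12 for 150 training points

    # Score each candidate K by 5-fold cross-validated accuracy and keep the best.
    scores = {
        k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
        for k in range(1, 2 * heuristic_k + 1)
    }
    best_k = max(scores, key=scores.get)
    print(best_k, round(scores[best_k], 3))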

Distance Measures

The distance measure between two instances can be crucial to the performance of K-NN. Common measures include the Euclidean, Manhattan, and Minkowski distances; which one is most appropriate depends on the type of data.
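
For reference, a plain-Python sketch of these three measures; the Minkowski distance generalizes the other two through its order parameter p:

    import math

    def euclidean(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def manhattan(a, b):
        return sum(abs(x - y) for x, y in zip(a, b))

    def minkowski(a, b, p):
        # p = 1 gives the Manhattan distance, p = 2 the Euclidean distance.
        return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

    print(euclidean((0, 0), (3, 4)))      # 5.0
    print(manhattan((0, 0), (3, 4)))      # 7
    print(minkowski((0, 0), (3, 4), 3))   # ~4.50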

Advantages and Disadvantages

Advantages

  • Simple to understand and implement
  • No explicit training phase; the model simply stores the training data
  • Naturally handles multi-class cases
  • Can be effective if the training dataset is large

Disadvantages

  • Prediction becomes significantly slower as the number of examples and/or features (independent variables) increases
  • Sensitive to the scale of the data and to irrelevant features
  • Requires enough memory to store the entire training dataset at prediction time
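
The sensitivity to feature scale is commonly mitigated by standardizing features before distances are computed; a minimal sketch, assuming scikit-learn:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Standardize each feature to zero mean and unit variance so that no single
    # feature dominates the distance computation, then apply K-NN.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    # model.fit(X_train, y_train); model.predict(X_test)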

Applications

K-NN can be used in a wide variety of applications, for example pattern recognition and recommendation systems.

