Specifying K = the number of nearest neighbors for a k-Nearest Neighbors algorithm
- There is no structured method to find the best value of k. In practice, we try several candidate values on the training data and measure the resulting accuracy on held-out test data.
- A value of k that is too small (e.g. k = 1 for 100 samples) produces noisy results: each prediction is decided by very few neighbors, so a single outlier or mislabeled point has an outsized influence on the result.
- A value of k that is too large pulls distant points into the neighborhood that tell us little about the query, which can lead to a poor prediction for a given sample. Larger values of k produce smoother decision boundaries, which means lower variance but increased bias, and they are also more computationally expensive. (The curse of dimensionality is a separate issue: it refers to too many features, not too large a k, making distances less informative and degrading k-NN predictions.)
- Another way to choose k is cross-validation. Hold out a small portion of the training dataset and call it a validation set, then use it to evaluate different candidate values of k: predict the label of every instance in the validation set with k = 1, k = 2, k = 3, and so on. The value of k that gives the best performance on the validation set, i.e. the one that minimizes the validation error, is the one we use in the final algorithm (see the first sketch after this list).
- In general practice, a common rule of thumb is to choose k = √N, where N stands for the number of samples in your training dataset.
- The Elbow method can also be used to select the optimal k: plot the error rate over a range of k values and pick the k at the "elbow" of the curve, where the error stops improving appreciably (see the second sketch after this list).
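
To make the cross-validation approach concrete, here is a minimal sketch using scikit-learn's KNeighborsClassifier and cross_val_score. The Iris dataset, the candidate range 1 to 20, and the 5-fold split are illustrative assumptions, not choices made in the text above.

```python
# Minimal sketch: pick k by cross-validation (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

best_k, best_score = 1, 0.0
for k in range(1, 21):  # candidate values; √N is a common starting point
    knn = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 folds serves as the validation
    # performance for this candidate k.
    score = cross_val_score(knn, X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score

print(f"best k = {best_k}, validation accuracy = {best_score:.3f}")
```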
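
And a minimal sketch of the Elbow method, assuming scikit-learn and matplotlib are available; the train/test split, the k range, and the dataset are again illustrative. The plotted curve typically drops steeply for small k and then flattens; the k at the bend is the one to pick.

```python
# Minimal sketch: Elbow method, plotting error rate against k.
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

ks = range(1, 21)
errors = []
for k in ks:
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # Error rate = 1 - accuracy on the held-out test set.
    errors.append(1 - knn.score(X_test, y_test))

plt.plot(ks, errors, marker="o")
plt.xlabel("k")
plt.ylabel("error rate")
plt.title("Elbow method: pick k where the error stops improving")
plt.show()
```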
Tags
Data Science
Related
Specifying a distance metric for k-Nearest Neighbors algorithm
Specifying the optional weighting function on the neighbor points for a k-Nearest Neighbors algorithm
Specifying a method for aggregating the classes of neighbor points for a k-Nearest Neighbors algorithm