Nearest Neighbour Classifier - 2
k-Nearest Neighbour Classifier
For k ≥ 1:
Classify a new item x by a majority vote among its k nearest neighbours in the training data.
For k = 1 this is exactly the nearest neighbour classifier.
The algorithm is as follows:
1: Input: training data (x_0, L(x_0)), ..., (x_n, L(x_n))
2: Input: a new item x ∈ X to be classified
3: for i = 0 to n do
4:     compute the distance d(x, x_i)
5: end for
6: let x_{i_1}, ..., x_{i_k} be the items such that
7:     d(x, x_{i_j}) is among the k smallest distances
8: return the label L which occurs most frequently in L(x_{i_1}), ..., L(x_{i_k}) (majority vote)
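As a concrete illustration, here is a minimal Python sketch of this procedure, assuming numeric feature vectors and Euclidean distance; the function name and the toy data are illustrative, not from the slides.

import numpy as np
from collections import Counter

def knn_classify(train_X, train_y, x, k):
    """Classify x by a majority vote among its k nearest training items."""
    # Distance from x to every training item (Euclidean distance here).
    dists = np.linalg.norm(train_X - x, axis=1)
    # Indices of the k smallest distances.
    nearest = np.argsort(dists)[:k]
    # Majority vote over the labels of those neighbours.
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

# Toy example: two clusters labelled "a" and "b".
train_X = np.array([[0.0, 0.0], [1.0, 0.1], [5.0, 5.0], [5.2, 4.8]])
train_y = np.array(["a", "a", "b", "b"])
print(knn_classify(train_X, train_y, np.array([4.9, 5.1]), k=3))  # prints "b"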
Learning a “good” k
To learn a “good” value of k, we split the labelled data into two subsets: a training set and a validation set. (In the accompanying figure, the left is the training data and the right is the validation data.)
The classification error of a classifier is the proportion of items that it labels incorrectly. We are saying here that we want a value of k whose classifier has a small classification error on the validation data.
Let:
- c_1 be the 1-nearest neighbour classifier obtained from the training data,
- c_2 be the 2-nearest neighbour classifier obtained from the training data,
- and so on for c_3, c_4, ....
Choose the k for which c_k has the smallest classification error on the validation data.
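A minimal sketch of this selection procedure, reusing the knn_classify function above; each candidate k plays the role of the classifier c_k, and the names train_X, train_y, val_X, val_y are assumed arrays from a split made beforehand.

def validation_error(train_X, train_y, val_X, val_y, k):
    """Fraction of validation items that the k-NN classifier mislabels."""
    wrong = sum(knn_classify(train_X, train_y, x, k) != y
                for x, y in zip(val_X, val_y))
    return wrong / len(val_y)

# Choose the k whose classifier has the smallest validation error.
best_k = min(range(1, 11),
             key=lambda k: validation_error(train_X, train_y, val_X, val_y, k))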
Generalisation
- The aim of supervised learning is to do well on test data that is not known during learning.
- Choosing the values for parameters (here k) that minimise the classification error on the training data is not necessarily the best policy.
- We want the learning algorithm to model true regularities in the data and to ignore the noise in the data.
Training and Validation Data - 1
If we chose the value of k that was best for the training data, then we can see here that we would almost always go for k = 1: each training item is (trivially) its own nearest neighbour, so the training error of the 1-nearest neighbour classifier is essentially zero.
Training and Validation Data - 2
You can see that although the error on the training data goes up, the “general” error (on the validation data) goes down. We want to increase k up to the point where the validation error is smallest.
The line also gets smoother as k increases.
Properties and Training
As k increases:
- Classification boundaries become smoother (possibly reflecting regularity in the data).
- Error on the training data can increase.
Summary
Advantages
- k-nearest neighbour is simple but effective.
- The decision surface can be non-linear.
- Only a single parameter, k, which is easily learned by cross validation (see the sketch below).
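As a sketch of that last point, here is one way to pick k by cross validation with scikit-learn; the library choice, the 5-fold setting, and the range of candidate values are our assumptions, not part of the slides.

from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def choose_k(X, y, candidates=range(1, 16)):
    """Return the k with the best mean 5-fold cross-validated accuracy."""
    def mean_accuracy(k):
        # Assumed data: X is an array of feature vectors, y the labels.
        return cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    return max(candidates, key=mean_accuracy)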
Disadvantages
- What does “nearest” mean? We need to specify a distance measure (two common choices are sketched below).
- Computational cost: we must store and search through the entire training set at test time.
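To make the distance-measure point concrete, here are two common choices of d; these examples are ours, not from the slides, and either could replace the Euclidean distance hard-coded in knn_classify above.

import numpy as np

def euclidean(a, b):
    """Straight-line distance; sensitive to the scale of each feature."""
    return np.linalg.norm(a - b)

def manhattan(a, b):
    """Sum of per-coordinate absolute differences; an alternative measure."""
    return np.abs(a - b).sum()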