k Nearest Neighbor Classification
◼ To classify document d into class c:
◼ Define k-neighborhood N as the k nearest neighbors of d
◼ Count the number of documents i in N that belong to c
◼ Estimate P(c|d) as i/k
◼ Choose as class argmax_c P(c|d) [= majority class]
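A minimal sketch of this procedure in Python, assuming documents are already represented as vectors (e.g., tf-idf) and using cosine similarity as the distance measure; all function and variable names here are illustrative, not from the slides:

```python
from collections import Counter
import numpy as np

def knn_classify(d, train_vectors, train_labels, k=6):
    """Classify document vector d by the majority class of its k nearest neighbors.

    train_vectors: (n_docs, n_features) array of training document vectors
    train_labels:  sequence of class labels, one per training document
    """
    # Cosine similarity between d and every training document
    norms = np.linalg.norm(train_vectors, axis=1) * np.linalg.norm(d)
    sims = (train_vectors @ d) / np.where(norms == 0, 1.0, norms)

    # k-neighborhood N: indices of the k most similar training documents
    neighborhood = np.argsort(-sims)[:k]

    # Count the documents i in N belonging to each class c; estimate P(c|d) = i / k
    counts = Counter(train_labels[i] for i in neighborhood)
    p_c_given_d = {c: i / k for c, i in counts.items()}

    # Choose argmax_c P(c|d), i.e., the majority class in the neighborhood
    return max(p_c_given_d, key=p_c_given_d.get), p_c_given_d
```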
Example: k=6 (6NN)
[Figure: a test document among training documents labeled Government, Science, and Arts; what is P(science | d) for the test document?]
Nearest-Neighbor Learning Algorithm
◼ Learning is just storing the representations of the training examples in D.
◼ Testing instance x:
◼ Compute similarity between x and all examples in D.
◼ Assign x the category of the most similar example in D.
◼ Does not explicitly compute a generalization or category prototypes.
◼ Also called:
◼ Case-based learning
◼ Memory-based learning
◼ Lazy learning
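A sketch of the "lazy learning" idea for the 1-NN case: training only stores D, and all similarity computation is deferred to test time. This is an illustrative implementation, again assuming vector representations and cosine similarity:

```python
import numpy as np

class NearestNeighborClassifier:
    """1-NN lazy learner: 'training' just stores the examples in D (illustrative sketch)."""

    def fit(self, vectors, labels):
        # Learning = storing the training representations; no generalization
        # or category prototypes are computed here.
        self.vectors = np.asarray(vectors, dtype=float)
        self.labels = list(labels)
        return self

    def predict(self, x):
        # Compute cosine similarity between test instance x and all stored examples in D
        x = np.asarray(x, dtype=float)
        norms = np.linalg.norm(self.vectors, axis=1) * np.linalg.norm(x)
        sims = (self.vectors @ x) / np.where(norms == 0, 1.0, norms)
        # Assign x the category of the most similar stored example
        return self.labels[int(np.argmax(sims))]
```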
Why k?
◼ Using only the closest example to determine the categorization is subject to errors due to:
◼ A single atypical example.
◼ Noise (i.e., error) in the category label of a single training example.
◼ A more robust alternative is to find the k most similar examples and return the majority category of these k examples.
◼ Value of k is typically odd to avoid ties; 3 and 5 are most common.
kNN decision boundaries
[Figure: regions for the Government, Science, and Arts classes separated by kNN decision boundaries]
◼ Boundaries are in principle arbitrary surfaces – but usually polyhedra