Classification-A Two-Step Process Model construction:describing a set of predetermined classes Each tuple/sample is assumed to belong to a predefined class,as determined by the class label attribute The set of tuples used for model construction is training set The model is represented as classification rules,decision trees,or mathematical formulae Model usage:for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model ■ Accuracy rate is the percentage of test set samples that are correctly classified by the model ■ Test set is independent of training set,otherwise over-fitting will occur If the accuracy is acceptable,use the model to classify data tuples whose class labels are not known
Classification—A Two-Step Process Model construction: describing a set of predetermined classes Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute The set of tuples used for model construction is training set The model is represented as classification rules, decision trees, or mathematical formulae Model usage: for classifying future or unknown objects Estimate accuracy of the model The known label of test sample is compared with the classified result from the model Accuracy rate is the percentage of test set samples that are correctly classified by the model Test set is independent of training set, otherwise over-fitting will occur If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
1.Nearest Neighbor Classifiers Basic idea:If it walks like a duck,quacks like a duck,then it's probably a duck Compute Distance Test Record P Training Choose k of the Records “nearest'”records
1. Nearest Neighbor Classifiers Basic idea: If it walks like a duck, quacks like a duck, then it’s probably a duck Training Records Test Record Compute Distance Choose k of the “nearest” records
Definition of Nearest Neighbor (a)1-nearest neighbor (b)2-nearest neighbor (c)3-nearest neighbor K-nearest neighbors of a record x are data points that have the k smallest distance to x
Definition of Nearest Neighbor X X X (a) 1-nearest neighbor (b) 2-nearest neighbor (c) 3-nearest neighbor K-nearest neighbors of a record x are data points that have the k smallest distance to x
Predict class label of test instance with major vote strategy 0 02 KNN classifier The effect of K
Predict class label of test instance with major vote strategy The effect of K KNN classifier
Remarks Highly effective method for noisy training data Target function for a whole space may be described as a combination of less complex local approximations Learning is very simple (lazy learning) × Classification is time consuming x Difficult to determine the optimal k X Curse of Dimensionality
Remarks Highly effective method for noisy training data Target function for a whole space may be described as a combination of less complex local approximations Learning is very simple (lazy learning) × Classification is time consuming × Difficult to determine the optimal k × Curse of Dimensionality