Classification—A Two-Step Process
◼ Model construction: describing a set of predetermined classes
    ◼ Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
    ◼ The set of tuples used for model construction is the training set
    ◼ The model is represented as classification rules, decision trees, or mathematical formulae
◼ Model usage: for classifying future or unknown objects (see the sketch after this list)
    ◼ Estimate the accuracy of the model
        ◼ The known label of each test sample is compared with the classified result from the model
        ◼ The accuracy rate is the percentage of test set samples that are correctly classified by the model
        ◼ The test set is independent of the training set (otherwise overfitting occurs)
    ◼ If the accuracy is acceptable, use the model to classify new data
◼ Note: if the test set is used to select models, it is called a validation (test) set
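The following minimal Python sketch walks through the two steps; the dataset (Iris), the 70/30 split, and the choice of a decision tree are illustrative assumptions, not part of the slides:

# Minimal sketch of the two-step process (any classifier would do).
from sklearn.datasets import load_iris          # stand-in dataset (assumption)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# The test set must be independent of the training set (otherwise overfitting).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Step 1: model construction on the training set.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- compare known test labels with the classified results.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy rate: {accuracy:.2%}")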
Illustrating Classification Task

[Figure: a learning algorithm performs induction on the training set to learn a model; the model is then applied to the test set (deduction).]

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
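As a sketch of induction and deduction on the tables above (the numeric encoding of the categorical attributes and the use of scikit-learn are assumptions, not part of the slides):

# Induction on the training set, deduction on the test set.
from sklearn.tree import DecisionTreeClassifier

a1 = {"No": 0, "Yes": 1}                    # Attrib1 encoding (assumption)
a2 = {"Small": 0, "Medium": 1, "Large": 2}  # Attrib2 encoding, ordinal for brevity

# Training set (Tids 1-10) from the table: (Attrib1, Attrib2, Attrib3, Class).
train = [("Yes","Large",125,"No"), ("No","Medium",100,"No"), ("No","Small",70,"No"),
         ("Yes","Medium",120,"No"), ("No","Large",95,"Yes"), ("No","Medium",60,"No"),
         ("Yes","Large",220,"No"), ("No","Small",85,"Yes"), ("No","Medium",75,"No"),
         ("No","Small",90,"Yes")]
X = [[a1[r[0]], a2[r[1]], r[2]] for r in train]
y = [r[3] for r in train]

model = DecisionTreeClassifier(random_state=0).fit(X, y)  # induction: learn model

# Test set (Tids 11-15) has unknown class labels ("?").
test = [("No","Small",55), ("Yes","Medium",80), ("Yes","Large",110),
        ("No","Small",95), ("No","Large",67)]
print(model.predict([[a1[t[0]], a2[t[1]], t[2]] for t in test]))  # deduction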
Process (1): Model Construction

[Figure: classification algorithms are applied to the training data to produce a classifier (model).]

Training Data:
NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
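Since the learned model here is a single rule, it can be written directly as a Python function; the assert loop is an illustrative check (not from the slides) that the rule is consistent with the training data:

# The learned classifier (model) from the slide, as a plain Python rule:
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'.
def tenured(rank: str, years: int) -> str:
    return "yes" if rank == "professor" or years > 6 else "no"

# Sanity check against the training data (NAME, RANK, YEARS, TENURED).
training_data = [("Mike", "assistant prof", 3, "no"),
                 ("Mary", "assistant prof", 7, "yes"),
                 ("Bill", "professor",      2, "yes"),
                 ("Jim",  "associate prof", 7, "yes"),
                 ("Dave", "assistant prof", 6, "no"),
                 ("Anne", "associate prof", 3, "no")]
for name, rank, years, label in training_data:
    assert tenured(rank, years) == label  # the rule fits every training tuple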
Process (2): Using the Model in Prediction

[Figure: the classifier is evaluated on the testing data, then applied to unseen data.]

Testing Data:
NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data: (Jeff, Professor, 4) → Tenured? Yes
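Continuing the sketch, we can estimate accuracy on the testing data and then classify the unseen tuple; this worked example is an illustration, not from the slides:

# Reuses the tenured() rule from Process (1); redefined so the snippet runs standalone.
def tenured(rank: str, years: int) -> str:
    return "yes" if rank == "professor" or years > 6 else "no"

# Estimate accuracy on the testing data (NAME, RANK, YEARS, known TENURED).
testing_data = [("Tom",     "assistant prof", 2, "no"),
                ("Merlisa", "associate prof", 7, "no"),
                ("George",  "professor",      5, "yes"),
                ("Joseph",  "assistant prof", 7, "yes")]
correct = sum(tenured(rank, years) == label for _, rank, years, label in testing_data)
print(f"accuracy rate: {correct}/{len(testing_data)}")  # 3/4: Merlisa is misclassified

# If the accuracy is acceptable, classify the unseen data (Jeff, Professor, 4).
print(tenured("professor", 4))  # -> 'yes', since rank = 'professor' matches the rule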
Algorithm for Decision Tree Induction
◼ Basic algorithm (a greedy algorithm); see the sketch after this list
    ◼ Tree is constructed in a top-down recursive divide-and-conquer manner
    ◼ At start, all the training examples are at the root
    ◼ Attributes are categorical (if continuous-valued, they are discretized in advance)
    ◼ Examples are partitioned recursively based on selected attributes
    ◼ Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
◼ Conditions for stopping partitioning
    ◼ All samples for a given node belong to the same class
    ◼ There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
    ◼ There are no samples left
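A minimal Python sketch of this basic algorithm, assuming categorical attributes and information gain as the selection measure (the function names and the dict-based tree representation are illustrative choices):

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(samples, labels, attr):
    """Reduction in entropy from partitioning the samples on `attr`."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(s[attr] for s in samples):
        subset = [l for s, l in zip(samples, labels) if s[attr] == value]
        gain -= len(subset) / n * entropy(subset)
    return gain

def build_tree(samples, labels, attributes):
    """Top-down recursive divide-and-conquer induction.

    `samples` is a list of dicts mapping attribute name -> categorical value.
    """
    # Stop: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> classify the leaf by majority voting.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy step: select the test attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(samples, labels, a))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    # Partition the examples recursively on the selected attribute. Branching
    # only on values actually present means no branch is ever empty; the
    # textbook version labels an empty branch with the parent's majority class.
    for value in set(s[best] for s in samples):
        part = [(s, l) for s, l in zip(samples, labels) if s[best] == value]
        tree[best][value] = build_tree([s for s, _ in part],
                                       [l for _, l in part], remaining)
    return tree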