Tree Induction Greedy strategy Split the records based on an attribute test that optimizes certain criterion ssues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting C Tan, Steinbach, Kumar Introduction to Data Mining 18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Tree Induction Greedy strategy. – Split the records based on an attribute test that optimizes certain criterion. Issues – Determine how to split the records ◆How to specify the attribute test condition? ◆How to determine the best split? – Determine when to stop splitting
How to determine the Best Split Before Splitting: 10 records of class 0, 10 records of class 1 Own Car Student Car? Type? ID? Yes N Family luxury Sport C0:6|c0:4 C0:1 c0:8 C0:1 C0: C0:1c0:0 c0:0 C1:4c1:6 C1:3C1:0 C1:7 c1:0¨c1:0c1:1c1:1 Which test condition is the best? C Tan, Steinbach, Kumar Introduction to Data Mining 18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› How to determine the Best Split Own Car? C0: 6 C1: 4 C0: 4 C1: 6 C0: 1 C1: 3 C0: 8 C1: 0 C0: 1 C1: 7 Car Type? C0: 1 C1: 0 C0: 1 C1: 0 C0: 0 C1: 1 Student ID? ... Yes No Family Sports Luxury c1 c10 c20 C0: 0 C1: 1 ... c11 Before Splitting: 10 records of class 0, 10 records of class 1 Which test condition is the best?
How to determine the Best Split Greedy approach Nodes with homogeneous class distribution are preferred Need a measure of node impurity C0:5 C0:9 C1:5 C1:1 Non-homogeneous, Homogeneous, High degree of impurity Low degree of impurity C Tan, Steinbach, Kumar Introduction to Data Mining 18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› How to determine the Best Split Greedy approach: – Nodes with homogeneous class distribution are preferred Need a measure of node impurity: C0: 5 C1: 5 C0: 9 C1: 1 Non-homogeneous, High degree of impurity Homogeneous, Low degree of impurity
Measures of Node Impurity Gini Index Entropy Misclassification error C Tan, Steinbach, Kumar Introduction to Data Mining 18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Measures of Node Impurity Gini Index Entropy Misclassification error
How to Find the best split Before Splitting: a NoO MO NO1 A? B? N Yes Node ni Node n2 Node n3 Node n4 CO N10 CO N20 CO N30 CO N40 C1 N11 C1 N21 C1N31 C1 N41 M2 M M4 12 M34 Gain= Mo-M12 Vs M0-M34 C Tan, Steinbach, Kumar Introduction to Data Mining 18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› How to Find the Best Split B? Yes No Node N3 Node N4 A? Yes No Node N1 Node N2 Before Splitting: C0 N10 C1 N11 C0 N20 C1 N21 C0 N30 C1 N31 C0 N40 C1 N41 C0 N00 C1 N01 M0 M1 M2 M3 M4 M12 M34 Gain = M0 – M12 vs M0 – M34