Decision tree Induction Many algorithms Hunt's Algorithm(one of the earliest) CART D3,C4.5 SLIQ. SPRINT C Tan, Steinbach, Kumar Introduction to Data Mining 18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Decision Tree Induction Many Algorithms: – Hunt’s Algorithm (one of the earliest) – CART – ID3, C4.5 – SLIQ,SPRINT
General Structure of Hunt's Algorithm Let d, be the set of training records Tid Refund Marital Taxable Status Income Cheat that reach a node t 125K General Procedure 2No Married 100K No If D. contains records that Single 70K belong the same class yt, then t Married 120K Divorced 95K is a leaf node labeled as yt Married 60K No If D, is an empty set, then t is a Divorced 220K No leaf node labeled by the default 8No Single 85K class, yd 9No Married 75K No If D. contains records that 10No Single 90K Yes belong to more than one class use an attribute test to split the data into smaller subsets Recursively apply the procedure to each subset C Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› General Structure of Hunt’s Algorithm Let Dt be the set of training records that reach a node t General Procedure: – If Dt contains records that belong the same class yt , then t is a leaf node labeled as yt – If Dt is an empty set, then t is a leaf node labeled by the default class, yd – If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset. Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10 Dt ?
Hunt's Algorithm id refund marital Taxable Status Income Cheat Single No 2No Married No (Refund Dont 3No single 70K No Cheat es 4YesMarried No Dont 5No Divorced 95K Yes Cheat Cheat Married 60K No 7 Yes Divorced 220K No 8N ngle 85K (Refund Refund 9No Married 75K No Yes No Yes No 10No Single 90K Yes Dont n Marita Marital Cheat Cheat Status Stat Single single Married Married Divorced Divorced Dont Taxable Dont Cheat Cheat Cheat Income <80K >=80K n Cheat Che C Tan, Steinbach, Kumar Introduction to Data Mining 4/18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Hunt’s Algorithm Don’t Cheat Refund Don’t Cheat Don’t Cheat Yes No Refund Don’t Cheat Yes No Marital Status Don’t Cheat Cheat Single, Divorced Married Taxable Income Don’t Cheat < 80K >= 80K Refund Don’t Cheat Yes No Marital Status Don’t Cheat Cheat Single, Divorced Married Tid Refund Marital Status Taxable Income Cheat 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes 10
Tree Induction Greedy strategy Split the records based on an attribute test that optimizes certain criterion ssues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting C Tan, Steinbach, Kumar Introduction to Data Mining 18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Tree Induction Greedy strategy. – Split the records based on an attribute test that optimizes certain criterion. Issues – Determine how to split the records ◆How to specify the attribute test condition? ◆How to determine the best split? – Determine when to stop splitting
Tree Induction Greedy strategy Split the records based on an attribute test that optimizes certain criterion ssues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting C Tan, Steinbach, Kumar Introduction to Data Mining 18/2004
© Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 ‹#› Tree Induction Greedy strategy. – Split the records based on an attribute test that optimizes certain criterion. Issues – Determine how to split the records ◆How to specify the attribute test condition? ◆How to determine the best split? – Determine when to stop splitting