Clustering Instructor: Qiang Yang Hong Kong University of Science and Technology Yang @cs. ust. hk Thanks: J W. Han, I witten e frank
1 Clustering Instructor: Qiang Yang Hong Kong University of Science and Technology Qyang@cs.ust.hk Thanks: J.W. Han, I. Witten, E. Frank
Essentials erminology Objects rows= records Variables attributes= features a good clustering method high on intra-class similarity and low on inter-class similarity What is similarity? Based on computation of distance Between two numerical attributes Between two nominal attributes Mixed attributes
2 Essentials ◼ Terminology: ◼ Objects = rows = records ◼ Variables = attributes = features ◼ A good clustering method ◼ high on intra-class similarity and low on inter-class similarity ◼ What is similarity? ◼ Based on computation of distance ◼ Between two numerical attributes ◼ Between two nominal attributes ◼ Mixed attributes
The database XX Object i p
3 The database n n p i i p p x x x x x x 1 1 1 1 1 ... ... ... ... ... ... ... ... Object i
Numerical attributes Distances are normally used to measure the similarity or dissimilarity between two data objects Euclidean distance d(1)=,(x2-x,}+x2-x,P+,+|x,-x2) 12J2 Jp Where/=(Ⅻ…,)andj=(场灬加)le two p-dimensional records, Manhattan distance d(i)=x.-x,+|x.-x,|+.+1x2-x
4 Numerical Attributes ◼ Distances are normally used to measure the similarity or dissimilarity between two data objects ◼ Euclideandistance: where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional records, ◼ Manhattan distance | | ... | | ) 2 ( , ) (| | 2 2 1 1 2 2 p jp x i x j x i x j x i d i j = x − + − + + − ( , ) | | | | ... | | 1 1 2 2 p jp x i x j x i x j x i d i j = x − + − + + −
Binary variables([0, 1], or [true, false]) contingency table for binary data Row 10 sum 6 a+b Row i d c+d sum a+c b+d Simple matching coefficient btc +6+c+d Invariant of coding of binary variable: if you assign 1 to pass"and 0 to fail or the other way around, you'll get the same distance value 5
5 Binary Variables ({0, 1}, or {true, false}) ◼ A contingency table for binary data ◼ Simple matching coefficient ◼ Invariant of coding of binary variable: if you assign 1 to “pass” and 0 to “fail”, or the other way around, you’ll get the same distance value. a b c d b c d i j + + + ( , )= + sum a c b d p c d c d a b a b sum + + + + 0 1 1 0 Row i Row j