Requirements of clustering in data mining(2) a Ability to deal with noise and outliers 丈 Insensitivity to order of input records a High dimensionality a Incorporation of user- specified constraints Interpretability and usability
Requirements of clustering in data mining (2) ◼ Ability to deal with noise and outliers ◼ Insensitivity to order of input records ◼ High dimensionality ◼ Incorporation of userspecified constraints ◼ Interpretability and usability 7
Similarity and dissimilarit a there is no single definition of similarity or dissimilarity between data objects The definition of similarity or dissimilarity between objects depends on a the type of the data considered a what kind of similarity we are looking for
Similarity and dissimilarity 8 ◼ There is no single definition of similarity or dissimilarity between data objects ◼ The definition of similarity or dissimilarity between objects depends on the type of the data considered what kind of similarity we are looking for
Similarity and dissimilarit Similarity/dissimilarity between objects is often expressed in terms of a distance measure d(x, y) a Ideally, every distance measure should be a metric. i.e. it should satisfy the following conditions (x,y)≥0 2. d(x,y)=oiff x 3.d(x,y)=d(y,x) 4.d(x,2)≤d(x,y)+d(y,2)
Similarity and dissimilarity 9 ◼ Similarity/dissimilarity between objects is often expressed in terms of a distance measure d(x,y) ◼ Ideally, every distance measure should be a metric, i.e., it should satisfy the following conditions: 4. ( , ) ( , ) ( , ) 3. ( , ) ( , ) 2. ( , ) 0 iff 1. ( , ) 0 d x z d x y d y z d x y d y x d x y x y d x y + = = =
Euclidean distance Find distance between X and y: xEX, yEY, n="number of dimensions", A="degree of minkowski" x2Ⅹ x ="absolute value of x",W="weight vector"can be all I 单Qw0(VE(+(x-¥xD) ⊥=1 A=1 city Block(x, r)=2(**( x,-y D) 主=1 Euclidean(x, y) E(w:*(lxY, D') Y = 1=1
Euclidean distance ◼ Here D is the number of dimensions in the data vector. For instance: ◆ RGB color channel in images ◆ Logitude and latitude in GPS data = = − D i d euc x i y i 1 2 (x,y) ( ) 10
Pearson Linear Correlation (x1-x)v1- No obvious p(x,y) relationship? X -x A medium direcl relationship? x Input A strong, inverse relationship a Were shifting the expression profiles down(subtracting the means)and scaling by the standard deviations (i.e. making the data have mean =0 and std= 1) a Always between-1 and +1(perfectly anti-correlated and perfectly correlated) 11
Pearson Linear Correlation ◼ We’re shifting the expression profiles down (subtracting the means) and scaling by the standard deviations (i.e., making the data have mean = 0 and std = 1) ◼ Always between –1 and +1 (perfectly anti-correlated and perfectly correlated) = = − − − − = = = = n i i n i i n i i n i i i n i i y n y x n x x x y y x x y y 1 1 ( ) ( ) ( )( ) ( , ) 1 2 1 2 1 x y 11