Formalization of K-means Problem P can be solved by iteratively solving the following two problems problem PI: FixQ=@ and solve the reduced problem P(w, 2) 2. Problem P2: Fix W=w and solve the reduced problem P(w, 2) Problem P, is solved by Wi/= l if d(i, on<d(Xi, o,, for I<t< k 0fort≠7 and problem P2 is solved by Ui. 1 xi q1, 1= 1. 7 forl≤l≤k,andl≤j≤m 16
16 Formalization of K-Means
K-Means, Cont The basic algorithm to solve problem P is given as follows( Selim and Ismail, 1984; Bobrowski and Bezdek, 1991) 1. Choose an initial and solve P(w, 0 to obtain 0. Set t=0 2. Let W=W"and solve P(W, g to obtain (+. If P(W, 0)=P(, 0+), output W, 0 and stop; otherwise, go to 3 3. Let 0=0+ and solve P(W, 2 to obtain W+l If P(,0 =P(7+, 0), output W, 2 and stop: otherwise, let t=t +1 and go to 2 17
17 K-Means: Cont
K-Modes: See J X. Huang s paper online (Data Mining and Knowledge Discovery Journal, Springer) 4.1. Dissimilarity measure Let X, y be two categorical objects described by m categorical attributes. The dissimilarity measure between X and y can be defined by the total mismatches of the corresponding attribute categories of the two objects The smaller the number of mismatches is, the more similar the two objects. This measure is often referred to as simple matching(Kaufman and Rousseeuw, 1990). Formally x1)=∑6(x,y where 6(x1,y)= (x≠y) 18
18 K-Modes: See J. X. Huang’s paper online (Data Mining and Knowledge Discovery Journal, Springer)
K-Modes(Cont 4. 2. Mode of a set Definition 1. A mode of X=X1, X, Xn is a vector 0=la1, 92,, un that DK=∑0 Here, 0 is not necessarily an element of X. 19
19 K-Modes (Cont.)
K-Modes 4.3. Find a mode for a set Let nck na- the relative frequency of category Ck in Y. be the number of objects having the kth category Ck. In attribute A; and fI(A Theorem 1. The function D(X, @)is minimised iff f(d; =q X)2f.(4;=ckj IX) 1og=c:mal=1,…m The proof of Theorem I is given in appendix Theorem I defines a way to find g from a given X, and therefore is important because it allows the k-means paradigm to be used to cluster categorical data. The theorem implies that the mode of a data set X is not unique. For example, the mode of set[a, b], a, c] Ic, b], b, c can be either [a, b] or a, c
20 K-Modes