The K-Means Clustering Method Example Assign Gate each the objects°23 cluste to most means similar reassign reassign K=2 Arbitrarily choose K object as initial cluster center Update the means 012345678910 11
11 The K-Means Clustering Method ◼ Example 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 K=2 Arbitrarily choose K object as initial cluster center Assign each objects to most similar center Update the cluster means Update the cluster means reassign reassign
Comments on the k-means method Strength: Relatively efficient: atkn), where n is objects, kis clusters, and t is iterations. Normally k, t<<n Comment: Often terminates at a /ocaloptimum Weakness Applicable only when mean is defined, then what about categorical data? Need to specify k the numberof clusters in advance Unable to handle noisy data and outliers too wel/ Not suitable to discover clusters with non-convex shapes 12
12 Comments on the K-Means Method ◼ Strength: Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. ◼ Comment: Often terminates at a local optimum. ◼ Weakness ◼ Applicable only when mean is defined, then what about categorical data? ◼ Need to specify k, the number of clusters, in advance ◼ Unable to handle noisy data and outliers too well ◼ Not suitable to discover clusters with non-convex shapes
Robustness Ⅹ 2 2 4 101.5 2.75 1000 13
13 Robustness 0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 1 10 100 1000 X X Y 1 2 2 4 3 3 400 2 101.5 2.75
Variations of the k-means method A few variants of the k-means which differ in Selection of the initial k means Dissimilarity calculations Strategies to calculate cluster means Handling categorical data: k-modes(huang 98) Replacing means of clusters with modes Using new dissimilarity measures to deal with categorical objects Using a frequency-based method to update modes of clusters A mixture of categorical and numerical data k-prototype method 14
14 Variations of the K-Means Method ◼ A few variants of the k-means which differ in ◼ Selection of the initial k means ◼ Dissimilarity calculations ◼ Strategies to calculate cluster means ◼ Handling categorical data: k-modes (Huang’98) ◼ Replacing means of clusters with modes ◼ Using new dissimilarity measures to deal with categorical objects ◼ Using a frequency-based method to update modes of clusters ◼ A mixture of categorical and numerical data: k-prototype method
K-Modes: See J X. Huang s paper online (Data Mining and Knowledge Discovery Journal, Springer) 3. The k-means algorithm The k-means algorithm(Mac Queen, 1967; Anderberg, 1973), one of the mostly used clustering algorithms, is classified as a partitional or nonhierarchical clustering method (ain and Dubes, 1988). Given a set of numeric objects X and an integer number k(<n), the k-means algorithm searches for a partition of X into k clusters that minimises the within groups sum of squared errors (WGSS). This process is often formulated as the following mathematical program problem P(Selim and Ismail, 1984; Bobrowski and Bezdek, 1991) Minimise p(,Q)=∑∑md(x,Q) subject to <I<n ∈{0,1},1≤i≤n,1≤l≤k where W is an n x k partition matrix, 0=(01, 02, .., 0k is a set of objects in the same object domain, and d(, )is the squared euclidean distance between two objects 15
15 K-Modes: See J. X. Huang’s paper online (Data Mining and Knowledge Discovery Journal, Springer)