Major Clustering Approaches (D) Model-based A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other Typical methods: EM, SOM, COBWEB Frequent pattern-based: Based on the analysis of frequent patterns Typical methods: p-Cluster User-guided or constraint-based Clustering by considering user-specified or application-specific constraints Typical methods: COD (obstacles), constrained clustering Link-based clustering Objects are often linked together in various ways Massive links can be used to cluster objects: SimRank, Linkclus 11
Major Clustering Approaches (II) ◼ Model-based: ◼ A model is hypothesized for each of the clusters and tries to find the best fit of that model to each other ◼ Typical methods: EM, SOM, COBWEB ◼ Frequent pattern-based: ◼ Based on the analysis of frequent patterns ◼ Typical methods: p-Cluster ◼ User-guided or constraint-based: ◼ Clustering by considering user-specified or application-specific constraints ◼ Typical methods: COD (obstacles), constrained clustering ◼ Link-based clustering: ◼ Objects are often linked together in various ways ◼ Massive links can be used to cluster objects: SimRank, LinkClus 11
Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster Analysis: Basic Concepts Partitioning Methods Hierarchical methods Density-Based Methods Grid-Based methods Evaluation of clustering Summary 12
12 Chapter 10. Cluster Analysis: Basic Concepts and Methods ◼ Cluster Analysis: Basic Concepts ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Evaluation of Clustering ◼ Summary 12
Partitioning Algorithms: Basic Concept Partitioning method: Partitioning a database d of n objects into a set of k clusters, such that the sum of squared distances is minimized(where C is the centroid or medoid of cluster Ci) E=Xi=epec(d(p, ci ) Given k, find a partition of k clusters that optimizes the chosen partitioning criterion Global optimal: exhaustively enumerate all partitions Heuristic methods: k-means and k-medoids algorithms k-means MacQueen 67, Lloyd 57/82: Each cluster is represented by the center of the cluster k-medoids or PAM(Partition around medoids)( Kaufman Rousseeuw87 ): Each cluster is represented by one of the objects in the cluster 13
Partitioning Algorithms: Basic Concept ◼ Partitioning method: Partitioning a database D of n objects into a set of k clusters, such that the sum of squared distances is minimized (where ci is the centroid or medoid of cluster Ci ) ◼ Given k, find a partition of k clusters that optimizes the chosen partitioning criterion ◼ Global optimal: exhaustively enumerate all partitions ◼ Heuristic methods: k-means and k-medoids algorithms ◼ k-means (MacQueen’67, Lloyd’57/’82): Each cluster is represented by the center of the cluster ◼ k-medoids or PAM (Partition around medoids) (Kaufman & Rousseeuw’87): Each cluster is represented by one of the objects in the cluster 2 1 ( ( , )) p C i k i E d p c = i = 13
The K-Means Clustering Method Given k, the k-means algorithm is implemented in four steps Partition objects into k nonempty subsets Compute seed points as the centroids of the clusters of the current partitioning( the centroid is the center, i. e mean point, of the cluster) Assign each object to the cluster with the nearest seed point Go back to Step 2, stop when the assignment does not change 14
The K-Means Clustering Method ◼ Given k, the k-means algorithm is implemented in four steps: ◼ Partition objects into k nonempty subsets ◼ Compute seed points as the centroids of the clusters of the current partitioning (the centroid is the center, i.e., mean point, of the cluster) ◼ Assign each object to the cluster with the nearest seed point ◼ Go back to Step 2, stop when the assignment does not change 14
An Example of K-Means Clustering K=2 Arbitrarily Update the partition cluster objects into centroids groups The initial data set Loop if Reassign_objects needed Partition objects into k nonempty subsets Repeat Compute centroid (i.e, mean Update the point)for each partition cluster centroids Assign each object to the cluster of its nearest centroid ■ Until no change 15
An Example of K-Means Clustering K=2 Arbitrarily partition objects into k groups Update the cluster centroids Update the cluster centroids Reassign objects Loop if needed 15 The initial data set ◼ Partition objects into k nonempty subsets ◼ Repeat ◼ Compute centroid (i.e., mean point) for each partition ◼ Assign each object to the cluster of its nearest centroid ◼ Until no change