6 Nominal Attributes
◼ A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green
◼ Method 1: Simple matching
◼ m: # of matches, p: total # of variables
◼ d(i, j) = (p − m) / p
◼ Method 2: use a large number of binary variables
◼ creating a new binary variable for each of the M nominal states
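The simple matching measure above can be sketched in a few lines of Python. The function name and the example objects (tuples of nominal attribute values) are hypothetical, chosen only for illustration:

```python
def simple_matching_dissimilarity(obj_i, obj_j):
    """d(i, j) = (p - m) / p for nominal attributes:
    p = total # of variables, m = # of matching values."""
    p = len(obj_i)
    m = sum(1 for a, b in zip(obj_i, obj_j) if a == b)
    return (p - m) / p

# Hypothetical objects described by two nominal attributes (color, shape):
# one of two attributes matches, so d = (2 - 1) / 2 = 0.5
print(simple_matching_dissimilarity(("red", "round"), ("red", "square")))
```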
7 Other measures of cluster distance
◼ Minimum distance: d(Ci, Cj) = min { |p − p'| : p ∈ Ci, p' ∈ Cj }
◼ Max distance: d(Ci, Cj) = max { |p − p'| : p ∈ Ci, p' ∈ Cj }
◼ Mean distance: d(Ci, Cj) = |mi − mj|, where mi, mj are the means of Ci, Cj
◼ Average distance: d(Ci, Cj) = (1 / (ni · nj)) · Σ_{p ∈ Ci} Σ_{p' ∈ Cj} |p − p'|, where ni, nj are the cluster sizes
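The four cluster-distance measures can be sketched directly from their definitions. This is a minimal illustration using Euclidean distance between 2-D points; the function names and example clusters are hypothetical:

```python
from itertools import product

def euclid(p, q):
    """Euclidean distance |p - q| between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def centroid(cluster):
    """Mean point of a cluster (component-wise average)."""
    n = len(cluster)
    return tuple(sum(coords) / n for coords in zip(*cluster))

def min_distance(ci, cj):
    return min(euclid(p, q) for p, q in product(ci, cj))

def max_distance(ci, cj):
    return max(euclid(p, q) for p, q in product(ci, cj))

def mean_distance(ci, cj):
    # Distance between the cluster means mi and mj
    return euclid(centroid(ci), centroid(cj))

def average_distance(ci, cj):
    # Average over all ni * nj pairwise distances
    return sum(euclid(p, q) for p, q in product(ci, cj)) / (len(ci) * len(cj))
```

Note that minimum and maximum distance look only at one extreme pair, while mean and average distance summarize all points, which makes them less sensitive to outliers.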
8 Major clustering methods
◼ Partition based (K-means)
◼ Produces sphere-like clusters
◼ Good when the number of clusters is known
◼ Good for small and medium-sized databases
◼ Hierarchical methods (agglomerative or divisive)
◼ Produces trees of clusters
◼ Fast
◼ Density based (DBSCAN)
◼ Produces arbitrary-shaped clusters
◼ Good when dealing with spatial clusters (maps)
◼ Grid-based
◼ Produces clusters based on grids
◼ Fast for large, multidimensional databases
◼ Model-based
◼ Based on statistical models
◼ Allows objects to belong to several clusters
9 The K-Means Clustering Method: for numerical attributes
◼ Given k, the k-means algorithm is implemented in four steps:
◼ Partition objects into k non-empty subsets
◼ Compute seed points as the centroids of the clusters of the current partition (the centroid is the center, i.e., mean point, of the cluster)
◼ Assign each object to the cluster with the nearest seed point
◼ Go back to Step 2; stop when there are no new assignments
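The four steps above can be sketched as a minimal k-means loop. This is an illustrative implementation with hypothetical function and variable names, using random points as initial seeds and squared Euclidean distance for the nearest-centroid assignment:

```python
import random

def k_means(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch following the four steps on the slide."""
    rng = random.Random(seed)
    # Step 1/2: pick k initial seed points (here: random objects)
    centroids = rng.sample(points, k)
    clusters = []
    for _ in range(max_iters):
        # Step 3: assign each object to the cluster with the nearest seed point
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centroids[i])))
            clusters[idx].append(p)
        # Step 2 (repeated): recompute seed points as cluster centroids
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when there are no new assignments (centroids unchanged)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups converge to their mean points
pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
cents, _ = k_means(pts, 2)
print(sorted(cents))
```

Note that k-means is sensitive to the initial seeds; production implementations typically run several random restarts or use smarter seeding.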
10 The mean point
◼ Example points (X, Y): (1, 2), (2, 4), (3, 3), (4, 2)
◼ Mean point: (2.5, 2.75)
◼ The mean point can be a virtual point (it need not coincide with any actual data point)
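The mean point on the slide is just the component-wise average of the four example points, which is easy to verify:

```python
# The four example points (X, Y) from the slide
points = [(1, 2), (2, 4), (3, 3), (4, 2)]

mean_x = sum(x for x, _ in points) / len(points)
mean_y = sum(y for _, y in points) / len(points)

# (2.5, 2.75) is not one of the original points: a "virtual" point
print((mean_x, mean_y))
```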