Considerations for Cluster Analysis
◼ Partitioning criteria
    ◼ Single level vs. hierarchical partitioning (often, multi-level hierarchical partitioning is desirable)
◼ Separation of clusters
    ◼ Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive (e.g., one document may belong to more than one class)
◼ Similarity measure
    ◼ Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based (e.g., density or contiguity); see the sketch below
◼ Clustering space
    ◼ Full space (often when low-dimensional) vs. subspaces (often in high-dimensional clustering)
11
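The distance-based vs. connectivity-based distinction can be made concrete with a small sketch. The Python snippet below is not from the slides; the names euclidean and eps_connected are illustrative only. It contrasts straight-line Euclidean distance with a contiguity-style notion in which two points count as similar when a chain of nearby points links them, which is roughly the intuition behind density- or contiguity-based methods.

# Minimal sketch (illustration only): two notions of similarity.
import numpy as np

def euclidean(a, b):
    """Distance-based similarity: straight-line (L2) distance between two points."""
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def eps_connected(points, i, j, eps=1.0):
    """Connectivity-based view: i and j are 'similar' if a chain of points
    links them, where consecutive points are within eps of each other."""
    points = np.asarray(points, dtype=float)
    visited, frontier = {i}, [i]
    while frontier:
        p = frontier.pop()
        if p == j:
            return True
        for q in range(len(points)):
            if q not in visited and euclidean(points[p], points[q]) <= eps:
                visited.add(q)
                frontier.append(q)
    return False

pts = [[0, 0], [0.8, 0], [1.6, 0], [5, 5]]
print(euclidean(pts[0], pts[2]))        # 1.6: fairly far in raw distance
print(eps_connected(pts, 0, 2, eps=1))  # True: linked via the middle point
print(eps_connected(pts, 0, 3, eps=1))  # False: no chain reaches [5, 5]

With eps = 1, the first and third points are 1.6 apart yet still connectivity-linked through the middle point, while the isolated point at (5, 5) connects to nothing.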
Requirements and Challenges
◼ Scalability
    ◼ Clustering all the data instead of only on samples
◼ Ability to deal with different types of attributes
    ◼ Numerical, binary, categorical, ordinal, linked, and mixtures of these (see the mixed-attribute sketch below)
◼ Constraint-based clustering
    ◼ User may give inputs on constraints
    ◼ Use domain knowledge to determine input parameters
◼ Interpretability and usability
◼ Others
    ◼ Discovery of clusters with arbitrary shape
    ◼ Ability to deal with noisy data
    ◼ Incremental clustering and insensitivity to input order
    ◼ High dimensionality
12
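One way to handle mixed attribute types is a Gower-style dissimilarity, which averages a scaled difference for numeric attributes with a simple mismatch indicator for categorical ones. The slide does not prescribe this (or any) method; the sketch below, with the hypothetical gower_distance helper, only shows that numerical, binary, and categorical attributes can be combined into a single measure.

# Minimal sketch (assumption: Gower-style dissimilarity; not prescribed by the slides).
def gower_distance(x, y, kinds, ranges):
    """Average per-attribute dissimilarity in [0, 1].

    kinds[i]  -- 'num' or 'cat' for attribute i
    ranges[i] -- value range of numeric attribute i (ignored for 'cat')
    """
    total = 0.0
    for xi, yi, kind, rng in zip(x, y, kinds, ranges):
        if kind == 'num':
            total += abs(xi - yi) / rng          # scaled numeric difference
        else:
            total += 0.0 if xi == yi else 1.0    # simple categorical/binary mismatch
    return total / len(x)

# Customers described by (age, income, region, owns_car)
a = (35, 52000, 'north', True)
b = (41, 48000, 'south', True)
kinds  = ('num', 'num', 'cat', 'cat')
ranges = (60, 100000, None, None)
print(gower_distance(a, b, kinds, ranges))  # ~0.29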
Major Clustering Approaches (I)
◼ Partitioning approach:
    ◼ Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors
    ◼ Typical methods: k-means, k-medoids, CLARANS (a minimal k-means sketch follows below)
◼ Hierarchical approach:
    ◼ Create a hierarchical decomposition of the set of data (or objects) using some criterion
    ◼ Typical methods: DIANA, AGNES, BIRCH, CHAMELEON
◼ Density-based approach:
    ◼ Based on connectivity and density functions
    ◼ Typical methods: DBSCAN, OPTICS, DenClue
◼ Grid-based approach:
    ◼ Based on a multiple-level granularity structure
    ◼ Typical methods: STING, WaveCluster, CLIQUE
13
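As a rough illustration of the partitioning idea, the sketch below implements the classic two-step k-means loop in NumPy: assign each point to its nearest center, then move each center to the mean of its assigned points, repeating until the centers stop changing. It is a minimal teaching version (random initialization, fixed k, no restarts), not a substitute for a production implementation.

# Minimal k-means sketch (illustration only; initialization and stopping
# rule are simplistic compared to production implementations).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each center to the mean of its assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

X = [[1, 1], [1.2, 0.8], [0.9, 1.1], [8, 8], [8.2, 7.9], [7.8, 8.1]]
centers, labels = kmeans(X, k=2)
print(labels)   # two well-separated groups, e.g. [0 0 0 1 1 1] (up to label order)

Each iteration can only lower the sum of squared errors, which is exactly the criterion named in the partitioning bullet above.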
Major Clustering Approaches (II)
◼ Model-based:
    ◼ A model is hypothesized for each cluster, and the goal is to find the best fit of the data to the hypothesized models (a minimal EM sketch follows below)
    ◼ Typical methods: EM, SOM, COBWEB
◼ Frequent pattern-based:
    ◼ Based on the analysis of frequent patterns
    ◼ Typical methods: p-Cluster
◼ User-guided or constraint-based:
    ◼ Clustering by considering user-specified or application-specific constraints
    ◼ Typical methods: COD (obstacles), constrained clustering
◼ Link-based clustering:
    ◼ Objects are often linked together in various ways
    ◼ Massive links can be used to cluster objects: SimRank, LinkClus
14
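To make the model-based idea concrete, the sketch below fits a two-component one-dimensional Gaussian mixture with EM: the E-step computes each component's responsibility for every point under the current model, and the M-step re-estimates weights, means, and variances from those responsibilities. This is a bare-bones illustration (crude initialization, fixed iteration count, no safeguards), assuming NumPy; the em_gmm_1d name exists only for this example.

# Minimal EM sketch for a two-component 1-D Gaussian mixture
# (illustration of the model-based idea; real EM code adds convergence tests,
# protection against degenerate variances, multiple restarts, etc.).
import numpy as np

def em_gmm_1d(x, n_iter=50):
    x = np.asarray(x, dtype=float)
    # Crude initialization: one mean at each extreme, shared variance, equal weights.
    mu = np.array([x.min(), x.max()])
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from responsibilities.
        nk = resp.sum(axis=0)
        pi = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
pi, mu, var = em_gmm_1d(x)
print(mu)   # means close to 0 and 6 (component order may vary)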
Notion of a Cluster Can Be Ambiguous
◼ How many clusters? [Figure: the same set of points can plausibly be grouped into two, four, or six clusters]
15