Outline
◼ Cluster basics
◼ Clustering algorithms
  - Hierarchical clustering
  - k-means
  - Expectation-Maximization (EM)
◼ Cluster validity
  - Determining the number of clusters
  - Clustering evaluation
Clustering Analysis
◼ Definition: grouping data with similar features ("birds of a feather flock together")
◼ It is a method of data exploration: a way of looking for patterns or structure of interest in the data.
◼ Properties: unsupervised; input parameters needed
◼ Application fields: machine learning, pattern recognition, image analysis, data mining, information retrieval, bioinformatics, etc.
[Figure: k-means animation]
Factors of Clustering
◼ What data could be used in clustering?
  - Large or small, Gaussian or non-Gaussian, etc.
◼ Which clustering algorithm? (cost function)
  - Partition-based (e.g. k-means)
  - Model-based (e.g. EM algorithm)
  - Density-based (e.g. DBSCAN)
  - Genetic, spectral, …
◼ Choosing (dis)similarity measures: a critical step in clustering
  - Euclidean distance, …
  - Pearson linear correlation, …
◼ How to evaluate the clustering result? (cluster validity)
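The two (dis)similarity measures named above can be sketched in plain Python. This is a minimal illustration, not the course's reference code: the function names are made up for this example, and turning the correlation into a dissimilarity as 1 - r is one common convention, assumed here.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Pearson linear correlation coefficient r, in [-1, 1]."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def pearson_dissimilarity(x, y):
    """Assumed convention: 1 - r, so perfectly correlated vectors score 0."""
    return 1.0 - pearson_correlation(x, y)


# Two profiles with the same shape but very different magnitude.
u = [1.0, 2.0, 3.0]
v = [10.0, 20.0, 30.0]

print(euclidean_distance(u, v))     # large: the magnitudes differ
print(pearson_dissimilarity(u, v))  # close to zero: the shapes match
```

The example shows why the choice matters: Euclidean distance treats `u` and `v` as far apart, while the correlation-based measure treats them as nearly identical, so the same data can cluster differently under each measure.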
Quality: What Is Good Clustering?
◼ A good clustering method will produce high-quality clusters with
  - high intra-class similarity
  - low inter-class similarity
◼ The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
◼ The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
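The intra-class versus inter-class criterion can be checked numerically. The sketch below is an assumption-laden illustration: it uses Euclidean distance as the inverse of similarity (small distance means high similarity), averages over all point pairs, and uses hypothetical helper names and toy 2-D points that are not from the slides.

```python
import math
from itertools import combinations, product


def euclidean(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_intra_distance(clusters):
    """Average distance between pairs of points in the same cluster."""
    dists = [euclidean(p, q) for c in clusters for p, q in combinations(c, 2)]
    return sum(dists) / len(dists)


def mean_inter_distance(clusters):
    """Average distance between pairs of points in different clusters."""
    dists = [
        euclidean(p, q)
        for c1, c2 in combinations(clusters, 2)
        for p, q in product(c1, c2)
    ]
    return sum(dists) / len(dists)


# Toy data: the same six points, partitioned two different ways.
good = [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
bad = [[(0, 0), (10, 10), (1, 0)], [(0, 1), (10, 11), (11, 10)]]

for name, clusters in (("good", good), ("bad", bad)):
    print(name, mean_intra_distance(clusters), mean_inter_distance(clusters))
```

For the `good` partition the intra-cluster average is small relative to the inter-cluster average; for the `bad` partition the two are of similar size, which is the signature of a poor clustering under this distance-based reading.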
Requirements of clustering in data mining (1)
◼ Scalability
◼ Ability to deal with different types of attributes
◼ Discovery of clusters with arbitrary shape
◼ Minimal requirements for domain knowledge to determine input parameters