COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei C2012 Han Kamber pei. All rights reserved
1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified by Dr. Lei Chen based on the slides provided by Jiawei Han, Micheline Kamber, and Jian Pei ©2012 Han, Kamber & Pei. All rights reserved
Chapter 10. Cluster Analysis: Basic Concepts and Methods Cluster Analysis: Basic Concepts Partitioning Methods Hierarchical methods Density-Based Methods Grid-Based methods Evaluation of clustering Summary
2 Chapter 10. Cluster Analysis: Basic Concepts and Methods ◼ Cluster Analysis: Basic Concepts ◼ Partitioning Methods ◼ Hierarchical Methods ◼ Density-Based Methods ◼ Grid-Based Methods ◼ Evaluation of Clustering ◼ Summary 2
What is Cluster Analysis? Cluster: a collection of data objects similar(or related) to one another within the same group dissimilar (or unrelated) to the objects in other groups Cluster analysis(or clustering, data segmentation,. Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters Unsupervised learning: no predefined classes (i.e, learning by observations vs learning by examples: supervised) Typical applications As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms
3 What is Cluster Analysis? ◼ Cluster: A collection of data objects ◼ similar (or related) to one another within the same group ◼ dissimilar (or unrelated) to the objects in other groups ◼ Cluster analysis (or clustering, data segmentation, …) ◼ Finding similarities between data according to the characteristics found in the data and grouping similar data objects into clusters ◼ Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning by examples: supervised) ◼ Typical applications ◼ As a stand-alone tool to get insight into data distribution ◼ As a preprocessing step for other algorithms
Clustering for Data Understanding and Applications Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species Information retrieval: document clustering Land use: ldentification of areas of similar land use in an earth observation database Marketing Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs City-planning Identifying groups of houses according to their house type, value, and geographical location Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults Climate: understanding earth climate, find patterns of atmospheric and ocean Economic Science: market resarch
4 Clustering for Data Understanding and Applications ◼ Biology: taxonomy of living things: kingdom, phylum, class, order, family, genus and species ◼ Information retrieval: document clustering ◼ Land use: Identification of areas of similar land use in an earth observation database ◼ Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs ◼ City-planning: Identifying groups of houses according to their house type, value, and geographical location ◼ Earth-quake studies: Observed earth quake epicenters should be clustered along continent faults ◼ Climate: understanding earth climate, find patterns of atmospheric and ocean ◼ Economic Science: market resarch
Clustering as a Preprocessing Tool ( Utility) Summarization Preprocessing for regression, PCA, classification, and association analysis Compression Image processing: vector quantization Finding K-nearest Neighbors Localizing search to one or a small number of clusters Outlier detection Outliers are often viewed as those far away' from any cluster
5 Clustering as a Preprocessing Tool (Utility) ◼ Summarization: ◼ Preprocessing for regression, PCA, classification, and association analysis ◼ Compression: ◼ Image processing: vector quantization ◼ Finding K-nearest Neighbors ◼ Localizing search to one or a small number of clusters ◼ Outlier detection ◼ Outliers are often viewed as those “far away” from any cluster