Furthermore
• There are three big ways a data set can be large:
  – The set contains a large number of elements.
  – Each element can have many features.
  – There can be many clusters to discover.
• Conclusion
  – The clustering computation can be huge, even when you distribute it.
Canopy Clustering
• Preliminary step to help parallelize computation.
• Clusters data into overlapping canopies using a very cheap distance metric.
• Efficient
• Accurate
Canopy Clustering
While there are unmarked points {
    pick a point which is not strongly marked
    call it a canopy center
    mark all points within some threshold of it as in its canopy
    strongly mark all points within some tighter (smaller) threshold
}
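The loop above can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation: the names `canopy_clustering`, `t_loose`, `t_tight`, and `cheap_distance` are hypothetical, and picking the lowest-indexed candidate as the next center is an assumption made only to keep the result deterministic (the slide does not say how the point is picked).

```python
def canopy_clustering(points, t_loose, t_tight, cheap_distance):
    """Group points into overlapping canopies.

    t_loose is the canopy-membership threshold; t_tight is the stronger
    (smaller) threshold used to strongly mark points.
    Returns a list of (center_index, member_indices) pairs.
    """
    if not 0 <= t_tight <= t_loose:
        raise ValueError("expected 0 <= t_tight <= t_loose")

    marked = set()           # points that belong to at least one canopy
    strongly_marked = set()  # points that may no longer become centers
    canopies = []

    while len(marked) < len(points):
        # Assumption: take the lowest-indexed point not strongly marked.
        center = next(i for i in range(len(points)) if i not in strongly_marked)
        members = []
        for i, p in enumerate(points):
            d = cheap_distance(points[center], p)
            if d <= t_loose:
                members.append(i)
                marked.add(i)
            if d <= t_tight:
                strongly_marked.add(i)
        canopies.append((center, members))
    return canopies


def manhattan(a, b):
    """A cheap stand-in metric, chosen here purely for illustration."""
    return sum(abs(x - y) for x, y in zip(a, b))


points = [(0, 0), (1, 0), (0.5, 0), (10, 10), (10.5, 10)]
canopies = canopy_clustering(points, t_loose=3, t_tight=1.5, cheap_distance=manhattan)
```

Each pass strongly marks at least the chosen center (its distance to itself is zero), so the loop runs at most once per point. Because the loose threshold is wider than the tight one, a point can land in several canopies, which is where the overlap comes from.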
After the canopy clustering…
• Resume hierarchical or partitional clustering as usual.
• Treat objects that share no canopy as being infinitely far apart.
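One way to read the infinite-distance rule is as a wrapper around the expensive metric: the exact distance is computed only for pairs that share a canopy. The sketch below assumes canopies are given as (center_index, member_indices) pairs; `canopy_membership` and `canopy_aware_distance` are hypothetical helper names, and `math.dist` (Python 3.8+) stands in for whatever exact metric the later clustering step uses.

```python
import math


def canopy_membership(canopies, n_points):
    """Map each point index to the set of canopy ids that contain it."""
    member_of = [set() for _ in range(n_points)]
    for canopy_id, (_center, members) in enumerate(canopies):
        for i in members:
            member_of[i].add(canopy_id)
    return member_of


def canopy_aware_distance(i, j, points, member_of, exact_distance=math.dist):
    """Return the exact distance only if points i and j share a canopy."""
    if member_of[i].isdisjoint(member_of[j]):
        return math.inf  # never evaluate the expensive metric for this pair
    return exact_distance(points[i], points[j])


example_points = [(0, 0), (3, 4), (10, 10)]
example_canopies = [(0, [0, 1]), (2, [2])]  # assumed output of a canopy pass
member_of = canopy_membership(example_canopies, len(example_points))
```

With this wrapper, the number of exact distance evaluations is bounded by the number of pairs inside canopies rather than by all pairs in the data set.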
MapReduce Implementation:
• Problem
  – Efficiently partition a large data set (say… movies with user ratings!) into a fixed number of clusters using Canopy Clustering, K-Means Clustering, and a Euclidean distance measure.
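To make the K-Means part of the problem concrete, here is a minimal in-process sketch of a single K-Means iteration phrased as a map step and a reduce step with Euclidean distance. It only simulates the shuffle with a dictionary and does not use any real MapReduce framework API. The function names are hypothetical, and the canopy restriction is left out to keep the example short.

```python
import math
from collections import defaultdict


def kmeans_map(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    nearest = min(range(len(centers)), key=lambda c: math.dist(point, centers[c]))
    return nearest, point


def kmeans_reduce(assigned_points):
    """Reduce step: average all points assigned to one center."""
    n = len(assigned_points)
    return tuple(sum(coords) / n for coords in zip(*assigned_points))


def kmeans_iteration(points, centers):
    """Run one map/shuffle/reduce round and return the updated centers."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for p in points:
        key, value = kmeans_map(p, centers)
        groups[key].append(value)
    # Assumption: a center that attracted no points keeps its old position.
    return [
        kmeans_reduce(groups[c]) if c in groups else centers[c]
        for c in range(len(centers))
    ]


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (12.0, 10.0)]
centers = [(1.0, 1.0), (9.0, 9.0)]
new_centers = kmeans_iteration(points, centers)
```

In a distributed setting the map calls would run in parallel over partitions of the data, and the iteration would repeat until the centers stop moving.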