Termination conditions ◼ Several possibilities, e.g., ◆ A fixed number of iterations. ◆ Data partition unchanged. ◆ Centroid positions don’t change. Does this mean that the data in a cluster are unchanged?
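These three conditions translate directly into code. Below is a minimal sketch of a termination check in Python with NumPy; the function name `converged` and the parameters `max_iter` and `tol` are illustrative choices, not part of the slides.

```python
import numpy as np

def converged(old_centroids, centroids, old_labels, labels,
              iteration, max_iter=100, tol=1e-4):
    """Return True when any of the three termination conditions holds.
    (Names and thresholds here are illustrative, not from the slides.)"""
    if iteration >= max_iter:                     # a fixed number of iterations
        return True
    if np.array_equal(old_labels, labels):        # data partition unchanged
        return True
    if np.linalg.norm(centroids - old_centroids) < tol:  # centroids don't move
        return True
    return False
```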
Convergence ◼ Why should the K-means algorithm ever reach a fixed point? ◆ A state in which clusters don’t change. ◼ K-means is a special case of a general procedure known as the Expectation Maximization (EM) algorithm. ◆ EM is known to converge. ◆ Number of iterations could be large. ➢ But in practice it usually isn’t.
Time Complexity ◼ Computing distance between two data points is O(D) where D is the dimensionality of the vectors. ◼ Reassigning clusters: O(KN) distance computations, or O(KND). ◼ Computing centroids: Each point gets added once to some centroid: O(ND). ◼ Assume these two steps are each done once for I iterations: O(IKND).
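As a concrete illustration of where these costs arise, here is a minimal K-means sketch in Python with NumPy, annotating each step with its cost. The function name `kmeans` and the random initialization are illustrative assumptions; the symbols K, N, D, and I match the slide.

```python
import numpy as np

def kmeans(X, K, I=100):
    """Plain K-means on an (N, D) data matrix X."""
    N, D = X.shape
    rng = np.random.default_rng()
    centroids = X[rng.choice(N, size=K, replace=False)]   # K initial seeds
    labels = np.zeros(N, dtype=int)
    for _ in range(I):                                    # I iterations
        # Reassignment: K*N distances, each O(D) -> O(KND) per iteration.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centroid update: each point is added to exactly one centroid -> O(ND).
        new_centroids = centroids.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members) > 0:          # keep the old centroid if a cluster empties
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):         # centroid positions fixed
            break
        centroids = new_centroids
    return centroids, labels
```

Running both steps over I iterations gives the O(IKND) total stated above.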
Strengths of K-means clustering ◼ Relatively scalable in processing large data sets. ◼ Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n. ◼ Often terminates at a local optimum; the global optimum may be found using techniques such as genetic algorithms.
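Genetic algorithms are one option; a simpler, widely used mitigation for local optima is to restart K-means from several random initializations and keep the solution with the lowest within-cluster sum of squares. The sketch below assumes the `kmeans` function from the previous example; `kmeans_restarts` and its parameters are illustrative names, not from the slides.

```python
def kmeans_restarts(X, K, restarts=10, I=100):
    """Run K-means several times and keep the lowest-SSE result.
    A practical hedge against local optima, not a global-optimum guarantee."""
    best_sse, best = float("inf"), None
    for _ in range(restarts):
        centroids, labels = kmeans(X, K, I)
        sse = ((X - centroids[labels]) ** 2).sum()  # within-cluster sum of squares
        if sse < best_sse:
            best_sse, best = sse, (centroids, labels)
    return best
```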
Weaknesses of K-means clustering ◼ Applicable only when the mean of objects is defined. ◼ Need to specify k, the number of clusters, in advance. ◼ Unable to handle noisy data and outliers. ◼ Not suitable for discovering clusters with non-convex shapes, or clusters of very different sizes.