k-means++: The Advantages of Careful Seeding

David Arthur*    Sergei Vassilvitskii†

* Stanford University. Supported in part by NDSEG Fellowship, NSF Grant ITR-0331640, and grants from Media-X and SNRC.
† Stanford University. Supported in part by NSF Grant ITR-0331640, and grants from Media-X and SNRC.

Abstract

The k-means method is a widely used clustering technique that seeks to minimize the average squared distance between points in the same cluster. Although it offers no accuracy guarantees, its simplicity and speed are very appealing in practice. By augmenting k-means with a very simple, randomized seeding technique, we obtain an algorithm that is Θ(log k)-competitive with the optimal clustering. Preliminary experiments show that our augmentation improves both the speed and the accuracy of k-means, often quite dramatically.

1 Introduction

Clustering is one of the classic problems in machine learning and computational geometry. In the popular k-means formulation, one is given an integer k and a set of n data points in R^d. The goal is to choose k centers so as to minimize φ, the sum of the squared distances between each point and its closest center.

Solving this problem exactly is NP-hard, even with just two clusters [10], but twenty-five years ago, Lloyd [20] proposed a local search solution that is still very widely used today (see for example [1, 11, 15]). Indeed, a recent survey of data mining techniques states that it "is by far the most popular clustering algorithm used in scientific and industrial applications" [5].

Usually referred to simply as k-means, Lloyd's algorithm begins with k arbitrary centers, typically chosen uniformly at random from the data points. Each point is then assigned to the nearest center, and each center is recomputed as the center of mass of all points assigned to it. These two steps (assignment and center calculation) are repeated until the process stabilizes.

One can check that the total error φ is monotonically decreasing, which ensures that no clustering is repeated during the course of the algorithm. Since there are at most k^n possible clusterings, the process will always terminate. In practice, very few iterations are usually required, which makes the algorithm much faster than most of its competitors.

Unfortunately, the empirical speed and simplicity of the k-means algorithm come at the price of accuracy. There are many natural examples for which the algorithm generates arbitrarily bad clusterings (i.e., φ/φ_OPT is unbounded even when n and k are fixed). Furthermore, these examples do not rely on an adversarial placement of the starting centers, and the ratio can be unbounded with high probability even with the standard randomized seeding technique.

In this paper, we propose a way of initializing k-means by choosing random starting centers with very specific probabilities. Specifically, we choose a point p as a center with probability proportional to p's contribution to the overall potential. Letting φ denote the potential after choosing centers in this way, we show the following.

Theorem 1.1. For any set of data points, E[φ] ≤ 8(ln k + 2)φ_OPT.

This sampling is both fast and simple, and it already achieves approximation guarantees that k-means cannot. We propose using it to seed the initial centers for k-means, leading to a combined algorithm we call k-means++.

This complements a very recent result of Ostrovsky et al. [24], who independently proposed much the same algorithm.
Whereas they showed this randomized seeding is O(1)-competitive on data sets following a certain separation condition, we show it is O(log k)-competitive on all data sets.

We also show that the analysis for Theorem 1.1 is tight up to a constant factor, and that it can be easily extended to various potential functions in arbitrary metric spaces. In particular, we can also get a simple O(log k) approximation algorithm for the k-median objective. Furthermore, we provide preliminary experimental data showing that in practice, k-means++ really does outperform k-means in terms of both accuracy and speed, often by a substantial margin.

1.1 Related work

As a fundamental problem in machine learning, k-means has a rich history. Because of its simplicity and its observed speed, Lloyd's method [20] remains the most popular approach in practice
despite its limited accuracy. The convergence time of Lloyd's method has been the subject of a recent series of papers [2, 4, 8, 14]; in this work we focus on improving its accuracy.

In the theory community, Inaba et al. [16] were the first to give an exact algorithm for the k-means problem, with the running time of O(n^{kd}). Since then, a number of polynomial time approximation schemes have been developed (see [9, 13, 19, 21] and the references therein). While the authors develop interesting insights into the structure of the clustering problem, their algorithms are highly exponential (or worse) in k, and are unfortunately impractical even for relatively small n, k and d.

Kanungo et al. [17] proposed an O(n³ ε^{−d}) algorithm that is (9 + ε)-competitive. However, n³ compares unfavorably with the almost linear running time of Lloyd's method, and the exponential dependence on d can also be problematic. For these reasons, Kanungo et al. also suggested a way of combining their techniques with Lloyd's algorithm, but in order to avoid the exponential dependence on d, their approach sacrifices all approximation guarantees.

Mettu and Plaxton [22] also achieved a constant-probability O(1) approximation using a technique called successive sampling. They match our running time of O(nkd), but only if k is sufficiently large and the spread is sufficiently small. In practice, our approach is simpler, and our experimental results seem to be better in terms of both speed and accuracy.

Very recently, Ostrovsky et al. [24] independently proposed an algorithm that is essentially identical to ours, although their analysis is quite different. Letting φ_{OPT,k} denote the optimal potential for a k-clustering on a given data set, they prove k-means++ is O(1)-competitive in the case where φ_{OPT,k}/φ_{OPT,k−1} ≤ ε². The intuition here is that if this condition does not hold, then the data is not well suited for clustering with the given value for k.

Combining this result with ours gives a strong characterization of the algorithm's performance. In particular, k-means++ is never worse than O(log k)-competitive, and on very well formed data sets, it improves to being O(1)-competitive.

Overall, the seeding technique we propose is similar in spirit to that used by Meyerson [23] for online facility location, and Mishra et al. [12] and Charikar et al. [6] in the context of k-median clustering. However, our analysis is quite different from those works.

2 Preliminaries

In this section, we formally define the k-means problem, as well as the k-means and k-means++ algorithms.

For the k-means problem, we are given an integer k and a set of n data points X ⊂ R^d. We wish to choose k centers C so as to minimize the potential function,

    φ = Σ_{x∈X} min_{c∈C} ‖x − c‖².

Choosing these centers implicitly defines a clustering – for each center, we set one cluster to be the set of data points that are closer to that center than to any other. As noted above, finding an exact solution to the k-means problem is NP-hard.

Throughout the paper, we will let C_OPT denote the optimal clustering for a given instance of the k-means problem, and we will let φ_OPT denote the corresponding potential. Given a clustering C with potential φ, we also let φ(A) denote the contribution of A ⊂ X to the potential (i.e., φ(A) = Σ_{x∈A} min_{c∈C} ‖x − c‖²).
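To make the definitions above concrete, the following is a minimal sketch (assuming NumPy; the array names and the helper `potential` are our own illustration, not part of the paper) of how φ and a cluster contribution φ(A) can be computed from a point set and a set of centers.

```python
import numpy as np

def potential(points, centers):
    """phi: sum over the given points of the squared distance to the nearest center.

    Passing the full data set X gives phi; passing a subset A of X gives phi(A).
    """
    # Pairwise squared distances: entry (i, j) is ||points[i] - centers[j]||^2.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Each point contributes its squared distance to its closest center.
    return d2.min(axis=1).sum()
```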
2.1 The k-means algorithm

The k-means method is a simple and fast algorithm that attempts to locally improve an arbitrary k-means clustering. It works as follows.

1. Arbitrarily choose k initial centers C = {c_1, . . . , c_k}.
2. For each i ∈ {1, . . . , k}, set the cluster C_i to be the set of points in X that are closer to c_i than they are to c_j for all j ≠ i.
3. For each i ∈ {1, . . . , k}, set c_i to be the center of mass of all points in C_i: c_i = (1/|C_i|) Σ_{x∈C_i} x.
4. Repeat Steps 2 and 3 until C no longer changes.

It is standard practice to choose the initial centers uniformly at random from X. For Step 2, ties may be broken arbitrarily, as long as the method is consistent.

Steps 2 and 3 are both guaranteed to decrease φ, so the algorithm makes local improvements to an arbitrary clustering until it is no longer possible to do so. To see that Step 3 does in fact decrease φ, it is helpful to recall a standard result from linear algebra (see [14]).

Lemma 2.1. Let S be a set of points with center of mass c(S), and let z be an arbitrary point. Then,

    Σ_{x∈S} ‖x − z‖² − Σ_{x∈S} ‖x − c(S)‖² = |S| · ‖c(S) − z‖².

Monotonicity for Step 3 follows from taking S to be a single cluster and z to be its initial center.

As discussed above, the k-means algorithm is attractive in practice because it is simple and it is generally fast. Unfortunately, it is guaranteed only to find a local optimum, which can often be quite poor.
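The sketch below illustrates Steps 1–4 above, again assuming NumPy; the function name `lloyd_kmeans`, the tie-breaking via `argmin`, and the choice to keep the old center when a cluster becomes empty are our own illustrative conventions rather than anything specified in the paper.

```python
import numpy as np

def lloyd_kmeans(X, k, rng=None):
    """A sketch of Lloyd's algorithm (Steps 1-4) on an (n, d) array X."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Step 1: initial centers chosen uniformly at random from the data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # Step 2: assign every point to its nearest center (argmin breaks ties consistently).
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each center as the center of mass of its cluster
        # (keeping the old center if a cluster happens to be empty).
        new_centers = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
            for i in range(k)
        ])
        # Step 4: stop once the centers no longer change.
        if np.allclose(new_centers, centers):
            return centers, labels
        centers = new_centers
```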
2.2 The k-means++ algorithm

The k-means algorithm begins with an arbitrary set of cluster centers. We propose a specific way of choosing these centers. At any given time, let D(x) denote the shortest distance from a data point x to the closest center we have already chosen. Then, we define the following algorithm, which we call k-means++.

1a. Choose an initial center c_1 uniformly at random from X.
1b. Choose the next center c_i, selecting c_i = x′ ∈ X with probability D(x′)² / Σ_{x∈X} D(x)².
1c. Repeat Step 1b until we have chosen a total of k centers.
2-4. Proceed as with the standard k-means algorithm.

We call the weighting used in Step 1b simply "D² weighting".
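A minimal sketch of Steps 1a–1c, again assuming NumPy; the name `d_squared_seeding` and the random-number-generator plumbing are ours, not the paper's.

```python
import numpy as np

def d_squared_seeding(X, k, rng=None):
    """Choose k initial centers from the rows of X with D^2 weighting (Steps 1a-1c)."""
    if rng is None:
        rng = np.random.default_rng(0)
    # Step 1a: the first center is uniform over the data points.
    centers = [X[rng.integers(len(X))]]
    # D(x)^2 for every point, measured against the centers chosen so far.
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(1, k):
        # Step 1b: pick x' with probability D(x')^2 / sum_x D(x)^2.
        probs = d2 / d2.sum()
        new_center = X[rng.choice(len(X), p=probs)]
        centers.append(new_center)
        # A new center can only shrink D(x); update the cached distances.
        d2 = np.minimum(d2, ((X - new_center) ** 2).sum(axis=1))
    return np.array(centers)
```

In these sketches, running the earlier `lloyd_kmeans` iterations from these seeds instead of uniformly random ones corresponds to the combined k-means++ algorithm described above (Steps 2-4).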
3 k-means++ is O(log k)-competitive

In this section, we prove our main result.

Theorem 3.1. If C is constructed with k-means++, then the corresponding potential function φ satisfies E[φ] ≤ 8(ln k + 2)φ_OPT.

In fact, we prove this holds after only Step 1 of the algorithm above. Steps 2 through 4 can then only decrease φ. Not surprisingly, our experiments show this local optimization is important in practice, although it is difficult to quantify this theoretically.

Our analysis consists of two parts. First, we show that k-means++ is competitive in those clusters of C_OPT from which it chooses a center. This is easiest in the case of our first center, which is chosen uniformly at random.

Lemma 3.1. Let A be an arbitrary cluster in C_OPT, and let C be the clustering with just one center, which is chosen uniformly at random from A. Then, E[φ(A)] = 2φ_OPT(A).

Proof. Let c(A) denote the center of mass of the data points in A. By Lemma 2.1, we know that since C_OPT is optimal, it must be using c(A) as the center corresponding to the cluster A. Using the same lemma again, we see E[φ(A)] is given by,

    Σ_{a_0∈A} (1/|A|) · ( Σ_{a∈A} ‖a − a_0‖² )
        = (1/|A|) Σ_{a_0∈A} ( Σ_{a∈A} ‖a − c(A)‖² + |A| · ‖a_0 − c(A)‖² )
        = 2 Σ_{a∈A} ‖a − c(A)‖²,

and the result follows.

Our next step is to prove an analog of Lemma 3.1 for the remaining centers, which are chosen with D² weighting.

Lemma 3.2. Let A be an arbitrary cluster in C_OPT, and let C be an arbitrary clustering. If we add a random center to C from A, chosen with D² weighting, then E[φ(A)] ≤ 8φ_OPT(A).

Proof. The probability that we choose some fixed a_0 as our center, given that we are choosing our center from A, is precisely D(a_0)² / Σ_{a∈A} D(a)². Furthermore, after choosing the center a_0, a point a will contribute precisely min(D(a), ‖a − a_0‖)² to the potential. Therefore,

    E[φ(A)] = Σ_{a_0∈A} ( D(a_0)² / Σ_{a∈A} D(a)² ) · Σ_{a∈A} min(D(a), ‖a − a_0‖)².

Note by the triangle inequality that D(a_0) ≤ D(a) + ‖a − a_0‖ for all a, a_0. From this, the power-mean inequality¹ implies that D(a_0)² ≤ 2D(a)² + 2‖a − a_0‖². Summing over all a, we then have that D(a_0)² ≤ (2/|A|) Σ_{a∈A} D(a)² + (2/|A|) Σ_{a∈A} ‖a − a_0‖², and hence, E[φ(A)] is at most,

    (2/|A|) · Σ_{a_0∈A} ( Σ_{a∈A} D(a)² / Σ_{a∈A} D(a)² ) · Σ_{a∈A} min(D(a), ‖a − a_0‖)²
    + (2/|A|) · Σ_{a_0∈A} ( Σ_{a∈A} ‖a − a_0‖² / Σ_{a∈A} D(a)² ) · Σ_{a∈A} min(D(a), ‖a − a_0‖)².

In the first expression, we substitute min(D(a), ‖a − a_0‖)² ≤ ‖a − a_0‖², and in the second expression, we substitute min(D(a), ‖a − a_0‖)² ≤ D(a)². Simplifying, we then have,

    E[φ(A)] ≤ (4/|A|) · Σ_{a_0∈A} Σ_{a∈A} ‖a − a_0‖² = 8φ_OPT(A).

The last step here follows from Lemma 3.1.

We have now shown that seeding by D² weighting is competitive as long as it chooses centers from each cluster of C_OPT, which completes the first half of our argument. We now use induction to show the total error in general is at most O(log k).

¹The power-mean inequality states for any real numbers a_1, · · · , a_m that Σ a_i² ≥ (1/m)(Σ a_i)². It follows from the Cauchy-Schwarz inequality. We are only using the case m = 2 here, but we will need the general case for Lemma 3.3.
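For reference, the m = 2 case of this inequality, as it is used in the proof of Lemma 3.2, can be written out as follows (our own working, not part of the paper):

```latex
% Triangle inequality followed by the m = 2 power-mean inequality,
% x^2 + y^2 >= (x + y)^2 / 2, which holds because (x - y)^2 >= 0.
\[
  D(a_0)^2 \;\le\; \bigl( D(a) + \|a - a_0\| \bigr)^2
           \;\le\; 2\,D(a)^2 + 2\,\|a - a_0\|^2 .
\]
```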
Lemma 3.3. Let C be an arbitrary clustering. Choose u > 0 "uncovered" clusters from C_OPT, and let X_u denote the set of points in these clusters. Also let X_c = X − X_u. Now suppose we add t ≤ u random centers to C, chosen with D² weighting. Let C′ denote the resulting clustering, and let φ′ denote the corresponding potential. Then, E[φ′] is at most,

    ( φ(X_c) + 8φ_OPT(X_u) ) · (1 + H_t) + ((u − t)/u) · φ(X_u).

Here, H_t denotes the harmonic sum, 1 + 1/2 + · · · + 1/t.

Proof. We prove this by induction, showing that if the result holds for (t − 1, u) and (t − 1, u − 1), then it also holds for (t, u). Therefore, it suffices to check t = 0, u > 0 and t = u = 1 as our base cases.

If t = 0 and u > 0, the result follows from the fact that 1 + H_0 = (u − t)/u = 1. Next, suppose t = u = 1. We choose our one new center from the one uncovered cluster with probability exactly φ(X_u)/φ. In this case, Lemma 3.2 guarantees that E[φ′] ≤ φ(X_c) + 8φ_OPT(X_u). Since φ′ ≤ φ even if we choose a center from a covered cluster, we have,

    E[φ′] ≤ (φ(X_u)/φ) · ( φ(X_c) + 8φ_OPT(X_u) ) + (φ(X_c)/φ) · φ
          ≤ 2φ(X_c) + 8φ_OPT(X_u).

Since 1 + H_1 = 2 here, we have shown the result holds for both base cases.

We now proceed to prove the inductive step. It is convenient here to consider two cases. First suppose we choose our first center from a covered cluster. As above, this happens with probability exactly φ(X_c)/φ. Note that this new center can only decrease φ. Bearing this in mind, apply the inductive hypothesis with the same choice of covered clusters, but with t decreased by one. It follows that our contribution to E[φ′] in this case is at most,

    (φ(X_c)/φ) · ( ( φ(X_c) + 8φ_OPT(X_u) ) · (1 + H_{t−1}) + ((u − t + 1)/u) · φ(X_u) ).

On the other hand, suppose we choose our first center from some uncovered cluster A. This happens with probability φ(A)/φ. Let p_a denote the probability that we choose a ∈ A as our center, given the center is somewhere in A, and let φ_a denote φ(A) after we choose a as our center. Once again, we apply our inductive hypothesis, this time adding A to the set of covered clusters, as well as decreasing both t and u by 1. It follows that our contribution to E[φ′] in this case is at most,

    (φ(A)/φ) · Σ_{a∈A} p_a ( ( φ(X_c) + φ_a + 8φ_OPT(X_u) − 8φ_OPT(A) ) · (1 + H_{t−1}) + ((u − t)/(u − 1)) · ( φ(X_u) − φ(A) ) )
    ≤ (φ(A)/φ) · ( ( φ(X_c) + 8φ_OPT(X_u) ) · (1 + H_{t−1}) + ((u − t)/(u − 1)) · ( φ(X_u) − φ(A) ) ).

The last step here follows from the fact that Σ_{a∈A} p_a φ_a ≤ 8φ_OPT(A), which is implied by Lemma 3.2.

Now, the power-mean inequality implies that Σ_{A⊂X_u} φ(A)² ≥ (1/u) · φ(X_u)². Therefore, if we sum over all uncovered clusters A, we obtain a potential contribution of at most,

    (φ(X_u)/φ) · ( φ(X_c) + 8φ_OPT(X_u) ) · (1 + H_{t−1}) + (1/φ) · ((u − t)/(u − 1)) · ( φ(X_u)² − (1/u) · φ(X_u)² )
    = (φ(X_u)/φ) · ( ( φ(X_c) + 8φ_OPT(X_u) ) · (1 + H_{t−1}) + ((u − t)/u) · φ(X_u) ).

Combining the potential contribution to E[φ′] from both cases, we now obtain the desired bound:

    E[φ′] ≤ ( φ(X_c) + 8φ_OPT(X_u) ) · (1 + H_{t−1}) + ((u − t)/u) · φ(X_u) + (φ(X_c)/φ) · (φ(X_u)/u)
          ≤ ( φ(X_c) + 8φ_OPT(X_u) ) · ( 1 + H_{t−1} + 1/u ) + ((u − t)/u) · φ(X_u).

The inductive step now follows from the fact that 1/u ≤ 1/t.

We specialize Lemma 3.3 to obtain our main result.

Theorem 3.1. If C is constructed with k-means++, then the corresponding potential function φ satisfies E[φ] ≤ 8(ln k + 2)φ_OPT.

Proof. Consider the clustering C after we have completed Step 1. Let A denote the C_OPT cluster in which we chose the first center.
Applying Lemma 3.3 with t = u = k − 1 and with A being the only covered cluster, we have,

    E[φ] ≤ ( φ(A) + 8φ_OPT − 8φ_OPT(A) ) · (1 + H_{k−1}).

The result now follows from Lemma 3.1, and from the fact that H_{k−1} ≤ 1 + ln k.

4 A matching lower bound

In this section, we show that the D² seeding used by k-means++ is no better than Ω(log k)-competitive in expectation, thereby proving Theorem 3.1 is tight within a constant factor.

Fix k, and then choose n, ∆, δ with n ≫ k and ∆ ≫ δ. We construct X with n points. First choose k centers c_1, c_2, . . . , c_k such that ‖c_i − c_j‖² = ∆² − ((n−k)/n) · δ² for all i ≠ j. Now, for each c_i, add data points x_{i,1}, x_{i,2}, · · · , x_{i,n/k} arranged in a regular simplex with center c_i, side length δ, and radius √((n−k)/(2n)) · δ. If we do this in orthogonal dimensions for each i, we then have,

    ‖x_{i,i′} − x_{j,j′}‖ = δ if i = j, and ∆ otherwise.

We prove our seeding technique is in expectation Ω(log k) worse than the optimal clustering in this case. Clearly, the optimal clustering has centers {c_i}, which leads to an optimal potential of φ_OPT = ((n−k)/2) · δ². Conversely, using an induction similar to that of Lemma 3.3, we show D² seeding cannot match this bound. As before, we bound the expected potential in terms of the number of centers left to choose and the number of uncovered clusters (those clusters of C_OPT from which we have not chosen a center).

Lemma 4.1. Let C be an arbitrary clustering on X with k − t ≥ 1 centers, but with u clusters from C_OPT uncovered. Now suppose we add t random centers to C, chosen with D² weighting. Let C′ denote the resulting clustering, and let φ′ denote the corresponding potential.

Furthermore, let α = (n − k²)/n, β = (∆² − 2kδ²)/∆², and H′_u = Σ_{i=1}^{u} (k−i)/(ki). Then, E[φ′] is at least,

    α^{t+1} · ( nδ² · (1 + H′_u) · β + ( (n/k)·∆² − 2nδ² ) · (u − t) ).

Proof. We prove this by induction on t. If t = 0, note that,

    φ′ = φ = ( n − u·(n/k) − k ) · δ² + u·(n/k) · ∆².

Since n − u·(n/k) ≥ n/k, we have ( n − u·(n/k) − k ) / ( n − u·(n/k) ) ≥ ( n/k − k ) / ( n/k ) = α. Also, α, β ≤ 1. Therefore,

    φ′ ≥ α · ( ( n − u·(n/k) ) · δ² · β + u·(n/k) · ∆² ).

Finally, since nδ²u ≥ u·(n/k)·δ²·β and nδ²u ≥ nδ²H′_u·β, we have,

    φ′ ≥ α · ( nδ² · (1 + H′_u) · β + ( (n/k)·∆² − 2nδ² ) · u ).

This completes the base case.

We now proceed to prove the inductive step. As with Lemma 3.3, we consider two cases. The probability that our first center is chosen from an uncovered cluster is,

    u·(n/k)·∆² / ( u·(n/k)·∆² + (k−u)·(n/k)·δ² − (k−t)δ² )
        ≥ u∆² / ( u∆² + (k−u)δ² )
        ≥ α · u∆² / ( u∆² + (k−u)δ² ).

Applying our inductive hypothesis with t and u both decreased by 1, we obtain a potential contribution from this case of at least,

    ( u∆² / ( u∆² + (k−u)δ² ) ) · α^{t+1} · ( nδ² · (1 + H′_{u−1}) · β + ( (n/k)·∆² − 2nδ² ) · (u − t) ).

The probability that our first center is chosen from a covered cluster is,

    ( (k−u)·(n/k)·δ² − (k−t)δ² ) / ( u·(n/k)·∆² + (k−u)·(n/k)·δ² − (k−t)δ² )
        ≥ ( ( (k−u)·(n/k)·δ² − (k−t)δ² ) / ( (k−u)·(n/k)·δ² ) ) · ( (k−u)δ² / ( u∆² + (k−u)δ² ) )
        ≥ α · (k−u)δ² / ( u∆² + (k−u)δ² ).

Applying our inductive hypothesis with t decreased by 1 but with u constant, we obtain a potential contribution from this case of at least,

    ( (k−u)δ² / ( u∆² + (k−u)δ² ) ) · α^{t+1} · ( nδ² · (1 + H′_u) · β + ( (n/k)·∆² − 2nδ² ) · (u − t + 1) ).

Therefore, E[φ′] is at least,

    α^{t+1} · ( nδ² · (1 + H′_u) · β + ( (n/k)·∆² − 2nδ² ) · (u − t) )
    + ( α^{t+1} / ( u∆² + (k−u)δ² ) ) · ( (k−u)δ² · ( (n/k)·∆² − 2nδ² ) − u∆² · ( H′_u − H′_{u−1} ) · nδ² · β )
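As a concrete illustration of the point set constructed at the start of this section, the sketch below (our own, assuming NumPy; the function name and the specific embedding are illustrative assumptions) realizes "orthogonal dimensions for each i" by giving each simplex its own block of coordinates and placing the centers c_i on a scaled simplex in a final block.

```python
import numpy as np

def lower_bound_instance(n, k, Delta, delta):
    """Build the Section 4 point set: k regular simplices of n/k points each,
    with intra-cluster distance delta and inter-cluster distance Delta."""
    m = n // k                                           # points per cluster; assumes k divides n
    L = np.sqrt(Delta ** 2 - (n - k) / n * delta ** 2)   # required distance between the centers c_i
    X = np.zeros((n, k * m + k))
    for i in range(k):
        # A regular simplex with side length delta: scaled standard basis vectors,
        # recentered so its centroid sits at the origin of block i
        # (its circumradius is then sqrt((n - k)/(2n)) * delta).
        simplex = (delta / np.sqrt(2)) * np.eye(m)
        simplex -= simplex.mean(axis=0)
        X[i * m:(i + 1) * m, i * m:(i + 1) * m] = simplex
        # The centers c_i themselves form a scaled simplex in the last k coordinates,
        # so that ||c_i - c_j||^2 = Delta^2 - ((n - k)/n) * delta^2 for i != j.
        X[i * m:(i + 1) * m, k * m + i] = L / np.sqrt(2)
    return X
```

With this embedding, points in the same cluster are at distance δ and points in different clusters are at distance ∆, matching the displayed identity above.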