Lock-Free Optimization for Non-Convex Problems Shen-Yi Zhao,Gong-Duo Zhang and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer Science and Technology,Nanjing University,China zhaosy, zhanggd)elamda.nju.edu.cn,liwujun@nju.edu.cn Abstract 2013)to solve it.Many works (Roux,Schmidt,and Bach 2012:Shalev-Shwartz and Zhang 2013:Johnson and Zhang Stochastic gradient descent (SGD)and its variants have at- tracted much attention in machine learning due to their ef- 2013)have found that SGD-based methods can achieve ficiency and effectiveness for optimization.To handle large- promising performance in large-scale learning problem- scale problems,researchers have recently proposed several s.According to the implementation platforms or system- lock-free strategy based parallel SGD (LF-PSGD)method- s,existing SGD-based methods can be divided into three s for multi-core systems.However,existing works have on- categories:sequential SGD (SSGD)methods,parallel S- ly proved the convergence of these LF-PSGD methods for GD (PSGD)methods.and distributed SGD (DSGD)meth- convex problems.To the best of our knowledge,no work ods.SSGD methods are designed for a single thread on has proved the convergence of the LF-PSGD methods for a single machine,PSGD methods are designed for multi- non-convex problems.In this paper,we provide the theoret- core (multi-thread)on a single machine with a shared memo- ical proof about the convergence of two representative LF- PSGD methods.Hogwild!and AsySVRG.for non-convex ry,and DSGD methods are designed for multiple machines. problems.Empirical results also show that both Hogwild!and When the problem in (1)is convex,the SGD meth- AsySVRG are convergent on non-convex problems,which ods,including SSGD (Roux,Schmidt.and Bach 2012: successfully verifies our theoretical results. Shalev-Shwartz and Zhang 2013;Johnson and Zhang 2013), PSGD (Recht et al.2011)and DSGD (Jaggi et al.2014; Li et al.2014;Xing et al.2015;Zhang,Zheng,and K- Introduction wok 2016),have achieved very promising empirical perfor- Many machine learning models can be formulated as the fol- mance.Furthermore,good theoretical results about the con- lowing optimization problem: vergence of the SGD methods are also provided by these existing works. 1>f(w) min (1 In many real applications,the problems to optimize can be i=1 non-convex.For example,the problems for the neural net- works are typically non-convex.Because many researcher- where w is the parameter to learn (optimize),n is the num s (Li et al.2014;Xing et al.2015)find that the SGD ber of training instances,fi(w)is the loss defined on in- methods can also achieve good empirical results for non- stance i.For example,assuming we are given a set of la- convex problems,theoretical proof about the convergence beled instances {(xy)i=1,2,...,n},where xiERd is of SGD methods for non-convex problems has recently at- the feature vector and yi{1,-1}is the label of xi,fi(w) tracted much attention.Some progress has been achieved can be log(1+e)+wll2 which is known as the For example,the works in (Ghadimi and Lan 2013:Red- regularized loss in logistic regression(LR).We can also take di et al.2016:Li et al.2016:Allen-Zhu and Hazan 2016: fi(w)to be max{0,1-w}+w2 which is known as Allen-Zhu and Yuan 2016)have proved the convergence of the regularized loss in support vector machine(SVM).Here, the sequential SGD and its variants for non-convex prob- A is the regularization hyper-parameter.Moreover,many lems.There are also some other theoretical results for some other machine learning models,including neural network- particular non-convex problems,like PCA (Shamir 2015; s(Krizhevsky,Sutskever,and Hinton 2012),matrix factor- 2016a;2016b)and matrix factorization (Sa,Re,and Oluko- ization (Koren,Bell,and Volinsky 2009),and principal com- tun 2015).But all these works are only for SSGD methods. ponent analysis (PCA)(Shamir 2015)and so on,can also be There have appeared only two works (Lian et al.2015: formulated as that in (1). Huo and Huang 2016)which propose PSGD methods for When the problem in (1)is large-scale,i.e.,n is large, non-convex problems with theoretical proof of convergence. researchers have recently proposed stochastic gradient de- scent(SGD)and its variants like SVRG (Johnson and Zhang IIn some literatures,PSGD refers to the methods implemented on both multi-core and multi-machine systems.In this paper,PS- Copyright c)2017,Association for the Advancement of Artificial GD only refers to the methods implemented on multi-core systems Intelligence (www.aaai.org).All rights reserved. with a shared memory
Lock-Free Optimization for Non-Convex Problems Shen-Yi Zhao, Gong-Duo Zhang and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer Science and Technology, Nanjing University, China {zhaosy, zhanggd}@lamda.nju.edu.cn, liwujun@nju.edu.cn Abstract Stochastic gradient descent (SGD) and its variants have attracted much attention in machine learning due to their ef- ficiency and effectiveness for optimization. To handle largescale problems, researchers have recently proposed several lock-free strategy based parallel SGD (LF-PSGD) methods for multi-core systems. However, existing works have only proved the convergence of these LF-PSGD methods for convex problems. To the best of our knowledge, no work has proved the convergence of the LF-PSGD methods for non-convex problems. In this paper, we provide the theoretical proof about the convergence of two representative LFPSGD methods, Hogwild! and AsySVRG, for non-convex problems. Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results. Introduction Many machine learning models can be formulated as the following optimization problem: min w 1 n Xn i=1 fi(w), (1) where w is the parameter to learn (optimize), n is the number of training instances, fi(w) is the loss defined on instance i. For example, assuming we are given a set of labeled instances {(xi , yi)|i = 1, 2, . . . , n}, where xi ∈ R d is the feature vector and yi ∈ {1, −1} is the label of xi , fi(w) can be log(1 + e −yix T i w) + λ 2 kwk 2 which is known as the regularized loss in logistic regression (LR). We can also take fi(w)to be max{0, 1−yix T i w}+ λ 2 kwk 2 which is known as the regularized loss in support vector machine (SVM). Here, λ is the regularization hyper-parameter. Moreover, many other machine learning models, including neural networks (Krizhevsky, Sutskever, and Hinton 2012), matrix factorization (Koren, Bell, and Volinsky 2009), and principal component analysis (PCA) (Shamir 2015) and so on, can also be formulated as that in (1). When the problem in (1) is large-scale, i.e., n is large, researchers have recently proposed stochastic gradient descent (SGD) and its variants like SVRG (Johnson and Zhang Copyright c 2017, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved. 2013) to solve it. Many works (Roux, Schmidt, and Bach 2012; Shalev-Shwartz and Zhang 2013; Johnson and Zhang 2013) have found that SGD-based methods can achieve promising performance in large-scale learning problems. According to the implementation platforms or systems, existing SGD-based methods can be divided into three categories: sequential SGD (SSGD) methods, parallel SGD (PSGD) methods, and distributed SGD (DSGD) methods. SSGD methods are designed for a single thread on a single machine, PSGD methods are designed for multicore (multi-thread) on a single machine with a shared memory1 , and DSGD methods are designed for multiple machines. When the problem in (1) is convex, the SGD methods, including SSGD (Roux, Schmidt, and Bach 2012; Shalev-Shwartz and Zhang 2013; Johnson and Zhang 2013), PSGD (Recht et al. 2011) and DSGD (Jaggi et al. 2014; Li et al. 2014; Xing et al. 2015; Zhang, Zheng, and Kwok 2016), have achieved very promising empirical performance. Furthermore, good theoretical results about the convergence of the SGD methods are also provided by these existing works. In many real applications, the problems to optimize can be non-convex. For example, the problems for the neural networks are typically non-convex. Because many researchers (Li et al. 2014; Xing et al. 2015) find that the SGD methods can also achieve good empirical results for nonconvex problems, theoretical proof about the convergence of SGD methods for non-convex problems has recently attracted much attention. Some progress has been achieved. For example, the works in (Ghadimi and Lan 2013; Reddi et al. 2016; Li et al. 2016; Allen-Zhu and Hazan 2016; Allen-Zhu and Yuan 2016) have proved the convergence of the sequential SGD and its variants for non-convex problems. There are also some other theoretical results for some particular non-convex problems, like PCA (Shamir 2015; 2016a; 2016b) and matrix factorization (Sa, Re, and Olukotun 2015). But all these works are only for SSGD methods. There have appeared only two works (Lian et al. 2015; Huo and Huang 2016) which propose PSGD methods for non-convex problems with theoretical proof of convergence. 1 In some literatures, PSGD refers to the methods implemented on both multi-core and multi-machine systems. In this paper, PSGD only refers to the methods implemented on multi-core systems with a shared memory
However,the PSGD methods in (Lian et al.2015)need This is a common assumption for the convergence analy- write-lock or atomic operation for the memory to prove sis of most existing gradient-based methods. the convergence 2.Similarly,the work in (Huo and Huang Since we focus on non-convex problems in this paper,it 2016)also does not prove the convergence for the lock- is difficult to get the global solution of(1)based on the gra- free case in our paper.Recent works (Recht et al.2011; dient methods.Hence,we use llVf(w)2 to measure the Chaturapruek,Duchi,and Re 2015;J.Reddi et al.2015; convergence instead of f(w)-min f(w). Zhao and Li 2016)find that lock-free strategy based parallel SGD(LF-PSGD)methods can empirically outperform lock- Here,we give a Lemma which is useful in the conver- based PSGD methods for multi-core systems.Although gence analysis of Hogwild!and AsySVRG. some existing works(Chaturapruek,Duchi,and Re 2015; Lemma 1.Assume B is a positive semi-definite matrix with Zhao and Li 2016)have proved the convergence of these LF- the largest eigenvalue less than or equal to I and the mini- PSGD methods for convex problems,no work has proved mum eigenvalue a >0,we have:Vx,y, the convergence of the LF-PSGD methods for non-convex problems. In this paper,we provide the theoretical proof about the -vf✉Bfw)≤乞Ix-yP-INfx)P. convergence of two representative LF-PSGD methods,Hog- Proof. wild!(Recht et al.2011;Chaturapruek,Duchi,and Re 2015) and AsySVRG(Zhao and Li 2016),for non-convex prob- lems.The contribution of this work can be outlined as fol- 含IVfP-V/(*)Bv5 lows: ≤号Bvf-Vf(x)BV/) .Theoretical results show that both Hogwild!and AsySVRG can converge with lock-free strategy for non- convex problems. ≤号Bfx-vf(x)BVf()+Bfy) .Hogwild!gets a convergence rate of (1/)for non- -号IB)-f川 convex problems,where T=p x T is the total iteration number of p threads. s号k-R AsySVRG gets a convergence rate of O(1/T)for non- 口 convex problems. To get an e-local optimal solution for AsySVRG,the Hogwild!for Non-Convex Problems computation complexity by all threads is O(n/e),or e- The Hogwild!method(Recht et al.2011)is listed in Algo- quivalently the computation complexity of each thread is rithm 1.Each thread reads w from the shared memory,com- ()This is faster than traditional parallel gradient de- putes a stochastic gradient and updates the w in the shared cent methods whose computation complexity is O()for memory.Please note that Hogwild!in(Recht et al.2011)has each thread. several variants with locks or lock-free.Here,we only focus on the lock-free variant of Hogwild!,which means that we Empirical results also show that both Hogwild!and do not use any locks,either read-lock or write-lock,for all AsySVRG are convergent on non-convex problems, threads which successfully verifies our theoretical results. Algorithm 1 Hogwild! Preliminary Initialization:p threads,initialize wo,n: We use f(w)to denote the objective function in(1),which For each thread,do: means f(w)=∑-1fi(w.And we use·lto denote for /=0,1,2,...,T-1 do theL2-noml·2. Read current w in the shared memory,denoted as w: Randomly pick up an i from 1,...,n and compute the gra- Assumption 1.The function fi()in (1)is smooth,which dient Vfi(w); means that there exists a constant L>0.Va,b, w←w-Vf(w)月 end for f:(b)sfi(a)+Vf:(a)"(b-a)+lb-all2 As in (Zhao and Li 2016).we can construct an equivalent or equivalently write sequence w:: llVfi(b)-Vfi(a)Llb-all Wi+1=Wi-nBiVfi,(wi), (2) 2Although the implementation of AsySG-incon in (Lian et al. where 0<t pxT,B:is a random diagonal matrix whose 2015)is lock-free,the theoretical analysis about the convergence diagonal entries are 0 or 1.The B:is used to denote whether of AsySG-incon is based on an assumption that no over-writing over-writing happens.If the kth diagonal entry of B:is 0,it happens,i.e.,the theoretical analysis is not for the lock-free case. means that the kth element in the gradient vector Vfi(wt)
However, the PSGD methods in (Lian et al. 2015) need write-lock or atomic operation for the memory to prove the convergence 2 . Similarly, the work in (Huo and Huang 2016) also does not prove the convergence for the lockfree case in our paper. Recent works (Recht et al. 2011; Chaturapruek, Duchi, and Re 2015; J. Reddi et al. 2015; ´ Zhao and Li 2016) find that lock-free strategy based parallel SGD (LF-PSGD) methods can empirically outperform lockbased PSGD methods for multi-core systems. Although some existing works (Chaturapruek, Duchi, and Re 2015; ´ Zhao and Li 2016) have proved the convergence of these LFPSGD methods for convex problems, no work has proved the convergence of the LF-PSGD methods for non-convex problems. In this paper, we provide the theoretical proof about the convergence of two representative LF-PSGD methods, Hogwild! (Recht et al. 2011; Chaturapruek, Duchi, and Re 2015) ´ and AsySVRG (Zhao and Li 2016), for non-convex problems. The contribution of this work can be outlined as follows: • Theoretical results show that both Hogwild! and AsySVRG can converge with lock-free strategy for nonconvex problems. • Hogwild! gets a convergence rate of O(1/ p T˜) for nonconvex problems, where T˜ = p × T is the total iteration number of p threads. • AsySVRG gets a convergence rate of O(1/T˜) for nonconvex problems. • To get an -local optimal solution for AsySVRG, the computation complexity by all threads is O(n 2 3 /), or equivalently the computation complexity of each thread is O( n 2 3 p ). This is faster than traditional parallel gradient decent methods whose computation complexity is O( n p ) for each thread. • Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results. Preliminary We use f(w) to denote the objective function in (1), which means f(w) = 1 n Pn i=1 fi(w). And we use k · k to denote the L2-norm k · k2. Assumption 1. The function fi(·) in (1) is smooth, which means that there exists a constant L > 0, ∀a, b, fi(b) ≤ fi(a) + ∇fi(a) T (b − a) + L 2 kb − ak 2 , or equivalently k∇fi(b) − ∇fi(a)k ≤ Lkb − ak. 2Although the implementation of AsySG-incon in (Lian et al. 2015) is lock-free, the theoretical analysis about the convergence of AsySG-incon is based on an assumption that no over-writing happens, i.e., the theoretical analysis is not for the lock-free case. This is a common assumption for the convergence analysis of most existing gradient-based methods. Since we focus on non-convex problems in this paper, it is difficult to get the global solution of (1) based on the gradient methods. Hence, we use k∇f(w)k 2 to measure the convergence instead of f(w) − min w f(w). Here, we give a Lemma which is useful in the convergence analysis of Hogwild! and AsySVRG. Lemma 1. Assume B is a positive semi-definite matrix with the largest eigenvalue less than or equal to 1 and the minimum eigenvalue α > 0, we have: ∀x, y, −∇f(x) T B∇f(y) ≤ L 2 2 kx − yk 2 − α 2 k∇f(x)k 2 . Proof. α 2 k∇f(x)k 2 − ∇f(x) T B∇f(y) ≤ 1 2 B 1 2 ∇f(x) 2 − ∇f(x) T B∇f(y) ≤ 1 2 B 1 2 ∇f(x) 2 − ∇f(x) T B∇f(y) + 1 2 B 1 2 ∇f(y) 2 = 1 2 B 1 2 (∇f(x) − ∇f(y)) 2 ≤ L 2 2 kx − yk 2 . Hogwild! for Non-Convex Problems The Hogwild! method (Recht et al. 2011) is listed in Algorithm 1. Each thread reads w from the shared memory, computes a stochastic gradient and updates the w in the shared memory. Please note that Hogwild! in (Recht et al. 2011) has several variants with locks or lock-free. Here, we only focus on the lock-free variant of Hogwild!, which means that we do not use any locks, either read-lock or write-lock, for all threads. Algorithm 1 Hogwild! Initialization: p threads, initialize w0, η; For each thread, do: for l = 0, 1, 2, ..., T − 1 do Read current w in the shared memory, denoted as wˆ ; Randomly pick up an i from {1, . . . , n} and compute the gradient ∇fi(wˆ ); w ← w − η∇fi(wˆ ); end for As in (Zhao and Li 2016), we can construct an equivalent write sequence {wt}: wt+1 = wt − ηBt∇fit (wˆ t), (2) where 0 ≤ t ≤ p×T, Bt is a random diagonal matrix whose diagonal entries are 0 or 1. The Bt is used to denote whether over-writing happens. If the kth diagonal entry of Bt is 0, it means that the kth element in the gradient vector ∇fit (wˆ t)
is overwritten by other threads.Otherwise,that element is Lemma 3.With the condition about p,n in Lemma 2,we not overwritten. have wt is read by the thread who computes Vfi,(wt)and has the following format: E4-,P≤4rpp'-业g) p-1 (4) Combining with Assumption 5,we can find that the gap of wt=Wa(t) Pt.j-a(t)Vfi,(wj) (3) the write sequence and read sequence can always be bound- j=a(t) ed by a constantnr where a(t)means that some old stochastic gradients have been completely written on the w in the shared memory. Theorem 1.Let A=21(w)and B=2V2( a(p-1) Pjt)is a diagonal matrix whose diagonal entries are 0 会If we take the stepsize n=√where T-pxT, 「4 or 1,which means w:might include parts of new stochastic gradients. we can get the following result: In the lock-free strategy,we need the following assump- 一1 tions to guarantee convergence: AB Ivf(wP≤V Assumption2.a(t)is bounded by:0≤t-a(t)≤T It means that the old stochastic gradients Proof.According to Assumption 1,we have Vfio,...,Vfihave been completely written on w in the shared memory. E[f(wt+i)小wt,w] Assumption 3.We consider the matrix Bt as a random ma- <f(w:)-nE[Vf(w:)TB:Vfi(w:)lwt,w:] trix and EBwt,wt=B 0 with the minimum eigen- value a 0. +牙v,副 According to the definition of B:,it is easy to find B:,B =f(wt)-nVf(w:)TBVf(w:) are positive semi-definite matrices and the largest eigenvalue of B is less than or equal to 1.Assumption 3 means that the +牙v.(时 probability that over-writing happens is at most 1-a <1 for each write step. ≤fw)-受Ivf0wP+w-a4P Assumption 4.Bt and it are independent. Since it is the random index selected by each thread while E了j.()w,, 十2 B:is highly affected by the hardware,the independence as- sumption is reasonable. where the first equality uses Assumption 4,the second in- For Hogwild!,the following assumption is also necessary: equality uses Lemma 1.Taking expectation on the above in- equality,we obtain Assumption 5.There exists a constant V,Vfi(w) V,i=1,,n. Ef(wt+1) For convenience.in this section.we denote ≤B-受wP+空2-aP q(x)= ∑f(x)2 Ln2v2 + n i=1 2 It is easy to find that Eg(w:)=ElVfi (w)2]and note ≤gf)-罗NwP that when x is close to some stationary point,g(x)may still be far away from 0.Hence,it is not a variance reduction +ve2Bnm-D+克 method and we need to control the variance of the stochastic p-1 gradient. where the first inequality uses Assumption 5 and second in- The difficulty of the analysis is wt wt.Here,we give equality uses Lemma 3.Summing the above inequality from the following Lemmas 3: t=0 to T-1,we get Lemma 2.In Hogwild!,we have Eq(wt)<pEq(wt+1)if 7-1 p,n satisfy ∑ElVf(w)I2 t=0 1-7-物+1L2(or+1-D≤p. 2 p-1 ≤2f0wo)+2niv2(2rIor-卫+ an a(p-1) 20 3The proof of some Lemmas can be found in the supplemen- For convenience,let A 2f(wo)and B tary material,which can be downloaded from http://cs.nju. a edu.cn/1wj/paper/LFnonConvex_sup.pdf. 2).which are two bounded constants. a(p-1)
is overwritten by other threads. Otherwise, that element is not overwritten. wˆ t is read by the thread who computes ∇fit (wˆ t) and has the following format: wˆ t = wa(t) − η Xt−1 j=a(t) Pt,j−a(t)∇fij (wˆ j ), (3) where a(t) means that some old stochastic gradients have been completely written on the w in the shared memory. Pt,j−a(t) is a diagonal matrix whose diagonal entries are 0 or 1, which means wˆ t might include parts of new stochastic gradients. In the lock-free strategy, we need the following assumptions to guarantee convergence: Assumption 2. a(t) is bounded by: 0 ≤ t − a(t) ≤ τ It means that the old stochastic gradients ∇fi0 , . . . , ∇fit−τ−1 have been completely written on w in the shared memory. Assumption 3. We consider the matrix Bt as a random matrix and E[Bt|wt, wˆ t] = B 0 with the minimum eigenvalue α > 0. According to the definition of Bt, it is easy to find Bt, B are positive semi-definite matrices and the largest eigenvalue of B is less than or equal to 1. Assumption 3 means that the probability that over-writing happens is at most 1 − α < 1 for each write step. Assumption 4. Bt and it are independent. Since it is the random index selected by each thread while Bt is highly affected by the hardware, the independence assumption is reasonable. For Hogwild!, the following assumption is also necessary: Assumption 5. There exists a constant V , k∇fi(w)k ≤ V, i = 1, . . . , n. For convenience, in this section, we denote q(x) = 1 n Xn i=1 kfi(x)k 2 . It is easy to find that Eq(wˆ t) = E[k∇fit (wˆ t)k 2 ] and note that when x is close to some stationary point, q(x) may still be far away from 0. Hence, it is not a variance reduction method and we need to control the variance of the stochastic gradient. The difficulty of the analysis is wt 6= wˆ t. Here, we give the following Lemmas 3 : Lemma 2. In Hogwild!, we have Eq(wˆ t) ≤ ρEq(wˆ t+1) if ρ, η satisfy 1 1 − η − 9η(τ+1)L2(ρτ+1−1) ρ−1 ≤ ρ. 3The proof of some Lemmas can be found in the supplementary material, which can be downloaded from http://cs.nju. edu.cn/lwj/paper/LFnonConvex_sup.pdf. Lemma 3. With the condition about ρ, η in Lemma 2, we have Ekwt − wˆ tk 2 ≤ 4η 2 τ ρ(ρ τ − 1) ρ − 1 Eq(wˆ t) (4) Combining with Assumption 5, we can find that the gap of the write sequence and read sequence can always be bounded by a constant 4η 2V 2 τρ(ρ τ −1) ρ−1 . Theorem 1. Let A = 2f(w0) α and B = 2V 2 ( 2τL2ηρ(ρ τ −1) α(ρ−1) + L 2α ). If we take the stepsize η = q A T B˜ , where T˜ = p × T, we can get the following result: 1 T˜ T X˜−1 t=0 Ek∇f(wt)k 2 ≤ r AB T˜ . Proof. According to Assumption 1, we have E[f(wt+1)|wt, wˆ t] ≤f(wt) − ηE[∇f(wt) T Bt∇fit (wˆ t)|wt, wˆ t] + Lη2 2 E[k∇fit (wˆ t)k 2 |wt, wˆ t] =f(wt) − η∇f(wt) T B∇f(wˆ t) + Lη2 2 E[k∇fit (wˆ t)k 2 |wt, wˆ t] ≤f(wt) − αη 2 k∇f(wt)k 2 + L 2η 2 kwt − wˆ tk 2 + Lη2 2 E[k∇fit (wˆ t)k 2 |wt, wˆ t], where the first equality uses Assumption 4, the second inequality uses Lemma 1. Taking expectation on the above inequality, we obtain Ef(wt+1) ≤Ef(wt) − αη 2 Ek∇f(wt)k 2 + L 2η 2 Ekwt − wˆ tk 2 + Lη2V 2 2 ≤Ef(wt) − αη 2 Ek∇f(wt)k 2 + η 2V 2 ( 2τL2ηρ(ρ τ − 1) ρ − 1 + L 2 ), where the first inequality uses Assumption 5 and second inequality uses Lemma 3. Summing the above inequality from t = 0 to T˜ − 1, we get T X˜−1 t=0 Ek∇f(wt)k 2 ≤ 2 αη f(w0) + 2ηT V˜ 2 ( 2τL2ηρ(ρ τ − 1) α(ρ − 1) + L 2α ). For convenience, let A = 2f(w0) α and B = 2V 2 ( 2τL2ηρ(ρ τ −1) α(ρ−1) + L 2α ), which are two bounded constants
If we take the get since all the stochastic gradients have been written on w at the end of the tth outer-loop.Here,we also need the assump- 7-1 ions:0≤m-a(m)≤T;EB,mlut,m,it.m]=B>0 with the minimum eigenvalue a 0;Bt.m and it.m are t=0 independent.These assumptions are similar to those in the previous section. For convenience,let pi(x)=Vfi(x)-Vfi(ut.o)+ Hence,our theoretical result shows that Hogwild!with Vf(ut.o),and in this section,we denote 2 lock-free strategy gets a convergence rate of O(1/VT)for non-convex problems,where T=px T is the total iteration x=∑Ipx)2 i=1 number of p threads. It easy to find that Eq(it.m)=E[vt.m]. AsySVRG for Non-Convex Problems The difference between Hogwild!and AsySVRG is the s- tochastic gradient and we have the following Lemmas which The AsySVRG method(Zhao and Li 2016)is listed in Algo- lead to fast convergence rate of AsySVRG: rithm 2.AsySVRG provides a lock-free parallel strategy for the original sequential SVRG (Johnson and Zhang 2013). Lemma 4.Vx,we have Compared with Hogwild!,AsySVRG includes the full gra- q(x)<2L2x-ut.oll2+21Vf(x)2. dient to get a variance reduced stochastic gradient,which Proof. has been proved to have linear convergence rate on strong- ly convex problems (Zhao and Li 2016).In this section,we will prove that AsySVRG is also convergent for non-convex q(x)= ∑Ii网-io)+a,olr problems,and has faster convergence rate than Hogwild!on non-convex problems. 2∑INfo-Vjiu:o)+vfuo)-fxP n i=1 Algorithm 2 AsySVRG Initialization:p threads,initialize wo,n: +27f(x)I2 fort =0.1.2....T-1 do uo W; ≤2∑INfx-Vu:oP+2IfxP n All threads parallelly compute the full gradient Vf(uo)= 2=1 a∑17f(uo)方 ≤2L21x-u.ol2+27fx)2 u=Wt; For each thread,do: 0 forj=0to M-1 do Read current value of u,denoted as u,from the shared According to Lemma 4,we can find that AsySVRG is memory.And randomly pick up an i from 1,...,n; a variance reduction method for non-convex problems,be- Compute the update vector:=Vfi(u)-Vfi(uo)+ cause when ut.m,ut.o get close to some stationary point, Vf(uo); g(ut,m)gets close to 0.And hence we do not need the u←←u-V时 bounded gradient assumption for the convergence proof. end for Since ut.m ut.m,the difficulty of convergence analy- Take w+1 to be the current value of u in the shared memory; sis lies in the gap between ut.m and ut.m.and the relation end for between q(it.m)and q(ut.m). Similar to the analysis in the last section,we construct an Lemma 5.In AsySVRG,we have Eq(ut.m)<pEq(ft.m+1) equivalent write sequence {ut.m}for the tth outer-loop: if we choose p and n to satisfy that 1 ut,0 Wt, 1-7-9n+1L2p*+1-D≤p. p-1 ut.m+1 ut.m -nBt.mVt.m; (5 Lemma 6.With the condition about p,n in Lemma 5,we where vt.m =Vfit.(ut.m)-Vfit.(ut.o)+Vf(ut.). have Bi.m is a diagonal matrix whose diagonal entries are 0 or 1. And ut.m is read by the thread who computes vt.m.It has EIwm-nP≤p0-DEdt.m.)同 the following format: p-1 Lemma 7.With the condition about p,n in Lemma 5.we have Eq(ut.m)<pEq(ut,m). it,m=ut,a(m)-刃 P(t) m.j-a(m)Vt j; Combining Lemma 6 and Lemma 7,we can directly ob- 1=a(m】 tain: where P is a diagonal matrix whose diagonal en llfm-mPsP-DB4lum小. tries are 0 or 1.Note that according to (5).ut.M=Wt+1 p-1
If we take the stepsize η = q A T B˜ , we get 1 T˜ T X˜−1 t=0 Ek∇f(wt)k 2 ≤ r AB T˜ . Hence, our theoretical result shows that Hogwild! with lock-free strategy gets a convergence rate of O(1/ p T˜) for non-convex problems, where T˜ = p×T is the total iteration number of p threads. AsySVRG for Non-Convex Problems The AsySVRG method (Zhao and Li 2016) is listed in Algorithm 2. AsySVRG provides a lock-free parallel strategy for the original sequential SVRG (Johnson and Zhang 2013). Compared with Hogwild!, AsySVRG includes the full gradient to get a variance reduced stochastic gradient, which has been proved to have linear convergence rate on strongly convex problems (Zhao and Li 2016). In this section, we will prove that AsySVRG is also convergent for non-convex problems, and has faster convergence rate than Hogwild! on non-convex problems. Algorithm 2 AsySVRG Initialization: p threads, initialize w0, η; for t = 0, 1, 2, ...T − 1 do u0 = wt; All threads parallelly compute the full gradient ∇f(u0) = 1 n Pn i=1 ∇fi(u0); u = wt; For each thread, do: for j = 0 to M − 1 do Read current value of u, denoted as uˆ, from the shared memory. And randomly pick up an i from {1, . . . , n}; Compute the update vector: vˆ = ∇fi(uˆ) − ∇fi(u0) + ∇f(u0); u ← u − ηvˆ; end for Take wt+1 to be the current value of u in the shared memory; end for Similar to the analysis in the last section, we construct an equivalent write sequence {ut,m} for the t th outer-loop: ut,0 = wt, ut,m+1 = ut,m − ηBt,mvˆt,m, (5) where vˆt,m = ∇fit,m(uˆt,m) − ∇fit,m(ut,0) + ∇f(ut,0). Bt,m is a diagonal matrix whose diagonal entries are 0 or 1. And uˆt,m is read by the thread who computes vˆt,m. It has the following format: uˆt,m = ut,a(m) − η mX−1 j=a(m) P (t) m,j−a(m) vˆt,j , where P (t) m,j−a(m) is a diagonal matrix whose diagonal entries are 0 or 1. Note that according to (5), ut,M˜ = wt+1 since all the stochastic gradients have been written on w at the end of the t th outer-loop. Here, we also need the assumptions: 0 ≤ m − a(m) ≤ τ ; E[Bt,m|ut,m, uˆt,m] = B 0 with the minimum eigenvalue α > 0; Bt,m and it,m are independent. These assumptions are similar to those in the previous section. For convenience, let pi(x) = ∇fi(x) − ∇fi(ut,0) + ∇f(ut,0), and in this section, we denote q(x) = 1 n Xn i=1 kpi(x)k 2 . It easy to find that Eq(uˆt,m) = E[kvˆt,mk 2 ]. The difference between Hogwild! and AsySVRG is the stochastic gradient and we have the following Lemmas which lead to fast convergence rate of AsySVRG: Lemma 4. ∀x, we have q(x) ≤ 2L 2 kx − ut,0k 2 + 2k∇f(x)k 2 . Proof. q(x) = 1 n Xn i=1 k∇fi(x) − ∇fi(ut,0) + ∇f(ut,0)k 2 ≤ 2 n Xn i=1 k∇fi(x) − ∇fi(ut,0) + ∇f(ut,0) − ∇f(x)k 2 + 2k∇f(x)k 2 ≤ 2 n Xn i=1 k∇fi(x) − ∇fi(ut,0)k 2 + 2k∇f(x)k 2 ≤2L 2 kx − ut,0k 2 + 2k∇f(x)k 2 . According to Lemma 4, we can find that AsySVRG is a variance reduction method for non-convex problems, because when uˆt,m, ut,0 get close to some stationary point, q(uˆt,m) gets close to 0. And hence we do not need the bounded gradient assumption for the convergence proof. Since ut,m 6= uˆt,m, the difficulty of convergence analysis lies in the gap between ut,m and uˆt,m, and the relation between q(uˆt,m) and q(ut,m). Lemma 5. In AsySVRG, we have Eq(uˆt,m) < ρEq(uˆt,m+1) if we choose ρ and η to satisfy that 1 1 − η − 9η(τ+1)L2(ρτ+1−1) ρ−1 ≤ ρ. Lemma 6. With the condition about ρ, η in Lemma 5, we have Ekut,m − uˆt,mk 2 ≤ 4η 2 τ ρ(ρ τ − 1) ρ − 1 Eq(uˆt,m). (6) Lemma 7. With the condition about ρ, η in Lemma 5, we have Eq(uˆt,m) < ρEq(ut,m). Combining Lemma 6 and Lemma 7, we can directly obtain: E kuˆt,m − ut,mk 2 ≤ 4η 2 τ ρ2 (ρ τ − 1) ρ − 1 Eq(ut,m). (7)
Theorem 2.We define cm =cm+1(1+Bn)+2L2n2hm+1. we have hm =((Cmp)with co.80. p-1. ERt.m+1 Furthermore,we choose co,n,B such that y=min- 2eH里-2n2hm+1>0 and cx=0,where M=M×n =Ef(ui.m+1)+cm+illut.m+1-ut.oll2 Then we have fu.a)-(受-2 i2)(w. ∑∑IVm)sEJ(wo)-Efw到 1 T-1M-1 (m TMy t0n=0 +cm+1(1+Bn)Ellut.m-ut.oll2 Proof.In the tth outer-loop,similar to (Reddi et al.2016), +(cm+1+2)E到m2 we define Rt.m as follows Rt.m f(ut.m)+Cmllut.m-ut.ol2. go)-(受-2 /(wP ++g9,- p-1 Then VB>0, +cm+1(1+Bn)Ellut.m -ut.oll2 Ellut.m+1-ut.olu.m,it.m] +(cm+1+2),m, SEllut.m+1-ut.ml2+lut.m -ut.oll2 where the last inequality uses equation(7). -2n(EBtmvt.m)T(ut.m -ut.o) For convenience,we use hm =( ≤72El.m2+((1+3nl训lu,m-.,ol2 2学)ro2e-1+p(cm+号).Since El,ml3= 0-1 +.m) Eq(ut.m)<pEq(ut,m),we have ≤72E,m2+(1+Bn训lut,m-t,ol2 ERt.m+1 +7vwml+Ivaa)-Vu fuam)-(受-2)lVf(w.)P ≤n2El,ml2+(1+Bn)l训ut,m-u些o2 +Cm+1(1+Bn)Ellut.m-u.oll2 +n2hm+iEq(ut.m) +uar ≤Ef(ut,m) +[cm+1(1+Bm)+2L272hm+1]Elu4,m-u此.0l2 +L2llut.m ut.m2), (8) -(受-2-2r+:)EjV/(w)F. where the second inequality uses the fact 2ab Ba2+b2. where the second inequality uses Lemma 4.Then we can Since the objective function is L-smooth,we have obtain: E[f(ut.m+1)ut.m;tt.m] _2m+1-2n2hm+1)EI7f(unm)2 2 -nE[Vf(ut.m)TBt.mVf(ft.m)lut.m,tt.m] ≤ERm-ERm+1, +fue)+genunal where cm cm+1(1+Bn)+2L2nhm+1. We set co >0.It is easy to see that cm >Cm+1.We can =f(ut.m)-nVf(ut.m)TBVf(it.m) choose co,n,B to make c =0.Then we have: +字w. M ∑Vf(u,m)川2 ≤fum)-受I7fu.n)I2 m=0 tmem ≤Ao-E立-Efw)-Efw4 Y 2 which is equivalent to nm小 (9) Ta∑∑Vf0 EI(wo)-Ew】 1 T-1M-1 where the first equality uses the independence of Bt.m,it.m. t=0m=0 TMy the second inequality uses Lemma 1.Combining(8)and (9)
Theorem 2. We define cm = cm+1(1 + βη) + 2L 2η 2hm+1, hm = ( ηL2 2 + 2cmη β ) 4τρ2 (ρ τ −1) ρ−1 +(cmρ+ Lρ 2 ) with c0, β > 0. Furthermore, we choose c0, η, β such that γ = min αη 2 − 2cm+1η β − 2η 2hm+1 > 0 and cM˜ = 0, where M˜ = M × p. Then we have 1 TM˜ T X−1 t=0 M X˜ −1 m=0 Ek∇f(ut,m)k 2 ≤ Ef(w0) − Ef(wT ) TM γ ˜ . Proof. In the t th outer-loop, similar to (Reddi et al. 2016), we define Rt,m as follows Rt,m = f(ut,m) + cmkut,m − ut,0k 2 . Then ∀β > 0, E[kut,m+1 − ut,0k 2 |ut,m, uˆt,m] ≤Ekut,m+1 − ut,mk 2 + kut,m − ut,0k 2 − 2η(EBt,mvˆt,m) T (ut,m − ut,0) ≤η 2Ekvˆt,mk 2 + (1 + βη)kut,m − ut,0k 2 + η β k∇f(uˆt,m)k 2 ≤η 2Ekvˆt,mk 2 + (1 + βη)kut,m − ut,0k 2 + 2η β (k∇f(ut,m)k + k∇f(uˆt,m) − ∇f(ut,m)k 2 ) ≤η 2Ekvˆt,mk 2 + (1 + βη)kut,m − ut,0k 2 + 2η β (k∇f(ut,m)k 2 + L 2 kuˆt,m − ut,mk 2 ), (8) where the second inequality uses the fact 2ab ≤ βa2 + 1 β b 2 . Since the objective function is L-smooth, we have E[f(ut,m+1)|ut,m, uˆt,m] ≤ − ηE[∇f(ut,m) T Bt,m∇fit,m(uˆt,m)|ut,m, uˆt,m] + f(ut,m) + Lη2 2 E[kvˆt,mk 2 |ut,m, uˆt,m] =f(ut,m) − η∇f(ut,m) T B∇f(uˆt,m) + Lη2 2 E[kvˆt,mk 2 |ut,m, uˆt,m] ≤f(ut,m) − αη 2 k∇f(ut,m)k 2 + ηL2 2 kut,m − uˆt,mk 2 + Lη2 2 E[kvˆt,mk 2 |ut,m, uˆt,m], (9) where the first equality uses the independence of Bt,m, it,m, the second inequality uses Lemma 1. Combining (8) and (9), we have ERt,m+1 =Ef(ut,m+1) + cm+1kut,m+1 − ut,0k 2 ≤Ef(ut,m) − ( αη 2 − 2cm+1η β )Ek∇f(ut,m)k 2 + (ηL2 2 + 2cm+1η β )Ekut,m − uˆt,mk 2 + cm+1(1 + βη)Ekut,m − ut,0k 2 + η 2 (cm+1 + L 2 )Ekvˆt,mk 2 ≤Ef(ut,m) − ( αη 2 − 2cm+1η β )Ek∇f(ut,m)k 2 + (ηL2 2 + 2cm+1η β ) 4τη2ρ 2 (ρ τ − 1) ρ − 1 Eq(ut,m) + cm+1(1 + βη)Ekut,m − ut,0k 2 + η 2 (cm+1 + L 2 )Ekvˆt,mk 2 , where the last inequality uses equation (7). For convenience, we use hm = ( ηL2 2 + 2cmη β ) 4τρ2 (ρ τ −1) ρ−1 + ρ(cm + L 2 ). Since E[kvˆt,mk 2 ] = Eq(uˆt,m) ≤ ρEq(ut,m), we have ERt,m+1 ≤Ef(ut,m) − ( αη 2 − 2cm+1η β )Ek∇f(ut,m)k 2 + cm+1(1 + βη)Ekut,m − ut,0k 2 + η 2hm+1Eq(ut,m) ≤Ef(ut,m) + [cm+1(1 + βη) + 2L 2 η 2hm+1]Ekut,m − ut,0k 2 − ( αη 2 − 2cm+1η β − 2η 2hm+1)Ek∇f(ut,m)k 2 , where the second inequality uses Lemma 4. Then we can obtain: ( αη 2 − 2cm+1η β − 2η 2hm+1)Ek∇f(um)k 2 ≤ ERm − ERm+1, where cm = cm+1(1 + βη) + 2L 2η 2hm+1. We set c0 > 0. It is easy to see that cm > cm+1. We can choose c0, η, β to make cM˜ = 0. Then we have: M X˜ −1 m=0 Ek∇f(ut,m)k 2 ≤ ER0 − ERM˜ γ = Ef(wt) − Ef(wt+1) γ , which is equivalent to 1 TM˜ T X−1 t=0 M X˜ −1 m=0 Ek∇f(ut,m)k 2 ≤ Ef(w0) − Ef(wT ) TM γ ˜