SCIENCE CHINA Information Sciences, March 2021, Vol. 64, 132103:1–132103:13
https://doi.org/10.1007/s11432-020-3023-7

RESEARCH PAPER

On the convergence and improvement of stochastic normalized gradient descent

Shen-Yi ZHAO, Yin-Peng XIE & Wu-Jun LI*

National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, Nanjing 210023, China
* Corresponding author (email: liwujun@nju.edu.cn)

Received 20 March 2020 / Revised 6 May 2020 / Accepted 3 June 2020 / Published online 8 February 2021

Abstract Non-convex models, like deep neural networks, have been widely used in machine learning applications. Training non-convex models is a difficult task owing to the saddle points of the models. Recently, stochastic normalized gradient descent (SNGD), which updates the model parameter by a normalized gradient in each iteration, has attracted much attention. Existing results show that SNGD can achieve better performance on escaping saddle points than classical training methods like stochastic gradient descent (SGD). However, none of the existing studies has provided a theoretical proof of the convergence of SNGD for non-convex problems. In this paper, we first prove the convergence of SNGD for non-convex problems. In particular, we prove that SNGD can achieve the same computation complexity as SGD. In addition, based on our convergence proof of SNGD, we find that SNGD needs to adopt a small constant learning rate to guarantee convergence. As a result, SNGD does not perform well when training large non-convex models in practice. Hence, we propose a new method, called stagewise SNGD (S-SNGD), to improve the performance of SNGD. Different from SNGD, in which a small constant learning rate is necessary for the convergence guarantee, S-SNGD can adopt a large initial learning rate and reduce the learning rate by stage. The convergence of S-SNGD can also be theoretically proved for non-convex problems. Empirical results on deep neural networks show that S-SNGD achieves better performance than SNGD in terms of both training loss and test accuracy.

Keywords non-convex problems, stochastic normalized gradient descent, computation complexity

Citation Zhao S-Y, Xie Y-P, Li W-J. On the convergence and improvement of stochastic normalized gradient descent. Sci China Inf Sci, 2021, 64(3): 132103, https://doi.org/10.1007/s11432-020-3023-7

1 Introduction

Many machine learning models can be formulated as the following optimization problem:

$$\min_{w \in \mathbb{R}^d} F(w) := \mathbb{E}[f(w; a)], \tag{1}$$

where $w$ denotes the model parameter, $a$ denotes a stochastic sample, and $f(w; a)$ denotes the loss function. When $a$ is uniformly sampled from a finite set $\{a_1, a_2, \ldots, a_n\}$, $F(w)$ can also be formulated as $\frac{1}{n}\sum_{i=1}^{n} f(w; a_i)$, where $n$ is the number of samples. The formulation in (1) contains a broad family of machine learning models, such as logistic regression and deep learning models.

Stochastic gradient descent (SGD) [1, 2] has been one of the most efficient optimization tools for solving (1). In the $t$-th iteration, SGD randomly selects some samples $I_t$ and calculates a mini-batch gradient to update the model parameter:

$$w_{t+1} = w_t - \alpha g_t, \tag{2}$$

where $g_t = \frac{1}{|I_t|}\sum_{a \in I_t} \nabla f(w_t; a)$ is a mini-batch gradient and $\alpha > 0$ is the learning rate. Compared to traditional batch training methods like gradient descent (GD), SGD does not need to calculate the full gradient $\nabla F(w_t)$, which often costs much computation time.
Hence, SGD is more popular and has been widely used in machine learning applications.
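To make the update in (2) concrete, below is a minimal NumPy sketch of one mini-batch SGD step on a toy least-squares loss. The function names (sgd_step, grad_f) and the toy data are illustrative assumptions of ours, not part of the paper.

```python
import numpy as np

def sgd_step(w, data, grad_f, alpha, b, rng):
    """One SGD iteration (2): w_{t+1} = w_t - alpha * g_t."""
    batch = data[rng.choice(len(data), size=b, replace=False)]  # sample I_t
    g = np.mean([grad_f(w, a) for a in batch], axis=0)          # g_t = (1/|I_t|) sum of per-sample gradients
    return w - alpha * g

# Toy usage: least-squares loss f(w; (x, y)) = 0.5 * (x^T w - y)^2.
rng = np.random.default_rng(0)
X, y = rng.standard_normal((1000, 5)), rng.standard_normal(1000)
data = np.hstack([X, y[:, None]])                    # each row is one sample a = (x, y)
grad_f = lambda w, a: (a[:-1] @ w - a[-1]) * a[:-1]  # per-sample gradient of f(w; a)
w = np.zeros(5)
for t in range(200):
    w = sgd_step(w, data, grad_f, alpha=0.05, b=32, rng=rng)
```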
Many variants of SGD have been proposed [3–5], and there are also many studies that analyze the convergence of SGD [6–8].

With the rapid growth of data, more and more large non-convex models are adopted in machine learning applications to improve the generalization ability. One typical example is the deep neural network with multiple layers, which has achieved success in many areas like image classification [9]. However, training non-convex models is a difficult task, because there may be many saddle points and sharp local minima in non-convex models [10, 11], especially in large models. When $w_t$ falls into the region of a saddle point, the gradient is small. When $w_t$ falls into the region of a sharp local minimum, the gradient is relatively large. Both cases can make the update in (2) inefficient. When $w_t$ falls into the region of a flat minimum, it usually leads to a small generalization error [11]. Figure 1 illustrates a saddle point, a sharp local minimum, and a flat minimum. Usually, we need to carefully choose the initialization [12] to escape the saddle points and sharp local minima.

[Figure 1 (Color online) Smooth loss curve, annotated with a saddle point, a sharp minimum, and a flat minimum. The flat minimum usually leads to a small generalization error [11]. In the region of the saddle point, the gradient is small; in the region of the sharp minimum, the gradient is large. Intuitively, because $\|w_{t+1} - w_t\| = \alpha$ in SNGD, with a suitable $\alpha$ the normalized gradient can yield a faster escape from saddle points and sharp minima than the unnormalized gradient.]

Recently, stochastic normalized gradient descent (SNGD) [13–16] has attracted much attention for solving non-convex problems. SNGD is the stochastic version of normalized gradient descent (NGD) [17, 18]. In each iteration, SNGD adopts a normalized gradient to update the model parameter. In particular, SNGD can be written as

$$w_{t+1} = w_t - \alpha \frac{g_t}{\|g_t\|}, \tag{3}$$

where $g_t = \frac{1}{|I_t|}\sum_{a \in I_t} \nabla f(w_t; a)$, $\alpha > 0$ is the learning rate, and $\|\cdot\|$ is the Euclidean norm. It is easy to see that $\|w_{t+1} - w_t\| = \alpha$, no matter whether the gradient is large or small (a code sketch of this update is given after the contribution list below). Hence, SNGD has better performance than SGD when $w_t$ is around saddle points and sharp local minima. Theoretical results show that the normalized gradient can yield a faster escape from saddle points than SGD [14]. Although the convergence of SNGD in (3) for strictly-locally-quasi-convex problems has been theoretically proved in [13], none of the existing studies has provided a theoretical proof of the convergence of SNGD in (3) for non-convex problems. Furthermore, we find that SNGD does not perform well on training large non-convex models in practice, owing to the requirement of a small constant learning rate.

In this paper, we study the convergence and improvement of SNGD. The main contributions of this paper are outlined as follows.

• We theoretically prove the convergence of SNGD for non-convex problems. In particular, we prove that SNGD can achieve the same computation complexity (total number of gradient computations) as SGD for non-convex problems.

• We propose a new method, called S-SNGD, to improve the performance of SNGD. Different from SNGD, which necessarily adopts a small constant learning rate for the convergence guarantee, S-SNGD can adopt a large initial learning rate and reduces the learning rate by stage.
The convergence of S-SNGD can also be theoretically proved for non-convex problems.

• Empirical results on deep neural networks show that S-SNGD achieves better performance than SNGD in terms of both training loss and test accuracy.
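As a concrete illustration of (3), the following sketch modifies the SGD step above so that the mini-batch gradient is normalized before the update. The small constant eps guarding against a zero-norm gradient is our own assumption, not part of the update rule in the paper.

```python
import numpy as np

def sngd_step(w, data, grad_f, alpha, b, rng, eps=1e-12):
    """One SNGD iteration (3): w_{t+1} = w_t - alpha * g_t / ||g_t||."""
    batch = data[rng.choice(len(data), size=b, replace=False)]  # sample I_t
    g = np.mean([grad_f(w, a) for a in batch], axis=0)          # mini-batch gradient g_t
    # After normalization, the step length is alpha regardless of whether
    # ||g_t|| is small (saddle point) or large (sharp minimum).
    return w - alpha * g / (np.linalg.norm(g) + eps)
```

Each step moves the parameter by the fixed Euclidean distance $\alpha$ (up to the eps guard), which is exactly the property $\|w_{t+1} - w_t\| = \alpha$ used throughout the analysis.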
2 Preliminary and related work

2.1 Preliminary

In this paper, we use $\|\cdot\|$ to denote the Euclidean norm. For a random variable $X$, $\mathbb{E}[X]$ denotes the expectation of $X$. For any two positive sequences $\{x_n\}$ and $\{y_n\}$, $y_n = O(x_n)$ denotes that there exist two positive constants $c$ and $N$ such that $y_n \leq c x_n$, $\forall n \geq N$. For any two positive integers $a$ and $b$, $\mathrm{mod}(a, b) = 0$ denotes that $a$ is divisible by $b$. For a function $\phi(w)$, $\nabla\phi(w)$ denotes the gradient of $\phi(w)$ at $w$ and $\nabla^2\phi(w)$ denotes the Hessian matrix of $\phi(w)$ at $w$.

We also give the following two definitions.

Definition 1. A function $\phi(w)$ is called an $L$-smooth function ($L > 0$) if $\forall u, w$,

$$\phi(u) \leq \phi(w) + \nabla\phi(w)^{\mathrm{T}}(u - w) + \frac{L}{2}\|u - w\|^2.$$

Definition 2. A function $\phi(w)$ is called an $(L_0, L_1)$-smooth function ($L_0 > 0$, $L_1 \geq 0$) if $\phi(w)$ is twice differentiable and $\forall w$,

$$\|\nabla^2\phi(w)\| \leq L_0 + L_1\|\nabla\phi(w)\|.$$

It is easy to find that if a function $\phi(w)$ is $(L_0, 0)$-smooth, it is also $L_0$-smooth. If a function $\phi(w)$ is twice differentiable and $L$-smooth, it is also $(L, 0)$-smooth.

In this paper, we adopt the norm of the gradient to measure convergence. In particular, we call $w$ an $\epsilon$-stationary point if $\|\nabla F(w)\| \leq \epsilon$. The computation complexity of an algorithm is defined as the total number of gradient computations. The iteration complexity of an algorithm is defined as the total number of model parameter updates.

2.2 Related work

NGD [17, 18] and its variants have attracted much attention for solving non-convex problems. In [14], the authors proposed Saddle-NGD, which can be formulated as

$$w_{t+1} = \begin{cases} w_t - \alpha \dfrac{\nabla F(w_t)}{\|\nabla F(w_t)\|}, & \text{if } \mathrm{mod}(t, q) \neq 0, \\[2mm] w_t - \alpha \dfrac{\nabla F(w_t)}{\|\nabla F(w_t)\|} + \xi_t, & \text{if } \mathrm{mod}(t, q) = 0, \end{cases} \tag{4}$$

where $q > 0$ is a constant integer and $\xi_t$ is a Gaussian noise. Saddle-NGD is a variant of NGD which adds noise perturbations. The work in [10] shows that Saddle-NGD achieves better iteration complexity (total number of model updates) than Noisy-GD [19].

The first work that analyzes the convergence of SNGD in (3) appears in [13]. In particular, it proves that for strictly-locally-quasi-convex problems, SNGD needs $O(1/\epsilon^2)$ iterations to achieve the result $F(\bar{w}) - F(w^*) \leq \epsilon$, where $\bar{w}$ is the output of SNGD and $w^*$ is the optimal solution of (1). Hence, the proved computation complexity to achieve $F(\bar{w}) - F(w^*) \leq \epsilon$ is $O(1/\epsilon^4)$, owing to a batch size of $O(1/\epsilon^2)$.

Recently, in [15, 16], the authors proposed some mixture methods which combine SGD and SNGD. In particular, their formulations can universally be written as

$$w_{t+1} = w_t - \eta_t g_t, \tag{5}$$

where $g_t$ is some stochastic gradient and $\eta_t = \min\{\alpha/\|g_t\|, \beta\}$, $\alpha > 0$, $\beta > 0$. We can observe that when $\|g_t\|$ is larger than $\alpha/\beta$, the formulation in (5) is the same as that of SNGD. Otherwise, the formulation in (5) is the same as that of SGD. In [15], the authors proposed a method called SPIDER-SFO, which benefits from the stochastic path-integrated differential estimator defined as

$$g_t = \begin{cases} \dfrac{1}{|I_t|}\displaystyle\sum_{a \in I_t} [\nabla f(w_t; a) - \nabla f(w_{t-1}; a)] + g_{t-1}, & \text{if } \mathrm{mod}(t, q) \neq 0, \\[2mm] \dfrac{1}{|I_t|}\displaystyle\sum_{a \in I_t} \nabla f(w_t; a), & \text{if } \mathrm{mod}(t, q) = 0. \end{cases}$$

SPIDER-SFO achieves a computation complexity of $O(1/\epsilon^3)$ under the criterion $\|\nabla f(\bar{w})\| \leq \epsilon$, which is the best convergence result among existing first-order stochastic methods. However, whether $g_t/\|g_t\|$ can be adopted for small $\|g_t\|$, and the corresponding computation complexity, are unknown. Furthermore, owing to the requirement of a small constant learning rate $\alpha$, we find that these mixture methods are not efficient in practice.
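To illustrate how the mixture step (5) interacts with the path-integrated estimator, here is a minimal sketch in the same NumPy style as above. The loop structure, the restart mini-batch, and all hyperparameter names are illustrative assumptions of ours, not the exact SPIDER-SFO algorithm of [15].

```python
import numpy as np

def mixture_spider_sketch(w0, data, grad_f, T, q, b, alpha, beta, rng):
    """Sketch of the clipped update (5) driven by a SPIDER-style estimator."""
    w, w_prev, g = w0.copy(), None, None
    for t in range(T):
        batch = data[rng.choice(len(data), size=b, replace=False)]
        if t % q == 0:
            # mod(t, q) = 0: restart with a plain mini-batch gradient.
            g = np.mean([grad_f(w, a) for a in batch], axis=0)
        else:
            # mod(t, q) != 0: path-integrated correction, g_t = diff + g_{t-1}.
            g = g + np.mean([grad_f(w, a) - grad_f(w_prev, a) for a in batch],
                            axis=0)
        # Step size in (5): SNGD-like when ||g_t|| > alpha/beta, SGD-like otherwise.
        eta = min(alpha / (np.linalg.norm(g) + 1e-12), beta)
        w_prev, w = w, w - eta * g
    return w
```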
3 Convergence of SNGD

We briefly present SNGD [13] in Algorithm 1. We can observe that SNGD is as simple as SGD. In each iteration, SNGD only needs to calculate a normalized mini-batch gradient to update the model parameter with a constant learning rate $\alpha$. Different from SGD, in which the stochastic gradient is an unbiased estimate of $\nabla F(w_t)$, the stochastic gradient in SNGD is biased. That means

$$\mathbb{E}\left[\left.\frac{g_t}{\|g_t\|}\,\right|\, w_t\right] \neq \nabla F(w_t).$$

Hence, the update direction is not necessarily the direction of $-\nabla F(w_t)$ in expectation, and this makes the convergence analysis difficult.

Algorithm 1 SNGD
1: Initialization: $w_0$, $\alpha$, $b$, $T$;
2: for $t = 0, 1, \ldots, T-1$ do
3:   Randomly choose $b$ samples, denoted by $I_t$;
4:   Calculate a mini-batch gradient $g_t = \frac{1}{b}\sum_{a \in I_t} \nabla f(w_t; a)$;
5:   $w_{t+1} = w_t - \alpha \frac{g_t}{\|g_t\|}$;
6: end for
7: Return $\bar{w}$, which is randomly chosen from $\{w_0, w_1, \ldots, w_T\}$.

First, we make the following assumption, which is common in stochastic optimization.

Assumption 1. $F(w) \geq 0$ and the stochastic gradient of $F(w)$ has a $\sigma$-bounded variance ($\sigma > 0$), i.e., $\mathbb{E}\|\nabla f(w; a) - \nabla F(w)\|^2 \leq \sigma^2$, $\forall w$.

If $F(w)$ is $L$-smooth, we have the following lemma, which captures a general property of normalized gradient descent.

Lemma 1. Assume $F(w)$ is $L$-smooth ($L > 0$). For any $w, g \in \mathbb{R}^d$, we define $w_+$ as follows:

$$w_+ = w - \alpha \frac{g}{\|g\|}, \tag{6}$$

where $\alpha > 0$. Then we have

$$\|\nabla F(w)\| \leq \frac{F(w) - F(w_+)}{\alpha} + \frac{L\alpha}{2} + 2\|\nabla F(w) - g\|.$$

Proof. Since $F(w)$ is $L$-smooth, we obtain

$$\begin{aligned} F(w_+) &\leq F(w) - \alpha \nabla F(w)^{\mathrm{T}} g/\|g\| + \frac{L\alpha^2}{2} \\ &= F(w) - \alpha(\nabla F(w) - g)^{\mathrm{T}} g/\|g\| - \alpha\|g\| + \frac{L\alpha^2}{2} \\ &\leq F(w) + \alpha\|\nabla F(w) - g\| - \alpha\|g\| + \frac{L\alpha^2}{2}. \end{aligned}$$

Combining this with the fact that $\|\nabla F(w)\| \leq \|g\| + \|\nabla F(w) - g\|$, we obtain

$$\|\nabla F(w)\| \leq \frac{F(w) - F(w_+)}{\alpha} + \frac{L\alpha}{2} + 2\|\nabla F(w) - g\|.$$
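Since Lemma 1 holds deterministically for any $g$, it can be sanity-checked numerically. The quadratic objective below is an arbitrary test function of our own choosing; the lemma itself places no restriction on how $g$ is generated.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10
M = rng.standard_normal((d, d))
A = M.T @ M                              # PSD Hessian, so F(w) = 0.5 w^T A w >= 0
L = float(np.linalg.eigvalsh(A).max())   # smoothness constant of F

F = lambda w: 0.5 * w @ A @ w
grad_F = lambda w: A @ w

alpha = 0.1
for _ in range(1000):
    w = rng.standard_normal(d)
    g = grad_F(w) + rng.standard_normal(d)      # arbitrary (noisy) direction g
    w_plus = w - alpha * g / np.linalg.norm(g)  # update (6)
    lhs = np.linalg.norm(grad_F(w))
    rhs = (F(w) - F(w_plus)) / alpha + L * alpha / 2 \
          + 2 * np.linalg.norm(grad_F(w) - g)
    assert lhs <= rhs + 1e-9                    # the bound of Lemma 1
print("Lemma 1 bound held at all sampled points")
```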
According to Lemma 1, we can set $w$, $w_+$, $g$ in (6) to be $w_t$, $w_{t+1}$, $g_t$, respectively. Then we can obtain the following convergence result.

Theorem 1. Assume $F(w)$ is $L$-smooth ($L > 0$). Let $\{w_t\}$ be the sequence produced by Algorithm 1 with batch size $b = \sigma^2/\epsilon^2$. Then we have

$$\frac{1}{T+1}\sum_{t=0}^{T} \mathbb{E}\|\nabla F(w_t)\| \leq \frac{F(w_1)}{\alpha(T+1)} + \frac{L\alpha}{2} + 2\epsilon. \tag{7}$$

Furthermore, let $\bar{w}$ denote the output of Algorithm 1. By setting $\alpha = 2\epsilon/L$ and $T = LF(w_1)/(2\epsilon^2)$, we obtain $\mathbb{E}\|\nabla F(\bar{w})\| \leq 4\epsilon$. Hence, the iteration complexity and computation complexity to achieve an $\epsilon$-stationary point of SNGD are $O(1/\epsilon^2)$ and $O(1/\epsilon^4)$, respectively.

Proof. According to Lemma 1, we obtain

$$\begin{aligned} \mathbb{E}\|\nabla F(w_t)\| &\leq \frac{\mathbb{E}[F(w_t) - F(w_{t+1})]}{\alpha} + \frac{L\alpha}{2} + 2\,\mathbb{E}\|\nabla F(w_t) - g_t\| \\ &\leq \frac{\mathbb{E}[F(w_t) - F(w_{t+1})]}{\alpha} + \frac{L\alpha}{2} + 2\sqrt{\mathbb{E}\|\nabla F(w_t) - g_t\|^2} \\ &\leq \frac{\mathbb{E}[F(w_t) - F(w_{t+1})]}{\alpha} + \frac{L\alpha}{2} + 2\sqrt{\frac{\sigma^2}{b}} \\ &= \frac{\mathbb{E}[F(w_t) - F(w_{t+1})]}{\alpha} + \frac{L\alpha}{2} + 2\epsilon, \end{aligned} \tag{8}$$

where the second inequality uses the fact that $(\mathbb{E}[x])^2 \leq \mathbb{E}[x^2]$ and the third inequality uses the fact that $\mathbb{E}\|\nabla F(w_t) - g_t\|^2 = \mathbb{E}\|\nabla F(w_t) - \nabla f(w_t; a)\|^2/b \leq \sigma^2/b$. Summing up (8) from $t = 0$ to $T$, we obtain

$$\frac{1}{T+1}\sum_{t=0}^{T} \mathbb{E}\|\nabla F(w_t)\| \leq \frac{F(w_1)}{\alpha(T+1)} + \frac{L\alpha}{2} + 2\epsilon.$$

By setting $\alpha = 2\epsilon/L$ and $T = LF(w_1)/(2\epsilon^2)$, we obtain $\mathbb{E}\|\nabla F(\bar{w})\| \leq 4\epsilon$. Hence, the iteration complexity is $T = O(1/\epsilon^2)$ and the computation complexity is $Tb = O(1/\epsilon^4)$.

The result in Theorem 1 implies that SNGD achieves the same computation complexity as SGD for smooth non-convex problems [6]. Furthermore, when $F(w)$ is $L$-smooth and $F(w) \geq 0$, we can obtain that $\forall w$,

$$F(w) - F(w^*) \geq \frac{1}{2L}\|\nabla F(w)\|^2,$$

where $w^* \in \arg\min_w F(w)$. If $F(w)$ further satisfies the strictly-locally-quasi-convex property [13], then the computation complexity to achieve an $\epsilon$-stationary point in [13] is actually $O(1/\epsilon^8)$, which is much higher than $O(1/\epsilon^4)$ for small $\epsilon$. Hence, our convergence result for SNGD is better than that in [13] for strictly-locally-quasi-convex problems.

If $F(w)$ is $(L_0, L_1)$-smooth, we can get the following result for normalized gradient descent.

Lemma 2. Assume $F(w)$ is $(L_0, L_1)$-smooth ($L_0 > 0$, $L_1 \geq 0$). For any $w, g \in \mathbb{R}^d$, we define $w_+$ as follows:

$$w_+ = w - \alpha \frac{g}{\|g\|}, \tag{9}$$

where $\alpha > 0$. Then we have

$$\left(1 - \frac{\alpha L_1}{2}e^{\alpha L_1}\right)\|\nabla F(w)\| \leq \frac{F(w) - F(w_+)}{\alpha} + (1 + \alpha L_1 e^{\alpha L_1})\frac{L_0\alpha}{2} + 2\|\nabla F(w) - g\|.$$

Proof. Let $r(t) = t(w_+ - w) + w$, $t \in [0, 1]$, and $p(t) = \|\nabla F(r(t))\|$. Since $\|w_+ - w\| = \alpha$, we have

$$p(t) = \|\nabla F(r(t))\| = \left\|\int_0^t \nabla^2 F(r(c))\, r'(c)\, \mathrm{d}c + \nabla F(r(0))\right\|$$