Journal of Machine Learning Research 20 (2019) 1-41    Submitted 12/17; Revised 8/18; Published 2/19

Robust Frequent Directions with Application in Online Learning

Luo Luo    RICKY@SJTU.EDU.CN
Cheng Chen    JACK_CHEN1990@SJTU.EDU.CN
Department of Computer Science and Engineering
Shanghai Jiao Tong University
800 Dongchuan Road, Shanghai, China 200240

Zhihua Zhang*    ZHZHANG@MATH.PKU.EDU.CN
National Engineering Lab for Big Data Analysis and Applications
School of Mathematical Sciences
Peking University
5 Yiheyuan Road, Beijing, China 100871

Wu-Jun Li    LIWUJUN@NJU.EDU.CN
National Key Laboratory for Novel Software Technology
Collaborative Innovation Center of Novel Software Technology and Industrialization
Department of Computer Science and Technology
Nanjing University
163 Xianlin Avenue, Nanjing, China 210023

Tong Zhang    TONGZHANG@TONGZHANG-ML.ORG
Computer Science & Mathematics
Hong Kong University of Science and Technology
Hong Kong

*. Corresponding author.

Editor: Qiang Liu

(c) 2019 Luo Luo, Cheng Chen, Zhihua Zhang, Wu-Jun Li and Tong Zhang.
License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/. Attribution requirements are provided at http://jmlr.org/papers/v20/17-773.html.

Abstract

The frequent directions (FD) technique is a deterministic approach for online sketching that has many applications in machine learning. The conventional FD is a heuristic procedure that often outputs rank deficient matrices. To overcome the rank deficiency problem, we propose a new sketching strategy called robust frequent directions (RFD) by introducing a regularization term. RFD can be derived from an optimization problem. It updates the sketch matrix and the regularization term adaptively and jointly. RFD reduces the approximation error of FD without increasing the computational cost. We also apply RFD to online learning and propose an effective hyperparameter-free online Newton algorithm. We derive a regret bound for our online Newton algorithm based on RFD, which guarantees the robustness of the algorithm. The experimental studies demonstrate that the proposed method outperforms state-of-the-art second order online learning algorithms.

Keywords: Matrix approximation, sketching, frequent directions, online convex optimization, online Newton algorithm

1. Introduction

The sketching technique is a powerful tool to deal with large scale matrices (Ghashami et al., 2016; Halko et al., 2011; Woodruff, 2014), and it has been widely used to speed up machine learning
algorithms such as second order optimization algorithms (Erdogdu and Montanari, 2015; Luo et al., 2016; Pilanci and Wainwright, 2017; Roosta-Khorasani and Mahoney, 2016a,b; Xu et al., 2016; Ye et al., 2017). There exist several families of matrix sketching strategies, including sparsification, column sampling, random projection (Achlioptas, 2003; Indyk and Motwani, 1998; Kane and Nelson, 2014; Wang et al., 2016), and frequent directions (FD) (Desai et al., 2016; Ghashami et al., 2016; Huang, 2018; Liberty, 2013; Mroueh et al., 2017; Ye et al., 2016).

Sparsification techniques (Achlioptas and McSherry, 2007; Achlioptas et al., 2013; Arora et al., 2006; Drineas and Zouzias, 2011) generate a sparse version of the matrix by element-wise sampling, which makes matrix multiplication more efficient and requires less space. Column (row) sampling algorithms (Mahoney, 2011) include importance sampling (Drineas et al., 2006a,b; Frieze et al., 2004) and leverage score sampling (Drineas et al., 2012, 2008; Papailiopoulos et al., 2014). They define a sampling probability for each row (column) and select a subset according to these probabilities to construct the estimate. Random projection maps the rows (columns) of the matrix into a lower dimensional space by a projection matrix. The projection matrix can be constructed in various ways (Woodruff, 2014), such as Gaussian random projections (Johnson and Lindenstrauss, 1984; Sarlos, 2006), fast Johnson-Lindenstrauss transforms (Ailon and Chazelle, 2006; Ailon and Liberty, 2009, 2013; Kane and Nelson, 2014) and sparse random projections (Clarkson and Woodruff, 2013; Nelson and Nguyen, 2013). Frequent directions (Ghashami et al., 2016; Liberty, 2013) is a deterministic sketching algorithm that achieves an optimal tradeoff between approximation error and space.

In this paper we are especially concerned with FD sketching (Desai et al., 2016; Ghashami et al., 2016; Liberty, 2013), because it is a stable online sketching approach. FD considers matrix approximation in the streaming setting. In this case, the data is available in a sequential order and should be processed immediately. Typically, streaming algorithms can only use limited memory at any time. The FD algorithm extends the method of frequent items (Misra and Gries, 1982) to matrix approximation and has a tight approximation error bound. However, FD usually leads to a rank deficient approximation, which in turn makes its applications less robust. For example, Newton-type algorithms require a non-singular and well-conditioned approximate Hessian matrix, but FD sketching usually generates low-rank matrices. An intuitive and simple way to bridge this gap is to introduce a regularization term that enforces the matrix to be invertible (Luo et al., 2016; Roosta-Khorasani and Mahoney, 2016a,b). Typically, the regularization parameter is regarded as a hyperparameter and its choice is separable from the sketching procedure. Since the regularization parameter heavily affects the performance in practice, it should be chosen carefully.

To overcome the weakness of the FD algorithm, we propose a new sketching approach that we call robust frequent directions (RFD). Unlike conventional sketching methods which only approximate the matrix with a low-rank structure, RFD constructs the low-rank part and updates the regularization term simultaneously. In particular, the update procedure of RFD can be regarded as solving an optimization problem (see Theorem 2).
This method is different from the standard FD, giving rise to a tighter error bound.

Note that Zhang (2014) proposed matrix ridge approximation (MRA) to approximate a positive semi-definite matrix using an idea similar to RFD. There are two main differences between RFD and MRA. First, RFD is designed for the case that data samples come sequentially and memory is limited, while MRA has to access the whole data set. Second, MRA aims to minimize the approximation error with respect to the Frobenius norm, while RFD tries to minimize the spectral-norm approximation error. In general, the spectral norm error bound is more meaningful than the Frobenius norm error bound (Tropp, 2015).
In a recent study, Luo et al. (2016) proposed an FD-based sketched online Newton (SON) algorithm (FD-SON) to accelerate the standard online Newton algorithms. Owing to the shortcoming of FD, the performance of FD-SON is significantly affected by the choice of the hyperparameter. Naturally, we can leverage RFD to improve online Newton algorithms. Accordingly, we propose a sketched online Newton step based on RFD (RFD-SON). Different from conventional sketched Newton algorithms, RFD-SON is hyperparameter-free. Even if the regularization parameter is initially set to zero, RFD-SON adaptively increases the regularization term, and the approximate Hessian becomes well-conditioned after a few iterations. Moreover, we prove that RFD-SON has a more robust regret bound than FD-SON, and the experimental results also validate the better performance of RFD-SON.

The remainder of the paper is organized as follows. In Section 2 we present notation and preliminaries. In Section 3 we review the background of second order online learning and its sketched variants. In Sections 4 and 5 we propose our robust frequent directions (RFD) method and its applications in online learning, with some related theoretical analysis. In Section 6 we demonstrate empirical comparisons with baselines on several real-world data sets to show the superiority of our algorithms. Finally, we conclude our work in Section 7.

2. Notation and Preliminaries

We let $I_d$ denote the $d \times d$ identity matrix. For a matrix $A = [A_{ij}] \in \mathbb{R}^{n \times d}$ of rank $r$, where $r \le \min(n, d)$, we let the condensed singular value decomposition (SVD) of $A$ be $A = U \Sigma V^\top$, where $U \in \mathbb{R}^{n \times r}$ and $V \in \mathbb{R}^{d \times r}$ are column orthogonal and $\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_r)$ with $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ places the nonzero singular values on its diagonal entries.

We use $\sigma_{\max}(A)$ to denote the largest singular value and $\sigma_{\min}(A)$ to denote the smallest nonzero singular value. Thus, the condition number of $A$ is $\kappa(A) = \frac{\sigma_{\max}(A)}{\sigma_{\min}(A)}$. The pseudoinverse of $A$ is defined by $A^\dagger = V \Sigma^{-1} U^\top \in \mathbb{R}^{d \times n}$.

Additionally, we let $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2} = \sqrt{\sum_{i=1}^r \sigma_i^2}$ be the Frobenius norm and $\|A\|_2 = \sigma_{\max}(A)$ be the spectral norm. A matrix norm $|||\cdot|||$ is said to be unitarily invariant if $|||PAQ||| = |||A|||$ for any unitary matrices $P \in \mathbb{R}^{n \times n}$ and $Q \in \mathbb{R}^{d \times d}$. It is easy to verify that both the Frobenius norm and the spectral norm are unitarily invariant. We define $[A]_k = U_k \Sigma_k V_k^\top$ for $k \le r$, where $U_k \in \mathbb{R}^{n \times k}$ and $V_k \in \mathbb{R}^{d \times k}$ are the first $k$ columns of $U$ and $V$, and $\Sigma_k = \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_k) \in \mathbb{R}^{k \times k}$. Then $[A]_k$ is the best rank-$k$ approximation to $A$ in both the Frobenius and spectral norms, that is,
$$[A]_k = \mathop{\mathrm{argmin}}_{\mathrm{rank}(\widehat{A}) \le k} \|A - \widehat{A}\|_F = \mathop{\mathrm{argmin}}_{\mathrm{rank}(\widehat{A}) \le k} \|A - \widehat{A}\|_2.$$

Given a positive semidefinite matrix $H \in \mathbb{R}^{d \times d}$, the notation $\|x\|_H$ denotes the $H$-norm of a vector $x \in \mathbb{R}^d$, that is, $\|x\|_H = \sqrt{x^\top H x}$. If matrices $A$ and $B$ have the same size, we let $\langle A, B \rangle$ denote $\sum_{i,j} A_{ij} B_{ij}$.

2.1. Frequent Directions

We give a brief review of frequent directions (FD) (Ghashami et al., 2016; Liberty, 2013), because it is closely related to our proposed method. FD is a deterministic matrix sketching method in the row-update model. For any input matrix $A \in \mathbb{R}^{T \times d}$ whose rows come sequentially, it maintains a sketch matrix $B \in \mathbb{R}^{(m-1) \times d}$ with $m \ll T$ to approximate $A^\top A$ by $B^\top B$.
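As a concrete illustration of this notation, the following NumPy snippet (our own example, not code from the paper; the matrix sizes and the idealized sketch built from the true top singular directions are arbitrary choices) computes the condensed SVD, the best rank-$k$ approximation $[A]_k$, and the spectral-norm error $\|A^\top A - B^\top B\|_2$ by which a sketch $B$ is judged.

```python
import numpy as np

# Illustration of the notation in Section 2 (assumed sizes, not from the paper).
rng = np.random.default_rng(0)
T, d, k = 500, 50, 10
A = rng.standard_normal((T, k)) @ rng.standard_normal((k, d))  # nearly rank-k data
A += 0.01 * rng.standard_normal((T, d))                        # plus small noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)  # condensed SVD: A = U diag(s) V^T

# Best rank-k approximation [A]_k = U_k Sigma_k V_k^T
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
fro_err = np.linalg.norm(A - A_k, 'fro')   # ||A - [A]_k||_F
spec_err = np.linalg.norm(A - A_k, 2)      # ||A - [A]_k||_2 = sigma_{k+1}

# An idealized sketch B = Sigma_k V_k^T, and the error a sketching method controls.
B = s[:k, None] * Vt[:k, :]
sketch_err = np.linalg.norm(A.T @ A - B.T @ B, 2)
print(fro_err, spec_err, sketch_err)
```

Here the idealized sketch uses the true top singular directions of $A$; FD, reviewed next, maintains such a $B$ from a single pass over the rows without ever storing $A$.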
We present the detailed implementation of FD in Algorithm 1. The intuition behind FD is similar to that of frequent items. FD periodically shrinks orthogonal vectors by roughly the same amount (Line 5 of Algorithm 1). The shrinking step reduces the squared Frobenius norm of the sketch by a reasonable amount and guarantees that no direction is reduced too much.

Algorithm 1 Frequent Directions
1: Input: $A = [a^{(1)}, \dots, a^{(T)}]^\top \in \mathbb{R}^{T \times d}$, $B^{(m-1)} = [a^{(1)}, \dots, a^{(m-1)}]^\top$
2: for $t = m, \dots, T$ do
3:    $\widehat{B}^{(t-1)} = \begin{bmatrix} B^{(t-1)} \\ (a^{(t)})^\top \end{bmatrix}$
4:    Compute SVD: $\widehat{B}^{(t-1)} = U^{(t-1)} \Sigma^{(t-1)} (V^{(t-1)})^\top$
5:    $B^{(t)} = \sqrt{\big(\Sigma_{m-1}^{(t-1)}\big)^2 - \big(\sigma_m^{(t-1)}\big)^2 I_{m-1}} \cdot \big(V_{m-1}^{(t-1)}\big)^\top$
6: end for
7: Output: $B = B^{(T)}$

FD has the following error bound for any $k < m$:
$$\|A^\top A - B^\top B\|_2 \le \frac{1}{m-k} \|A - [A]_k\|_F^2. \qquad (1)$$
The above result means that the space complexity of FD is optimal regardless of streaming issues, because any algorithm satisfying $\|A^\top A - B^\top B\|_2 \le \|A - [A]_k\|_F^2/(m-k)$ requires $O(md)$ space to represent the matrix $B$ (Ghashami et al., 2016). The dominant computation of the algorithm is the SVD of $\widehat{B}^{(t-1)}$, which costs $O(m^2 d)$ with a standard SVD implementation. However, the total cost can be reduced from $O(Tm^2 d)$ to $O(Tmd)$ by doubling the space (Algorithm 4 in Appendix A) or by using the Gu-Eisenstat procedure (Gu and Eisenstat, 1993).

Desai et al. (2016) proposed some extensions of FD. More specifically, Parameterized FD (PFD) uses an extra hyperparameter to describe the proportion of singular values shrunk in each iteration. PFD improves the performance empirically, but has a worse error bound than FD by a constant factor. Compensative FD (CFD) modifies the output of FD by increasing the singular values and keeps the same error guarantees as FD.
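To make the procedure concrete, here is a minimal NumPy sketch of Algorithm 1 (our own illustration, not the authors' implementation; it uses the plain $O(m^2 d)$-per-row SVD rather than the doubled-space or Gu-Eisenstat variants, and assumes $m \le d$), together with an empirical check of the bound (1).

```python
import numpy as np

def frequent_directions(A: np.ndarray, m: int) -> np.ndarray:
    """Minimal sketch of Algorithm 1 (our own illustration, not the authors' code).

    Processes the rows of A one at a time and maintains a sketch B with m-1 rows
    such that B^T B approximates A^T A. Assumes m <= d so that the m-th singular
    value of the augmented sketch exists.
    """
    T, d = A.shape
    B = A[:m - 1].astype(float).copy()            # B^{(m-1)} = first m-1 rows
    for t in range(m - 1, T):
        B_hat = np.vstack([B, A[t][None, :]])     # append the new row a^{(t)}
        _, s, Vt = np.linalg.svd(B_hat, full_matrices=False)
        # Shrink every direction by the m-th squared singular value (Line 5)
        shrunk = np.sqrt(np.maximum(s[:m - 1] ** 2 - s[m - 1] ** 2, 0.0))
        B = shrunk[:, None] * Vt[:m - 1]
    return B

# Usage: compare the sketch error with the bound (1) for some k < m.
rng = np.random.default_rng(0)
A = rng.standard_normal((1000, 30))
m, k = 11, 5
B = frequent_directions(A, m)
err = np.linalg.norm(A.T @ A - B.T @ B, 2)
_, s, _ = np.linalg.svd(A, full_matrices=False)
bound = np.sum(s[k:] ** 2) / (m - k)              # ||A - [A]_k||_F^2 / (m - k)
print(err, bound)
```

In this toy run the measured spectral-norm error stays below the bound, as guaranteed by (1).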
3. Online Newton Methods

For ease of demonstrating our work, we introduce sketching techniques in online learning scenarios. We first review the background of convex online learning, including online Newton step algorithms, and then discuss the connection between online learning and sketched second order methods, which motivates us to propose a more robust sketching algorithm.

3.1. Convex Online Learning

Online learning is performed in a sequence of consecutive rounds (Shalev-Shwartz, 2011). We consider the problem of convex online optimization as follows. For a sequence of examples $\{x^{(t)} \in \mathbb{R}^d\}$ and convex smooth loss functions $\{f_t : \mathcal{K}_t \to \mathbb{R}\}$, where $f_t(w) \triangleq \ell_t(w^\top x^{(t)})$ and the $\mathcal{K}_t \subset \mathbb{R}^d$ are convex compact sets, the learner outputs a predictor $w^{(t)}$ and suffers the loss $f_t(w^{(t)})$ at the $t$-th round. The cumulative regret at round $T$ is defined as
$$R_T(w^*) = \sum_{t=1}^T f_t(w^{(t)}) - \sum_{t=1}^T f_t(w^*),$$
where $w^* = \mathop{\mathrm{argmin}}_{w \in \mathcal{K}} \sum_{t=1}^T f_t(w)$ and $\mathcal{K} = \bigcap_{t=1}^T \mathcal{K}_t$.

We make the following assumptions on the loss functions.

Assumption 1 The loss functions $\ell_t$ satisfy $|\ell_t'(z)| \le L$ whenever $|z| \le C$, where $L$ and $C$ are positive constants.

Assumption 2 There exists a $\mu_t \ge 0$ such that for all $u, w \in \mathcal{K}$, we have
$$f_t(w) \ge f_t(u) + \nabla f_t(u)^\top (w - u) + \frac{\mu_t}{2} \big(\nabla f_t(u)^\top (w - u)\big)^2.$$

Note that for a loss function $f_t$ whose domain and gradient have bounded diameter, holding Assumption 2 only requires the exp-concave property, which is more general than strong convexity (Hazan, 2016). For example, the square loss function $f_t(w) = (w^\top x_t - y_t)^2$ satisfies Assumption 2 with $\mu_t = \frac{1}{8C^2}$ if the function is subject to the constraints $|w^\top x_t| \le C$ and $|y_t| \le C$ (Luo et al., 2016), but it is not strongly convex.

One typical online learning algorithm is online gradient descent (OGD) (Hazan et al., 2007; Zinkevich, 2003). At the $(t{+}1)$-th round, OGD exploits the following update rules:
$$u^{(t+1)} = w^{(t)} - \beta_t g^{(t)}, \qquad w^{(t+1)} = \mathop{\mathrm{argmin}}_{w \in \mathcal{K}_{t+1}} \|w - u^{(t+1)}\|,$$
where $g^{(t)} = \nabla f_t(w^{(t)})$ and $\beta_t$ is the learning rate. The algorithm has linear computational cost per round and achieves an $O(\frac{L^2}{H} \log T)$ regret bound for $H$-strongly convex losses.

In this paper, we are more interested in online Newton step algorithms (Hazan et al., 2007; Luo et al., 2016). The standard online Newton step keeps the curvature information in a matrix $H^{(t)} \in \mathbb{R}^{d \times d}$ sequentially and iterates as follows:
$$u^{(t+1)} = w^{(t)} - \beta_t (H^{(t)})^{-1} g^{(t)}, \qquad w^{(t+1)} = \mathop{\mathrm{argmin}}_{w \in \mathcal{K}_{t+1}} \|w - u^{(t+1)}\|_{H^{(t)}}. \qquad (2)$$
The matrix $H^{(t)}$ is constructed from the outer products of historical gradients (Duchi et al., 2011; Luo et al., 2016), such as
$$H^{(t)} = \sum_{i=1}^t g^{(i)} (g^{(i)})^\top + \alpha_0 I_d, \qquad (3)$$
or
$$H^{(t)} = \sum_{i=1}^t (\mu_i + \eta_i) g^{(i)} (g^{(i)})^\top + \alpha_0 I_d, \qquad (4)$$
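As a point of reference for the sketched variants discussed later, the following NumPy snippet (our own toy illustration with a fixed learning rate and with the projection step in (2) omitted; it is not the paper's RFD-SON algorithm) runs the online Newton step (2) with the curvature matrix (3) on a squared-loss stream.

```python
import numpy as np

def online_newton_step(X, Y, alpha0=1.0, beta=1.0):
    """Toy online Newton step per (2)-(3) for squared loss f_t(w) = (w^T x_t - y_t)^2.

    Our own simplified illustration: fixed step size beta, no projection onto K_{t+1}.
    """
    T, d = X.shape
    w = np.zeros(d)
    H = alpha0 * np.eye(d)                     # H^{(0)} = alpha_0 I_d
    for x, y in zip(X, Y):
        g = 2.0 * (w @ x - y) * x              # g^{(t)} = grad f_t(w^{(t)})
        H += np.outer(g, g)                    # H^{(t)} = sum_i g^{(i)}(g^{(i)})^T + alpha_0 I_d
        w = w - beta * np.linalg.solve(H, g)   # u^{(t+1)} = w^{(t)} - beta (H^{(t)})^{-1} g^{(t)}
        # A projection onto K_{t+1} in the H^{(t)}-norm would follow here; omitted.
    return w

# Usage on a synthetic regression stream.
rng = np.random.default_rng(0)
d, T = 5, 200
w_true = rng.standard_normal(d)
X = rng.standard_normal((T, d))
Y = X @ w_true + 0.01 * rng.standard_normal(T)
print(np.linalg.norm(online_newton_step(X, Y) - w_true))
```

Each round stores the $d \times d$ matrix $H^{(t)}$ and solves a linear system with it, i.e. $O(d^2)$ memory and up to $O(d^3)$ time; this cost is precisely what motivates replacing $H^{(t)}$ with an FD- or RFD-style sketch in the following sections.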