where $\alpha_0 \geq 0$ is a fixed regularization parameter, $\mu_t$ is the constant in Assumption 2, and $\eta_t$ is typically chosen as $O(1/t)$. The second-order algorithms enjoy logarithmic regret bounds without the strong convexity assumption, but require quadratic space and computation cost. Some variants of online Newton algorithms have been applied to optimize neural networks (Martens and Grosse, 2015; Grosse and Martens, 2016; Ba et al., 2017), but they do not provide theoretical guarantees in nonconvex cases.

3.2. Efficient Algorithms by Sketching

To make the online Newton step scalable, it is natural to use sketching techniques (Woodruff, 2014). The matrix $H^{(t)}$ in online learning has the form $H^{(t)} = (A^{(t)})^\top A^{(t)} + \alpha_0 I_d$, where $A^{(t)} \in \mathbb{R}^{t \times d}$ is the corresponding term of (3) or (4), such as $A^{(t)} = [g^{(1)}, \dots, g^{(t)}]^\top$ or $A^{(t)} = [\sqrt{\mu_1 + \eta_1}\, g^{(1)}, \dots, \sqrt{\mu_t + \eta_t}\, g^{(t)}]^\top$. The sketching algorithm employs an approximation of $(A^{(t)})^\top A^{(t)}$ by $(B^{(t)})^\top B^{(t)}$, where the sketch matrix $B^{(t)} \in \mathbb{R}^{m \times d}$ is much smaller than $A^{(t)}$ and $m \ll d$. Then we can use $(B^{(t)})^\top B^{(t)} + \alpha_0 I_d$ to replace $H^{(t)}$ in update (2) (Luo et al., 2016). By the Woodbury identity, we can reduce the computation of the update from $O(d^2)$ to $O(m^2 d)$ or $O(md)$. There are several choices of sketching techniques, such as random projection (Achlioptas, 2003; Indyk and Motwani, 1998; Kane and Nelson, 2014), frequent directions (Ghashami et al., 2016; Liberty, 2013) and Oja's algorithm (Oja, 1982; Oja and Karhunen, 1985). However, all of the above methods treat $\alpha_0$ as a given hyperparameter that is independent of the sketch matrix $B^{(t)}$. In practice, the performance of sketched online Newton methods is sensitive to the choice of the hyperparameter $\alpha_0$.

4. Robust Frequent Directions

In many machine learning applications such as online learning (Hazan and Arora, 2006; Hazan et al., 2007; Hazan, 2016; Luo et al., 2016), Gaussian process regression (Rasmussen and Williams, 2006) and kernel ridge regression (Drineas and Mahoney, 2005), we usually require an additional regularization term to make the matrix invertible and well-conditioned, whereas conventional sketching methods focus only on low-rank approximation. On the other hand, the update of frequent directions is not optimal from the viewpoint of minimizing the approximation error in each iteration. Both observations motivate us to propose robust frequent directions (RFD), which incorporates the update of the sketch matrix and the regularization term into one framework.

4.1. The Algorithm

RFD approximates $A^\top A$ by $B^\top B + \alpha I_d$ with $\alpha > 0$. We present the detailed implementation of RFD in Algorithm 2.

The main difference between RFD and conventional sketching algorithms is the additional term $\alpha I_d$. We can also use Algorithm 2 directly with the initialization $\alpha^{(m-1)} = \alpha_0 > 0$ if the target matrix is $A^\top A + \alpha_0 I_d$. Compared with the standard FD, RFD only needs to maintain one extra variable $\alpha^{(t)}$ by scalar operations in each iteration; hence the cost of RFD is almost the same as that of FD. Because the value of $\alpha^{(t)}$ typically increases from the $(m+1)$-th round onward in practice, the resulting $B^\top B + \alpha I_d$ is positive definite even if the initial $\alpha^{(0)}$ is zero. We can also further accelerate the algorithm by doubling the space.
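To make the update concrete, the following is a minimal NumPy sketch of the procedure described above; the formal pseudocode is given in Algorithm 2 below. The function name `rfd`, the buffering scheme, and the `alpha0` argument are ours and purely illustrative; this is a sketch under the stated assumptions, not the reference implementation.

```python
import numpy as np

def rfd(A, m, alpha0=0.0):
    """Minimal sketch of robust frequent directions (cf. Algorithm 2).

    Streams the rows of A and maintains a sketch B with m - 1 rows plus a
    scalar alpha, so that A^T A is approximated by B^T B + alpha * I_d.
    Illustrative code only; assumes A has at least m rows and d >= m.
    """
    T, d = A.shape
    B = A[:m - 1].astype(float)          # B^(m-1) = [a^(1), ..., a^(m-1)]^T
    alpha = alpha0                       # alpha^(m-1) = alpha_0
    for t in range(m - 1, T):
        B_hat = np.vstack([B, A[t]])     # append the new row (a^(t))^T
        _, s, Vt = np.linalg.svd(B_hat, full_matrices=False)
        shrink = s[m - 1] ** 2           # sigma_m^2: smallest squared singular value
        B = np.sqrt(s[:m - 1] ** 2 - shrink)[:, None] * Vt[:m - 1]
        alpha += shrink / 2.0            # alpha^(t) = alpha^(t-1) + sigma_m^2 / 2
    return B, alpha
```

For example, calling `rfd(A, m=10)` on a matrix `A` with 1000 rows and 50 columns returns a 9 x 50 sketch `B` and a scalar `alpha` such that `B.T @ B + alpha * np.eye(50)` approximates `A.T @ A`; Theorems 1 to 3 below quantify the quality of this approximation.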
Algorithm 2 Robust Frequent Directions
1: Input: $A = [a^{(1)}, \dots, a^{(T)}]^\top \in \mathbb{R}^{T \times d}$, $B^{(m-1)} = [a^{(1)}, \dots, a^{(m-1)}]^\top$, $\alpha^{(m-1)} = 0$
2: for $t = m, \dots, T$ do
3:   $\widehat{B}^{(t-1)} = \begin{bmatrix} B^{(t-1)} \\ (a^{(t)})^\top \end{bmatrix}$
4:   Compute the SVD: $\widehat{B}^{(t-1)} = U^{(t-1)} \Sigma^{(t-1)} (V^{(t-1)})^\top$
5:   $B^{(t)} = \sqrt{\big(\Sigma_{m-1}^{(t-1)}\big)^2 - \big(\sigma_m^{(t-1)}\big)^2 I_{m-1}} \cdot \big(V_{m-1}^{(t-1)}\big)^\top$
6:   $\alpha^{(t)} = \alpha^{(t-1)} + \big(\sigma_m^{(t-1)}\big)^2 / 2$
7: end for
8: Output: $B = B^{(T)}$ and $\alpha = \alpha^{(T)}$.

4.2. Theoretical Analysis

Before presenting the theoretical results for RFD, we review FD from the perspective of low-rank approximation, which provides the motivation for the design of our algorithm. At the $t$-th iteration of FD (Algorithm 1), we have the matrix $B^{(t-1)}$, which is used to approximate $(A^{(t-1)})^\top A^{(t-1)}$ by $(B^{(t-1)})^\top B^{(t-1)}$, and we aim to construct a new approximation that includes the new data $a^{(t)}$, that is,

$(B^{(t)})^\top B^{(t)} \approx (B^{(t-1)})^\top B^{(t-1)} + a^{(t)} (a^{(t)})^\top = (\widehat{B}^{(t-1)})^\top \widehat{B}^{(t-1)}$.  (5)

The straightforward way to find $B^{(t)}$ is to minimize the approximation error of (5) in the spectral norm under a low-rank constraint:

$B'^{(t)} = \operatorname*{argmin}_{\operatorname{rank}(C) = m-1} \big\| (\widehat{B}^{(t-1)})^\top \widehat{B}^{(t-1)} - C^\top C \big\|_2$.  (6)

By the SVD of $\widehat{B}^{(t-1)}$, we have the solution $B'^{(t)} = \Sigma_{m-1}^{(t-1)} \big(V_{m-1}^{(t-1)}\big)^\top$. From this viewpoint, the update of FD,

$B^{(t)} = \sqrt{\big(\Sigma_{m-1}^{(t-1)}\big)^2 - \big(\sigma_m^{(t-1)}\big)^2 I_{m-1}} \cdot \big(V_{m-1}^{(t-1)}\big)^\top$,  (7)

looks imperfect, because it is not an optimal low-rank approximation. However, the shrinkage operation in (7) is necessary. If we take a greedy strategy (Brand, 2002; Hall et al., 1998; Levey and Lindenbaum, 2000; Ross et al., 2008) that directly replaces $B^{(t)}$ with $B'^{(t)}$ in FD, it performs worse in some specific cases (we provide an example in Appendix F) and also has no valid global error bound like (1).

Hence the question is: can we devise a method that enjoys optimality in each step and at the same time maintains a tighter global error bound? Fortunately, RFD is exactly such an algorithm and holds both properties. We now explain the update rule of RFD formally and provide the approximation error bound.
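Before turning to the analysis, a small numerical check (ours, purely illustrative) makes the gap concrete: it performs a single update on a random $\widehat{B}^{(t-1)}$ and compares the spectral-norm error of the FD update (7) with that of the RFD pair $(B^{(t)}, \alpha^{(t)})$ from Algorithm 2, measured against the target $(\widehat{B}^{(t-1)})^\top \widehat{B}^{(t-1)}$. On a single step the FD error equals $(\sigma_m^{(t-1)})^2$, while the regularized pair halves it, consistent with the per-step optimality established in Theorems 1 and 2 below.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 10, 50
B_hat = rng.standard_normal((m, d))            # plays the role of \hat{B}^{(t-1)}
target = B_hat.T @ B_hat

_, s, Vt = np.linalg.svd(B_hat, full_matrices=False)
shrink = s[m - 1] ** 2                          # (sigma_m^{(t-1)})^2

# FD update (7): shrink every kept direction, no regularization term.
B_fd = np.sqrt(s[:m - 1] ** 2 - shrink)[:, None] * Vt[:m - 1]
err_fd = np.linalg.norm(target - B_fd.T @ B_fd, 2)

# RFD update (Algorithm 2): the same B plus alpha = sigma_m^2 / 2 on the identity.
alpha = shrink / 2.0
err_rfd = np.linalg.norm(target - (B_fd.T @ B_fd + alpha * np.eye(d)), 2)

print(err_fd / shrink, err_rfd / shrink)        # prints ~1.0 and ~0.5
assert np.isclose(err_fd, shrink) and np.isclose(err_rfd, shrink / 2.0)
```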
We first give the following theorem, which plays an important role in our analysis.

Theorem 1 Given a positive semi-definite matrix $M \in \mathbb{R}^{d \times d}$ and a positive integer $k < d$, let $M = U \Sigma U^\top$ be the SVD of $M$. Let $U_k$ denote the matrix of the first $k$ columns of $U$ and $\sigma_k$ be the $k$-th singular value of $M$. Then the pair $(\widehat{C}, \widehat{\delta})$, defined as

$\widehat{C} = U_k (\Sigma_k - \xi I_k)^{1/2} Q$ and $\widehat{\delta} = (\sigma_{k+1} + \sigma_d)/2$,

where $\xi \in [\sigma_d, \sigma_{k+1}]$ and $Q$ is an arbitrary $k \times k$ orthonormal matrix, is the global minimizer of

$\min_{C \in \mathbb{R}^{d \times k},\, \delta \in \mathbb{R}} \big\| M - (C C^\top + \delta I_d) \big\|_2$.  (8)

Additionally, we have $\| M - (\widehat{C} \widehat{C}^\top + \widehat{\delta} I_d) \|_2 \le \| M - U_k \Sigma_k U_k^\top \|_2$, and the equality holds if and only if $\operatorname{rank}(M) \le k$.

Theorem 1 provides, in closed form, the optimal solution for matrix approximation with a regularization term. In the case of $\operatorname{rank}(M) > k$, the approximation $\widehat{C}\widehat{C}^\top + \widehat{\delta} I_d$ is full rank and has strictly lower spectral-norm error than the rank-$k$ truncated SVD. Note that Zhang (2014) established the corresponding optimality result under the Frobenius norm (we also give a concise proof of that result in Appendix B).

Recall that in the streaming case, our goal at the $t$-th round is to approximate the combination of the historical approximation and the current data. The following theorem shows that the update of RFD is optimal with respect to the spectral norm in each step.

Theorem 2 Based on the updates in Algorithm 2, we have

$(B^{(t)}, \alpha^{(t)}) = \operatorname*{argmin}_{B \in \mathbb{R}^{(m-1) \times d},\, \alpha \in \mathbb{R}} \big\| (\widehat{B}^{(t-1)})^\top \widehat{B}^{(t-1)} + \alpha^{(t-1)} I_d - (B^\top B + \alpha I_d) \big\|_2$.  (9)

Theorem 2 explains RFD from an optimization viewpoint: each step of RFD is optimal given the current information. In light of this theorem, the update of the standard FD corresponds to $(B, \alpha) = (B^{(t)}, 0)$, which is not the optimal solution of (9). Intuitively, the regularization term of RFD compensates every direction for the over-reduction caused by the shrinkage operation of FD. Theorem 2 also implies that RFD is an online extension of the approximation in Theorem 1. We can prove Theorem 2 by applying Theorem 1 with $M = (\widehat{B}^{(t-1)})^\top \widehat{B}^{(t-1)} + \alpha^{(t-1)} I_d$; we defer the details to Appendix C.

RFD also enjoys a tighter approximation error bound than FD, as the following theorem shows.

Theorem 3 For any $k < m$ and using the notation of Algorithm 2, we have

$\big\| A^\top A - (B^\top B + \alpha I_d) \big\|_2 \le \frac{1}{2(m-k)} \| A - [A]_k \|_F^2$,  (10)

where $[A]_k$ is the best rank-$k$ approximation to $A$ in both the Frobenius and spectral norms.

The right-hand side of inequality (10) is half of that in (1), which means that RFD reduces the approximation error significantly with only one extra scalar.
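As an illustrative check of the bound (10) (our code, not part of the paper), the following snippet runs the sketch over a random stream and compares the spectral error of $B^\top B + \alpha I_d$ against the right-hand side of (10); the FD error $\| A^\top A - B^\top B \|_2$ is printed alongside, since FD produces the same $B$ and differs only in dropping $\alpha$. The `rfd` function is repeated from the Section 4.1 snippet so that the block is self-contained, and the synthetic spectrum decay is an arbitrary choice.

```python
import numpy as np

def rfd(A, m, alpha0=0.0):
    # Same sketch as in the Section 4.1 snippet: stream rows, keep m - 1 of them.
    B, alpha = A[:m - 1].astype(float), alpha0
    for t in range(m - 1, A.shape[0]):
        _, s, Vt = np.linalg.svd(np.vstack([B, A[t]]), full_matrices=False)
        shrink = s[m - 1] ** 2
        B = np.sqrt(s[:m - 1] ** 2 - shrink)[:, None] * Vt[:m - 1]
        alpha += shrink / 2.0
    return B, alpha

rng = np.random.default_rng(1)
T, d, m, k = 2000, 100, 20, 10
# Give A a decaying spectrum so that the rank-k tail is informative.
A = rng.standard_normal((T, d)) * np.exp(-0.1 * np.arange(d))

B, alpha = rfd(A, m)
G = A.T @ A
err_rfd = np.linalg.norm(G - (B.T @ B + alpha * np.eye(d)), 2)
err_fd = np.linalg.norm(G - B.T @ B, 2)          # same B, alpha dropped

tail = np.linalg.svd(A, compute_uv=False)[k:]     # singular values of A beyond k
bound = (tail ** 2).sum() / (2 * (m - k))         # RHS of (10)
print(err_rfd <= bound, err_fd, err_rfd, bound)
```

On data with a decaying spectrum the RFD error typically also comes out visibly below the FD error, in line with the halved bound, although Theorem 3 itself only compares the upper bounds.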
Real applications usually involve a matrix with a regularization term. Hence we also consider approximating the matrix $M = A^\top A + \alpha_0 I_d$, where $\alpha_0 > 0$ and the rows of $A$ are available in sequential order. Suppose that the standard FD approximates $A^\top A$ by $B^\top B$; then it estimates $M$ as $M_{\mathrm{FD}} = B^\top B + \alpha_0 I_d$. Meanwhile, RFD generates the approximation $M_{\mathrm{RFD}} = B^\top B + \alpha I_d$ by setting $\alpha^{(m-1)} = \alpha_0$. Theorem 4 shows that the condition number of $M_{\mathrm{RFD}}$ is better than those of $M_{\mathrm{FD}}$ and $M$. In general, the equality in Theorem 4 usually cannot be achieved for $t > m$ unless $(a^{(t)})^\top$ lies exactly in the row space of $B^{(t-1)}$ or the first $t$ rows of $A$ have perfect low-rank structure. Hence RFD is more likely than the alternatives to generate a well-conditioned approximation.

Theorem 4 With the notation of Algorithms 1 and 2, let $M = A^\top A + \alpha_0 I_d$, $M_{\mathrm{FD}} = B^\top B + \alpha_0 I_d$, $M_{\mathrm{RFD}} = B^\top B + \alpha I_d$ and $\alpha^{(m-1)} = \alpha_0$, where $\alpha_0 > 0$ is a fixed scalar. Then we have $\kappa(M_{\mathrm{RFD}}) \le \kappa(M_{\mathrm{FD}})$ and $\kappa(M_{\mathrm{RFD}}) \le \kappa(M)$.

5. The Online Newton Step by RFD

We now present the sketched online Newton step by robust frequent directions (RFD-SON). The procedure is shown in Algorithm 3, which is similar to the sketched online Newton step (SON) algorithms (Luo et al., 2016) but uses the new sketching method RFD.

Algorithm 3 RFD for Online Newton Step
1: Input: $\alpha^{(0)} = \alpha_0 \geq 0$, $m < d$, $\eta_t = O(1/t)$, $w^{(0)} = \mathbf{0}_d$, and $B^{(0)}$ empty.
2: for $t = 0, \dots, T-1$ do
3:   Receive example $x^{(t)}$ and loss function $f_t(w)$
4:   Predict the output of $x^{(t)}$ by $w^{(t)}$ and suffer loss $f_t(w^{(t)})$
5:   $g^{(t)} = \nabla f_t(w^{(t)})$
6:   $\widehat{B}^{(t-1)} = \begin{bmatrix} B^{(t-1)} \\ \sqrt{\mu_t + \eta_t}\, (g^{(t)})^\top \end{bmatrix}$
7:   Compute the SVD: $\widehat{B}^{(t-1)} = U^{(t-1)} \Sigma^{(t-1)} (V^{(t-1)})^\top$
8:   $B^{(t)} = \sqrt{\big(\Sigma_{m-1}^{(t-1)}\big)^2 - \big(\sigma_m^{(t-1)}\big)^2 I_{m-1}} \cdot \big(V_{m-1}^{(t-1)}\big)^\top$
9:   $\alpha^{(t)} = \alpha^{(t-1)} + \big(\sigma_m^{(t-1)}\big)^2 / 2$
10:  $H^{(t)} = (B^{(t)})^\top B^{(t)} + \alpha^{(t)} I_d$
11:  $u^{(t+1)} = w^{(t)} - (H^{(t)})^{\dagger} g^{(t)}$
12:  $w^{(t+1)} = \operatorname*{argmin}_{w \in \mathcal{K}_{t+1}} \| w - u^{(t+1)} \|_{H^{(t)}}$
13: end for

The matrix $H^{(t)}$ in Line 10 is not constructed explicitly in practice; it is introduced only for ease of analysis. The updates of $u^{(t)}$ and $w^{(t)}$ can be performed in $O(md)$ time and space by the Woodbury identity; we give the details in the appendix. When $d$ is large, RFD-SON is much more efficient than the standard online Newton step with the full Hessian, which requires $O(d^2)$ in both time and space.
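To illustrate how Lines 10 and 11 of Algorithm 3 avoid forming the $d \times d$ matrix, the following is a minimal NumPy sketch (ours; the paper's own derivation is deferred to its appendix) of computing the direction $(H^{(t)})^{-1} g^{(t)}$ directly from the pair $(B^{(t)}, \alpha^{(t)})$. It assumes $\alpha^{(t)} > 0$, so the pseudo-inverse reduces to an ordinary inverse, and the function name is ours.

```python
import numpy as np

def newton_direction(B, alpha, g):
    """Compute (B^T B + alpha * I_d)^{-1} g without forming the d x d matrix.

    Uses the Woodbury identity
        (alpha*I_d + B^T B)^{-1} = (I_d - B^T (alpha*I + B B^T)^{-1} B) / alpha,
    where B has only a small number of rows (m - 1 in Algorithm 3), so the
    linear system solved is tiny.  Illustrative sketch; assumes alpha > 0.
    """
    K = alpha * np.eye(B.shape[0]) + B @ B.T   # small (m-1) x (m-1) matrix
    y = np.linalg.solve(K, B @ g)
    return (g - B.T @ y) / alpha
```

Each call costs $O(m^2 d)$ time and $O(md)$ space; the $O(md)$ per-round time mentioned above additionally requires reusing computations across rounds, which this sketch does not attempt.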
Note that we do not require the hyperparameter $\alpha_0$ to be strictly positive in RFD-SON. In practice, RFD-SON always achieves good performance by setting $\alpha_0 = 0$, which leads to a hyperparameter-free algorithm, while the existing SON algorithm needs to select $\alpha_0$ carefully. We consider the general case $\alpha_0 \geq 0$ in this section for ease of analysis.

Theorem 5 Let $\mu = \min_{t=1,\dots,T} \{\mu_t\}$ and $\mathcal{K} = \bigcap_{t=1}^{T} \mathcal{K}_t$. Then under Assumptions 1 and 2, for any $w \in \mathcal{K}$, Algorithm 3 has the following regret for $\alpha_0 > 0$:

$R_T(w) \le \frac{\alpha_0}{2} \| w \|^2 + 2 (CL)^2 \sum_{t=1}^{T} \eta_t + \frac{m}{2(\mu + \eta_T)} \ln\!\Big( \frac{\operatorname{tr}\big((B^{(T)})^\top B^{(T)}\big)}{m \alpha_0} + \frac{\alpha^{(T)}}{\alpha_0} \Big) + \Omega_{\mathrm{RFD}}$,  (11)

where

$\Omega_{\mathrm{RFD}} = \frac{d - m}{2(\mu + \eta_T)} \ln \frac{\alpha^{(T)}}{\alpha_0} + \frac{m}{4(\mu + \eta_T)} \sum_{t=1}^{T} \frac{(\sigma_m^{(t)})^2}{\alpha^{(t)}} + C^2 \sum_{t=1}^{T} (\sigma_m^{(t-1)})^2$.

We present the regret bound of RFD-SON for positive $\alpha_0$ in Theorem 5. The term $\Omega_{\mathrm{RFD}}$ in (11) is the main gap between RFD-SON and the standard online Newton step without sketching. $\Omega_{\mathrm{RFD}}$ is dominated by its last term, which can be bounded as in (1). If we apply the standard FD to the sketched online Newton step (Luo et al., 2016), which yields FD-SON, the regret bound is similar to (11) but the gap becomes

$\Omega_{\mathrm{FD}} = \frac{m \Omega_k}{2(m - k)(\mu + \eta_T)\alpha_0}$,

where $\Omega_k$ plays a role similar to the term $\sum_{t=1}^{T} (\sigma_m^{(t-1)})^2$ in RFD-SON; its detailed definition can be found in Luo et al. (2016). This result depends heavily on the hyperparameter $\alpha_0$. If we increase the value of $\alpha_0$, the gap $\Omega_{\mathrm{FD}}$ can be reduced, but the term $\frac{\alpha_0}{2}\| w \|^2$ in the bound will increase, and vice versa. In other words, we need to trade off $\frac{\alpha_0}{2}\| w \|^2$ and $\Omega_{\mathrm{FD}}$ by tuning $\alpha_0$ carefully. For RFD-SON, Theorem 5 implies that we can set $\alpha_0$ sufficiently small to reduce $\frac{\alpha_0}{2}\| w \|^2$, and this has limited effect on the term $\Omega_{\mathrm{RFD}}$. The reason is that the first term of $\Omega_{\mathrm{RFD}}$ contains $\frac{1}{\alpha_0}$ inside the logarithm and the second term contains $\alpha^{(t)} = \alpha_0 + \frac{1}{2} \sum_{i=1}^{t-1} (\sigma_m^{(i)})^2$ in the denominator. For large $t$, $\alpha^{(t)}$ depends mainly on $\sum_{i=1}^{t-1} (\sigma_m^{(i)})^2$ rather than on $\alpha_0$. Hence the regret bound of RFD-SON is much less sensitive to the hyperparameter $\alpha_0$ than that of FD-SON. We also have $\sum_{t=1}^{T} (\sigma_m^{(t-1)})^2 \le \| A - [A]_k \|_F^2 / (m - k)$ for $k < m$ by using (17) with $A = \sum_{t=1}^{T} (\mu_t + \eta_t)\, g^{(t)} (g^{(t)})^\top$.

Consider RFD-SON with $\alpha_0 = 0$. Typically, the parameter $\alpha^{(t)}$ is zero for only the first few iterations and becomes strictly positive afterwards. Hence the learning algorithm can be divided into two phases according to whether $\alpha^{(t)}$ is zero. Suppose that $T'$ satisfies

$\alpha^{(t)} \begin{cases} = 0, & t < T' \\ > 0, & t \ge T'. \end{cases}$  (12)

Then the regret can be written as $R_T(w) = R_{1:T'}(w) + R_{(T'+1):T}(w)$,