Lemma 1 immediately implies a static regret bound. Specifically, by choosing the comparators as the best decision in hindsight $u_1 = \dots = u_T \in \arg\min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x)$ such that the path length $P_T = 0$, we obtain the existing result (Chiang et al., 2012, Theorem 11):
$$\sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x) \le \eta (G^2 + 2V_T) + \frac{D^2}{2\eta} = \mathcal{O}\big(\sqrt{1 + V_T}\big),$$
where the last step holds by setting the step size $\eta = \eta^* = \min\big\{\sqrt{\tfrac{D^2}{G^2 + 2V_T}}, \tfrac{1}{4L}\big\}$. Note that the requirement of knowing $V_T$ in the optimal tuning can be removed by either the doubling trick (Cesa-Bianchi et al., 1997) or self-confident tuning (Auer et al., 2002).

However, it is more complicated when competing with a sequence of time-varying comparators. The dynamic regret bound exhibited in Lemma 1 suggests that it is crucial to tune the step size to balance the non-stationarity (path length $P_T$) and the adaptivity (gradient variation $V_T$) in order to achieve a tight dynamic regret bound. Clearly, the optimal tuning is $\eta^* = \sqrt{(D^2 + 2DP_T)/(2G^2 + 2V_T)}$, which unfortunately requires the prior information of $P_T$ and $V_T$ that is generally unavailable. We emphasize that $V_T$ is empirically observable in the sense that at round $t \in [T]$ one can observe its internal estimate $V_t = \sum_{s=2}^{t} \sup_{x \in \mathcal{X}} \|\nabla f_s(x) - \nabla f_{s-1}(x)\|_2^2$. By contrast, $P_T = \sum_{t=2}^{T} \|u_t - u_{t-1}\|_2$ remains unknown even after all iterations, since the comparators $u_1, \dots, u_T$ can be chosen arbitrarily as long as they are feasible in the domain, and thus are entirely unknown. Consequently, we may use self-confident tuning to remove the dependence on the unknown gradient variation $V_T$, but the method cannot address the unknown path length $P_T$. In fact, this is the fundamental problem of non-stationary online learning: how to deal with uncertainty due to unknown environmental non-stationarity, captured by the path length of comparators $P_T$ in the language of dynamic regret minimization.

To simultaneously handle the uncertainty arising from adaptivity and non-stationarity, in addition to using optimistic online learning to reuse the historical gradients, we further design an adaptive online ensemble method that can hedge the non-stationarity while extracting the adaptivity. Our approach deploys a two-layer meta-base structure, in which multiple base-learners are maintained simultaneously and a meta-algorithm is used to track the best one. More concretely, we first construct a pool of candidate step sizes to discretize the value range of the optimal step size, and then initialize multiple base-learners simultaneously, denoted by $\mathcal{B}_1, \dots, \mathcal{B}_N$. Each base-learner $\mathcal{B}_i$ returns her own prediction $x_{t,i}$ by running the base-algorithm with a particular step size $\eta_i$ from the pool. Finally, the intermediate predictions of all the base-learners are combined by a meta-algorithm to produce the final output $x_t = \sum_{i=1}^{N} p_{t,i}\, x_{t,i}$, where $p_t \in \Delta_N$ is the weight distribution.

According to the above procedure, we can decompose the dynamic regret into two parts because of the two-layer meta-base structure,
$$\text{D-Regret}_T = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t) = \underbrace{\sum_{t=1}^{T} \big(f_t(x_t) - f_t(x_{t,i})\big)}_{\text{meta-regret}} + \underbrace{\sum_{t=1}^{T} \big(f_t(x_{t,i}) - f_t(u_t)\big)}_{\text{base-regret}}, \qquad (10)$$
where $\{x_t\}_{t=1,\dots,T}$ denotes the final output sequence, and $\{x_{t,i}\}_{t=1,\dots,T}$ is the prediction sequence of base-learner $\mathcal{B}_i$. Notably, the decomposition holds for any base-learner's index $i \in [N]$.
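To make the two-layer structure concrete, here is a minimal Python sketch (not the authors' implementation) of the ensemble combination and of the decomposition (10); the losses, base predictions, and weights below are synthetic placeholders, and the only point is that (10) holds as an identity for an arbitrary base-learner index.

```python
# Minimal sketch (synthetic data): the ensemble output x_t = sum_i p_{t,i} x_{t,i}
# and the decomposition (10), which holds as an identity for any index i.
import numpy as np

rng = np.random.default_rng(0)
T, N, d = 50, 5, 3

X_base = rng.uniform(-1.0, 1.0, size=(T, N, d))   # base predictions x_{t,i}
P = rng.dirichlet(np.ones(N), size=T)             # meta weights p_t in the simplex
U = rng.uniform(-1.0, 1.0, size=(T, d))           # comparators u_t
Z = rng.uniform(-1.0, 1.0, size=(T, d))           # parameters of the toy losses

def f(x, t):                                      # toy loss f_t(x) = ||x - z_t||^2
    return float(np.sum((x - Z[t]) ** 2))

X_final = np.einsum("tn,tnd->td", P, X_base)      # x_t = sum_i p_{t,i} x_{t,i}

i = 2                                             # any base-learner index in [N]
dyn_regret = sum(f(X_final[t], t) - f(U[t], t) for t in range(T))
meta_regret = sum(f(X_final[t], t) - f(X_base[t, i], t) for t in range(T))
base_regret = sum(f(X_base[t, i], t) - f(U[t], t) for t in range(T))
assert np.isclose(dyn_regret, meta_regret + base_regret)
```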
The first part of (10) is the difference between the cumulative loss of the final output sequence and that of the prediction sequence of base-learner $\mathcal{B}_i$, which is introduced by the meta-algorithm and thus named the meta-regret; the second part is the dynamic regret of base-learner $\mathcal{B}_i$ and is therefore called the base-regret. As a result, we need to make both the meta-regret and the base-regret scale with $V_T$ to achieve the desired gradient-variation dynamic regret.

In the following, we present two solutions. The first solution, developed in our conference paper (Zhao et al., 2020b), is conceptually simpler but requires $N = \mathcal{O}(\log T)$ gradient queries at each round, making it suitable only for the multi-gradient feedback model. The second solution is an improved algorithm based on a refined analysis of the problem's structure, which attains the same dynamic regret guarantee with only one gradient per iteration and hence suits the more challenging one-gradient feedback model. Recall that the definitions of the multi/one-gradient feedback models are presented in Section 3.1.

4.2 Multi-Gradient Feedback: Sword

Our approach, Sword, implements a meta-base online ensemble structure, in which multiple base-learners are initiated simultaneously (denoted by $\mathcal{B}_1, \dots, \mathcal{B}_N$) and the intermediate predictions of all the base-learners are combined by a meta-algorithm to produce the final output. Below, we describe the specific settings of the base-algorithm and the meta-algorithm.

For the base-algorithm, we simply employ the OEGD algorithm (Chiang et al., 2012), where the base-learner $\mathcal{B}_i$ shall update her local decision $\{x_{t,i}\}_{t=1,\dots,T}$ by
$$x_{t,i} = \Pi_{\mathcal{X}}\big[\hat{x}_{t,i} - \eta_i \nabla f_{t-1}(x_{t-1,i})\big], \qquad \hat{x}_{t+1,i} = \Pi_{\mathcal{X}}\big[\hat{x}_{t,i} - \eta_i \nabla f_t(x_{t,i})\big], \qquad (11)$$
where $\eta_i > 0$ is the associated step size from the step size pool $\mathcal{H} = \{\eta_1, \dots, \eta_N\}$ and the number of base-learners is chosen as $N = \mathcal{O}(\log T)$. Lemma 1 ensures an upper bound on the base-regret scaling with the gradient variation, i.e., $\sum_{t=1}^{T} f_t(x_{t,i}) - \sum_{t=1}^{T} f_t(u_t) \le \mathcal{O}(\eta_i(1 + V_T) + P_T/\eta_i)$, whenever the step size satisfies $\eta_i \le 1/(4L)$. The caveat is that each base-learner requires her own gradient direction for the update, so we need the gradient information of $\{\nabla f_t(x_{t,i})\}_{i=1,\dots,N}$ at round $t \in [T]$. Notably, the gradient query complexity is $N = \mathcal{O}(\log T)$ per round instead of one as desired; consequently, the method developed in this part suits only the multi-gradient feedback model. We set the gradient query complexity issue aside for the moment and will resolve it in Section 4.3, where we design an improved algorithm applicable to the one-gradient feedback model.
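The following is a minimal Python sketch of a single base-learner's OEGD update (11). The Euclidean-ball domain, its projection, and the toy quadratic losses are illustrative assumptions made for this sketch, not part of the paper's general setting.

```python
# Minimal sketch of the OEGD base update (11) for one base-learner B_i.
# Illustrative assumption: X is the Euclidean unit ball.
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def oegd_predict(x_hat, grad_prev, eta):
    """x_{t,i} = Pi_X[ x_hat_{t,i} - eta_i * grad f_{t-1}(x_{t-1,i}) ]."""
    return project_ball(x_hat - eta * grad_prev)

def oegd_update(x_hat, grad_curr, eta):
    """x_hat_{t+1,i} = Pi_X[ x_hat_{t,i} - eta_i * grad f_t(x_{t,i}) ]."""
    return project_ball(x_hat - eta * grad_curr)

# Toy usage with quadratic losses f_t(x) = ||x - z_t||^2 revealed sequentially.
rng = np.random.default_rng(1)
d, T, eta = 3, 20, 0.1
x_hat = np.zeros(d)                               # x_hat_{1,i}
grad_prev = np.zeros(d)                           # no gradient is available before round 1
for t in range(T):
    x_t = oegd_predict(x_hat, grad_prev, eta)     # play x_{t,i}
    z_t = rng.uniform(-1.0, 1.0, size=d)          # environment reveals f_t
    grad_curr = 2.0 * (x_t - z_t)                 # grad f_t(x_{t,i})
    x_hat = oegd_update(x_hat, grad_curr, eta)    # form x_hat_{t+1,i}
    grad_prev = grad_curr                         # reused as the optimism next round
```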
The main difficulty lies in the design and analysis of an appropriate meta-algorithm. In order to be compatible with the gradient-variation base-regret, the meta-algorithm is required to incur a problem-dependent meta-regret of order $\mathcal{O}(\sqrt{V_T \ln N})$. However, the meta-algorithms used in existing studies (van Erven and Koolen, 2016; Zhang et al., 2018a) cannot satisfy this requirement. For example, vanilla Hedge suffers an $\mathcal{O}(\sqrt{T \ln N})$ meta-regret, which is problem-independent and thus not suitable for us. To this end, we design a novel variant of Hedge by leveraging the technique of optimistic online learning with carefully designed optimism, specifically for our problem. Consider the problem of prediction with expert advice. At the beginning of iteration $(t+1)$, in addition to the loss vector $\ell_t \in \mathbb{R}^N$ returned by the experts, the player can receive an optimism $m_{t+1} \in \mathbb{R}^N$. Optimistic Hedge updates the weight vector $p_{t+1} \in \Delta_N$ by
$$p_{t+1,i} \propto \exp\Big(-\varepsilon \Big(\sum_{s=1}^{t} \ell_{s,i} + m_{t+1,i}\Big)\Big), \quad \forall i \in [N]. \qquad (12)$$
Here, $\varepsilon > 0$ is the learning rate of the meta-algorithm, and we consider a fixed learning rate for simplicity (we adopt the terminology "learning rate" for the meta-algorithm of our approach, following the convention in prediction with expert advice, and use "step size" for the general online convex optimization). The optimism $m_{t+1} \in \mathbb{R}^N$ can be interpreted as an optimistic guess of the loss of round $t+1$, and we thus incorporate it into the cumulative loss for the update. It turns out that Optimistic Hedge can be regarded as an instance of OMD with the negative-entropy regularizer, as mentioned in Remark 1. Therefore, the general result of Theorem 1 implies the following static regret bound of Optimistic Hedge, and the proof can be found in Appendix A. Notably, the negative term shown in (13) will be of great importance in the algorithm design and regret analysis.

Lemma 2. The regret of Optimistic Hedge with a fixed learning rate $\varepsilon > 0$ with respect to any expert $i \in [N]$ is at most
$$\sum_{t=1}^{T} \langle p_t, \ell_t \rangle - \sum_{t=1}^{T} \ell_{t,i} \le \varepsilon \sum_{t=1}^{T} \|\ell_t - m_t\|_\infty^2 + \frac{\ln N}{\varepsilon} - \frac{1}{4\varepsilon} \sum_{t=2}^{T} \|p_t - p_{t-1}\|_1^2. \qquad (13)$$

Let $D_T = \sum_{t=1}^{T} \|\ell_t - m_t\|_\infty^2$ measure the deviation between the optimism and the gradient. With a proper learning rate tuning scheme, Optimistic Hedge enjoys an $\mathcal{O}(\sqrt{D_T \ln N})$ meta-regret.

The framework of optimistic online learning is very powerful for designing adaptive methods, in that the adaptivity quantity $D_T = \sum_{t=1}^{T} \|\ell_t - m_t\|_\infty^2$ is very general and can be specialized flexibly with different configurations of the feedback loss $\ell_t$ and the optimism $m_t$. To achieve the desired gradient-variation dynamic regret, we need to investigate the online ensemble structure carefully. To this end, we specialize Optimistic Hedge in the following way to make the meta-regret compatible with the desired gradient-variation quantity.

• The feedback loss $\ell_t \in \mathbb{R}^N$ is set as the linearized surrogate loss: for $t \in [T]$ and each $i \in [N]$,
$$\ell_{t,i} = \langle \nabla f_t(x_t), x_{t,i} \rangle. \qquad (14)$$

• The optimism $m_t \in \mathbb{R}^N$ is set with a careful design: $m_1 = 0$, and for $t \ge 2$ and each $i \in [N]$,
$$m_{t,i} = \langle \nabla f_{t-1}(\bar{x}_t), x_{t,i} \rangle, \quad \text{where } \bar{x}_t = \sum_{i=1}^{N} p_{t-1,i}\, x_{t,i}. \qquad (15)$$

We will explain the motivation of such designs in Remark 2. Note that this construction of the optimism is legitimate, as the instrumental variable $\bar{x}_t$ only uses the information of $p_{t-1}$ as well as the local decisions $\{x_{t,i}\}_{i=1,\dots,N}$ at time $t$. Thus, the meta-algorithm of Sword updates the weight $p_{t+1} \in \Delta_N$ by
$$p_{t+1,i} \propto \exp\Big(-\varepsilon \Big(\sum_{s=1}^{t} \langle \nabla f_s(x_s), x_{s,i} \rangle + \langle \nabla f_t(\bar{x}_{t+1}), x_{t+1,i} \rangle\Big)\Big), \quad \forall i \in [N]. \qquad (16)$$
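The following minimal Python sketch mirrors the meta-update (12)/(16) with the surrogate losses (14) and the optimism (15). The base predictions and the toy quadratic losses are synthetic placeholders and the learning rate is fixed; this is only an illustration of the update rule, not the authors' implementation.

```python
# Minimal sketch of Sword's meta-algorithm: Optimistic Hedge (12)/(16) with the
# linearized losses (14) and the optimism (15), on synthetic toy data.
import numpy as np

def optimistic_hedge(cum_loss, optimism, eps):
    """p_{t,i} proportional to exp(-eps * (sum_{s<t} ell_{s,i} + m_{t,i}))."""
    logits = -eps * (cum_loss + optimism)
    logits -= logits.max()                          # numerical stability
    w = np.exp(logits)
    return w / w.sum()

rng = np.random.default_rng(2)
T, N, d, eps = 30, 4, 3, 0.5
grad = lambda x, z: 2.0 * (x - z)                   # gradient of f(x) = ||x - z||^2

p = np.full(N, 1.0 / N)                             # p_0: uniform weights
cum_loss = np.zeros(N)
for t in range(T):
    X_t = rng.uniform(-1.0, 1.0, size=(N, d))       # placeholder base predictions x_{t,i}
    if t == 0:
        m = np.zeros(N)                             # m_1 = 0
    else:
        x_bar = p @ X_t                             # x_bar_t built with previous weights p_{t-1}
        m = X_t @ grad(x_bar, z_prev)               # m_{t,i} = <grad f_{t-1}(x_bar_t), x_{t,i}>
    p = optimistic_hedge(cum_loss, m, eps)          # weights p_t used at round t
    x_t = p @ X_t                                   # final output x_t = sum_i p_{t,i} x_{t,i}
    z_prev = rng.uniform(-1.0, 1.0, size=d)         # environment reveals f_t
    cum_loss += X_t @ grad(x_t, z_prev)             # ell_{t,i} = <grad f_t(x_t), x_{t,i}>, cf. (14)
```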
Algorithm 1 Sword: meta-algorithm
Input: step size pool $\mathcal{H}$; learning rate $\varepsilon$
1: Initialization: $\forall i \in [N]$, $p_{0,i} = 1/N$
2: for $t = 1$ to $T$ do
3:     Receive $x_{t+1,i}$ from base-learner $\mathcal{B}_i$
4:     Update the weight $p_{t+1,i}$ by (16)
5:     Predict $x_{t+1} = \sum_{i=1}^{N} p_{t+1,i}\, x_{t+1,i}$
6: end for

Algorithm 2 Sword: base-algorithm
Input: step size $\eta_i \in \mathcal{H}$
1: Let $\hat{x}_{1,i}$, $x_{1,i}$ be any point in $\mathcal{X}$
2: for $t = 1$ to $T$ do
3:     $\hat{x}_{t+1,i} = \Pi_{\mathcal{X}}\big[\hat{x}_{t,i} - \eta_i \nabla f_t(x_{t,i})\big]$
4:     $x_{t+1,i} = \Pi_{\mathcal{X}}\big[\hat{x}_{t+1,i} - \eta_i \nabla f_t(x_{t,i})\big]$
5:     Send $x_{t+1,i}$ to the meta-algorithm
6: end for

Algorithm 1 summarizes the detailed procedures of the meta-algorithm, which in conjunction with the base-algorithm of Algorithm 2 yields the Sword algorithm.

Remark 2 (design of optimism). The design of the optimism in (15), particularly the construction of the instrumental variable $\bar{x}_t$, is crucial and is the most challenging part of this method. Our design carefully leverages the structure of two-layer online ensemble methods; specifically, the goal of designing the optimism is to approximate the current gradient $\nabla f_t(x_t)$ (which is unknown) via the knowledge available up to round $t$. We propose to use $\nabla f_{t-1}(\bar{x}_t)$ as the approximation: the difference of online functions delivers the gradient-variation term $\sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$, while the difference between $x_t$ and $\bar{x}_t$ can be upper bounded by the decision variation of the meta-algorithm,
$$\|x_t - \bar{x}_t\|_2^2 = \Big\|\sum_{i=1}^{N} (p_{t,i} - p_{t-1,i})\, x_{t,i}\Big\|_2^2 \le \Big(\sum_{i=1}^{N} |p_{t,i} - p_{t-1,i}|\, \|x_{t,i}\|_2\Big)^2 \le D^2 \|p_t - p_{t-1}\|_1^2, \qquad (17)$$
which can be eliminated by the negative term in the regret bound of Optimistic Hedge shown in (13), provided a suitable setting of the learning rate of the meta-algorithm.

Summarizing, the aforementioned configurations of the feedback loss and the optimism convert the adaptivity quantity $D_T$ into the desired gradient variation $V_T$ plus the decision variation of the meta-algorithm; concretely,
$$\begin{aligned}
\|\ell_t - m_t\|_\infty^2 &= \max_{i \in [N]} \langle \nabla f_t(x_t) - \nabla f_{t-1}(\bar{x}_t), x_{t,i} \rangle^2 \\
&\le D^2 \|\nabla f_t(x_t) - \nabla f_{t-1}(\bar{x}_t)\|_2^2 \\
&\le 2D^2 \big(\|\nabla f_t(x_t) - \nabla f_{t-1}(x_t)\|_2^2 + \|\nabla f_{t-1}(x_t) - \nabla f_{t-1}(\bar{x}_t)\|_2^2\big) \\
&\le 2D^2 \sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2 + 2D^2 L^2 \|x_t - \bar{x}_t\|_2^2 \\
&\le 2D^2 \sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2 + 2D^4 L^2 \|p_t - p_{t-1}\|_1^2,
\end{aligned}$$
where the first equality is due to the constructions (14) and (15), and the derivation makes use of the boundedness of the feasible domain, the triangle inequality, and the smoothness of the online functions. The last term will be canceled by the negative term in the meta-regret, and then we obtain the desired gradient-variation regret guarantee.
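As a quick numerical sanity check of the key step (17), the following sketch samples random weight vectors and base predictions satisfying $\|x_{t,i}\|_2 \le D$ (an illustrative assumption) and verifies the stated inequality.

```python
# Numerical sanity check of the decision-variation bound (17), assuming for
# illustration that every base prediction satisfies ||x_{t,i}||_2 <= D.
import numpy as np

rng = np.random.default_rng(3)
N, d, D, trials = 6, 4, 2.0, 10_000

for _ in range(trials):
    X = rng.normal(size=(N, d))
    X *= D * rng.uniform(size=(N, 1)) / np.linalg.norm(X, axis=1, keepdims=True)
    p_prev = rng.dirichlet(np.ones(N))                            # p_{t-1}
    p_curr = rng.dirichlet(np.ones(N))                            # p_t
    lhs = np.linalg.norm((p_curr - p_prev) @ X) ** 2              # ||x_t - x_bar_t||_2^2
    rhs = D ** 2 * np.linalg.norm(p_curr - p_prev, ord=1) ** 2    # D^2 ||p_t - p_{t-1}||_1^2
    assert lhs <= rhs + 1e-12
```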
The following theorem shows that the meta-regret is at most $\mathcal{O}(\sqrt{(1 + V_T) \ln N})$, which is nicely compatible with the attained base-regret. The proof can be found in Section 7.2.

Theorem 2. Under Assumptions 1, 2, and 3, by setting the learning rate of the meta-algorithm (16) optimally as $\varepsilon = \min\{1/(4D^2 L), \sqrt{\ln N / (2D^2 (G^2 + V_T))}\}$, the meta-regret of Sword (Algorithm 1) is at most
$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_{t,i}) \le 2D\sqrt{2(G^2 + V_T) \ln N} + 8D^2 L \ln N = \mathcal{O}\big(\sqrt{(1 + V_T) \ln N}\big).$$

Note that the optimal learning rate tuning of the meta-algorithm requires the knowledge of the gradient variation $V_T = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$. The undesired demand can be removed by the self-confident tuning method (Auer et al., 2002), which employs a time-varying learning rate for the meta-algorithm's update based on internal estimates; roughly, $p_{t+1,i} \propto \exp\big(-\varepsilon_t (\sum_{s=1}^{t} \ell_{s,i} + m_{t+1,i})\big)$, $\forall i \in [N]$, with $\varepsilon_t = \mathcal{O}(1/\sqrt{1 + V_t})$, where $V_t = \sum_{s=2}^{t} \sup_{x \in \mathcal{X}} \|\nabla f_s(x) - \nabla f_{s-1}(x)\|_2^2$ is the internal estimate. Besides, notice that this $V_t$ is actually not easy to calculate due to the computation of the instantaneous variation $\sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$, which is a difference-of-convex-functions program and is not easy to solve even given the explicit forms of the functions $f_t$ and $f_{t-1}$. Fortunately, we can use an alternative twisted quantity $\bar{V}_T = \sum_{t=2}^{T} \|\nabla f_t(x_t) - \nabla f_{t-1}(x_{t-1})\|_2^2$ for the learning rate configuration and also achieve the same regret bound via a refined analysis. Then, it suffices to perform the self-confident tuning over $\bar{V}_T$ by monitoring the corresponding internal estimate $\bar{V}_t = \sum_{s=2}^{t} \|\nabla f_s(x_s) - \nabla f_{s-1}(x_{s-1})\|_2^2$, which avoids the burdensome calculation of inner optimization problems and thereby significantly streamlines the computational effort paid for the adaptive learning rate tuning.

So far, the obtained base-regret bound (Lemma 1) and meta-regret bound (Theorem 2) are both adaptive to the gradient variation, and we can simply combine them to achieve the final gradient-variation dynamic regret as stated in Theorem 3, provided an appropriate candidate step size pool. The proof is provided in Section 7.3.

Theorem 3. Under Assumptions 1, 2, and 3, set the pool of candidate step sizes $\mathcal{H}$ as
$$\mathcal{H} = \Big\{\eta_i = \min\Big\{\frac{1}{4L},\, 2^{i-1} \sqrt{\frac{D^2}{8G^2 T}}\Big\} \;\Big|\; i \in [N]\Big\}, \qquad (18)$$
where $N = \lceil 2^{-1} \log_2 (G^2 T / (2D^2 L^2)) \rceil + 1$ is the number of candidate step sizes; and set the learning rate of the meta-algorithm as $\varepsilon = \min\{1/(4D^2 L), \sqrt{\ln N / (2D^2 (G^2 + V_T))}\}$. Then, Sword satisfies
$$\sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(u_t) \le \mathcal{O}\Big(\sqrt{(1 + P_T + V_T)(1 + P_T)}\Big),$$
which holds for any comparators $u_1, \dots, u_T \in \mathcal{X}$. In the above, $V_T = \sum_{t=2}^{T} \sup_{x \in \mathcal{X}} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$ is the gradient variation, and $P_T = \sum_{t=2}^{T} \|u_{t-1} - u_t\|_2$ is the path length.

Remark 3. Compared with the existing $\mathcal{O}(\sqrt{T(1 + P_T)})$ dynamic regret (Zhang et al., 2018a), our result is more adaptive in the sense that it replaces $T$ by the problem-dependent quantity $P_T + V_T$. Therefore, the bound will be much tighter in easy problems, for example when both $V_T$ and $P_T$ are $o(T)$. Meanwhile, it safeguards the same minimax rate, since