Adaptivity and Non-stationarity: Problem-dependent Dynamic Regret for Online Convex Optimization

Peng Zhao  zhaop@lamda.nju.edu.cn
Yu-Jie Zhang  zhangyj@lamda.nju.edu.cn
Lijun Zhang  zhanglj@lamda.nju.edu.cn
Zhi-Hua Zhou  zhouzh@lamda.nju.edu.cn
National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China

©2021 Peng Zhao, Yu-Jie Zhang, Lijun Zhang, and Zhi-Hua Zhou. License: CC-BY 4.0, see https://creativecommons.org/licenses/by/4.0/

Abstract

We investigate online convex optimization in non-stationary environments and choose the dynamic regret as the performance measure, defined as the difference between the cumulative loss incurred by the online algorithm and that of any feasible comparator sequence. Let $T$ be the time horizon and $P_T$ be the path length that essentially reflects the non-stationarity of the environments; the state-of-the-art dynamic regret bound is $O(\sqrt{T(1 + P_T)})$. Although this bound is proved to be minimax optimal for convex functions, in this paper we demonstrate that it is possible to further enhance the guarantee for some easy problem instances, particularly when the online functions are smooth. Specifically, we introduce novel online algorithms that can exploit smoothness and replace the dependence on $T$ in the dynamic regret with problem-dependent quantities: the variation in gradients of the loss functions, the cumulative loss of the comparator sequence, and the minimum of these two terms. These quantities are at most $O(T)$ while they can be much smaller in benign environments. Therefore, our results are adaptive to the intrinsic difficulty of the problem, since the bounds are tighter than existing results for easy problems and meanwhile guarantee the same rate in the worst case. Notably, our proposed algorithms can achieve favorable dynamic regret with only one gradient per iteration, sharing the same gradient query complexity as static regret minimization methods. To accomplish this, we introduce the framework of collaborative online ensemble. The proposed framework employs a two-layer online ensemble to handle non-stationarity, and uses optimistic online learning and further introduces crucial correction terms to enable effective collaboration within the meta-base two layers, thereby attaining adaptivity. We believe the framework can be useful for broader problems.

Keywords: Online Learning, Online Convex Optimization, Dynamic Regret, Problem-dependent Bounds, Gradient Variation, Optimistic Mirror Descent, Online Ensemble

1. Introduction

In many real-world applications, data are inherently accumulated over time, and thus it is of great importance to develop a learning system that updates in an online fashion. Online Convex Optimization (OCO) is a powerful paradigm for learning in such scenarios, which can be regarded as an iterative game between a player and an adversary. At iteration $t$, the player chooses a decision vector $\mathbf{x}_t$ from a convex set $\mathcal{X} \subseteq \mathbb{R}^d$. Subsequently, the adversary discloses a convex function $f_t : \mathcal{X} \to \mathbb{R}$, and the player incurs a loss denoted by $f_t(\mathbf{x}_t)$.
The standard performance measure is the (static) regret (Zinkevich, 2003),

$$\text{S-Regret}_T = \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \min_{\mathbf{x} \in \mathcal{X}} \sum_{t=1}^{T} f_t(\mathbf{x}), \tag{1}$$

which is the difference between the cumulative loss incurred by the online algorithm and that of the best decision in hindsight. The rationale behind such a metric is that the best fixed decision in hindsight is reasonably good over all the iterations. However, this might be too optimistic and may not hold in changing environments, where data are evolving and the optimal decision is drifting over time. To address this limitation, dynamic regret is proposed to compete with changing comparators $\mathbf{u}_1, \ldots, \mathbf{u}_T \in \mathcal{X}$,

$$\text{D-Regret}_T(\mathbf{u}_1, \ldots, \mathbf{u}_T) = \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \sum_{t=1}^{T} f_t(\mathbf{u}_t), \tag{2}$$

which has drawn considerable attention recently (Zhang et al., 2018a; Zhao et al., 2020b; Cutkosky, 2020; Zhao et al., 2021a; Baby and Wang, 2021). The measure is also called the universal dynamic regret (or general dynamic regret), in the sense that it gives a universal guarantee that holds against any comparator sequence. Note that the static regret (1) can be viewed as its special form by choosing the comparators as the fixed best decision in hindsight. Moreover, a variant that appears frequently in the literature is called the worst-case dynamic regret (Besbes et al., 2015; Jadbabaie et al., 2015; Mokhtari et al., 2016; Yang et al., 2016; Wei et al., 2016; Zhang et al., 2017; Baby and Wang, 2019; Yuan and Lamperski, 2020; Zhao et al., 2020a; Zhang et al., 2020a,b; Zhao and Zhang, 2021), defined as

$$\text{D-Regret}_T(\mathbf{x}_1^*, \ldots, \mathbf{x}_T^*) = \sum_{t=1}^{T} f_t(\mathbf{x}_t) - \sum_{t=1}^{T} f_t(\mathbf{x}_t^*), \tag{3}$$

which specializes the general form (2) with comparators $\mathbf{u}_t = \mathbf{x}_t^* \in \arg\min_{\mathbf{x} \in \mathcal{X}} f_t(\mathbf{x})$. Therefore, the universal dynamic regret is very general and includes the static regret (1) and the worst-case dynamic regret (3) as special cases under different instantiations of the comparators.

We further remark that the worst-case dynamic regret is often too pessimistic, whereas the universal one is more adaptive to non-stationary environments. Actually, the changes of online functions usually come from two aspects: one is the sampling randomness and the other is the environmental non-stationarity, and clearly the latter is the main focus of non-stationary online learning. Optimizing the worst-case dynamic regret can be problematic in some cases. For example, consider the stochastic optimization task where the $f_t$'s are independently sampled from the same distribution; minimizing the worst-case dynamic regret is then evidently inappropriate and will eventually lead to overfitting (Zhang et al., 2018a), because the minimizer of an individual online loss function can be dramatically different from the minimizer of the expected loss function due to the sampling randomness.
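To make the three regret notions above concrete, the following minimal sketch (our own illustration, not an algorithm from this paper) runs projected online gradient descent on a toy one-dimensional problem with drifting quadratic losses and then evaluates the static regret (1), the universal dynamic regret (2) against one arbitrary comparator sequence, and the worst-case dynamic regret (3). The drift process, step size, and comparator choice are hypothetical and made only for this example.

```python
import numpy as np

# Toy drifting targets: f_t(x) = (x - theta_t)^2 on the interval X = [-1, 1].
rng = np.random.default_rng(0)
T = 1000
theta = np.clip(np.cumsum(0.02 * rng.standard_normal(T)), -1.0, 1.0)

def loss(x, t):
    return (x - theta[t]) ** 2

def grad(x, t):
    return 2.0 * (x - theta[t])

def project(x, lo=-1.0, hi=1.0):
    """Euclidean projection onto the interval [lo, hi]."""
    return float(np.clip(x, lo, hi))

# Online gradient descent as the player (step size eta = 1/sqrt(T)).
x, eta = 0.0, 1.0 / np.sqrt(T)
player_loss = np.empty(T)
for t in range(T):
    player_loss[t] = loss(x, t)          # adversary reveals f_t; player pays f_t(x_t)
    x = project(x - eta * grad(x, t))    # gradient step followed by projection
cum_player = player_loss.sum()

# (1) Static regret: compare with the best fixed decision in hindsight.
# For squared losses the offline minimizer is the mean of theta (feasible here).
best_fixed = ((theta.mean() - theta) ** 2).sum()
static_regret = cum_player - best_fixed

# (3) Worst-case dynamic regret: comparators are the per-round minimizers
# x_t^* = theta_t, whose cumulative loss is zero in this toy instance.
worst_case_dynamic = cum_player - 0.0

# (2) Universal dynamic regret against an arbitrary comparator sequence
# u_1, ..., u_T (here a piecewise-constant sequence, chosen only for illustration).
u = np.repeat(theta[::100], 100)[:T]
universal_dynamic = cum_player - ((u - theta) ** 2).sum()

print(static_regret, universal_dynamic, worst_case_dynamic)
```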
There are many studies on the worst-case dynamic regret (Besbes et al., 2015; Jadbabaie et al., 2015; Mokhtari et al., 2016; Yang et al., 2016; Zhang et al., 2017, 2018b; Baby and Wang, 2019; Zhang et al., 2020b; Zhao and Zhang, 2021), but only a few results are known for the universal dynamic regret. Zinkevich (2003) shows that online gradient descent (OGD) with a step size $\eta > 0$ achieves an $O((1 + P_T)/\eta + \eta T)$ universal dynamic regret, where $P_T = \sum_{t=2}^{T} \|\mathbf{u}_{t-1} - \mathbf{u}_t\|_2$ is the path length of the comparators $\mathbf{u}_1, \ldots, \mathbf{u}_T$ and thus reflects the non-stationarity of the environments. If the path length $P_T$ were known in advance, one could choose the optimal step size $\eta_* = \Theta(\sqrt{(1 + P_T)/T})$ and attain an $O(\sqrt{T(1 + P_T)})$ dynamic regret. However, this path-length quantity is hard to know, since the universal dynamic regret aims to provide guarantees against any feasible comparator sequence. The step size $\eta = \Theta(1/\sqrt{T})$ commonly used in static regret minimization would lead to an inferior $O(\sqrt{T}(1 + P_T))$ bound, which exhibits a large gap from the favorable bound with oracle step size tuning. Zhang et al. (2018a) resolve the issue by proposing a novel online algorithm to search for the optimal step size $\eta_*$, attaining an $O(\sqrt{T(1 + P_T)})$ universal dynamic regret, and they also establish an $\Omega(\sqrt{T(1 + P_T)})$ lower bound to show the minimax optimality.

Although the rate is minimax optimal for convex functions, we would like to design algorithms with more adaptive bounds. Specifically, we aim to enhance the guarantee for easy problem instances, particularly when the online functions are smooth, by replacing the dependence on $T$ with problem-dependent quantities that are $O(T)$ in the worst case while can be much smaller in benign environments. In the study of static regret, one can attain such results as small-loss bounds (Srebro et al., 2010) and gradient-variation bounds (Chiang et al., 2012). Thus, a natural question arises: is it possible to achieve similar problem-dependent guarantees for the universal dynamic regret?

Our results. In this paper, extending our preliminary work (Zhao et al., 2020b), we provide an affirmative answer by designing online algorithms with problem-dependent dynamic regret bounds. Specifically, we focus on the following two adaptive quantities: the gradient variation of the online functions $V_T$, and the cumulative loss of the comparator sequence $F_T$, defined as

$$V_T = \sum_{t=2}^{T} \sup_{\mathbf{x} \in \mathcal{X}} \|\nabla f_t(\mathbf{x}) - \nabla f_{t-1}(\mathbf{x})\|_2^2, \quad \text{and} \quad F_T = \sum_{t=1}^{T} f_t(\mathbf{u}_t). \tag{4}$$

The two problem-dependent quantities are both at most $O(T)$ under standard assumptions of online learning, while they can be much smaller in easier problem instances. We propose two novel online algorithms called Sword and Sword++ ("Sword" is short for Smoothness-aware online learning with dynamic regret) that are suitable for different feedback models. Our algorithms are online ensemble methods (Zhou, 2012; Zhao, 2021), which admit a two-layer structure with a meta-algorithm running over a group of base-learners. We prove that they enjoy an $O(\sqrt{(1 + P_T + \min\{V_T, F_T\})(1 + P_T)})$ dynamic regret, achieving gradient-variation and small-loss bounds simultaneously. Compared to the $O(\sqrt{T(1 + P_T)})$ minimax rate, our result replaces the dependence on $T$ with the problem-dependent quantity $P_T + \min\{V_T, F_T\}$. Our bounds become much tighter when the problem is easy (for example, when $P_T$ and $V_T$/$F_T$ are sublinear in $T$), and meanwhile safeguard the same guarantee in the worst case. Hence, our results are adaptive to the intrinsic difficulty of problem instances as well as the non-stationarity of environments.
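For intuition about the quantities in (4), the sketch below (again our own illustration) computes the path length $P_T$, the gradient variation $V_T$, and the comparator loss $F_T$ for a toy sequence of quadratic losses; the supremum over $\mathcal{X}$ in $V_T$ is approximated by a maximum over a finite grid, which is a simplification rather than the exact definition.

```python
import numpy as np

def path_length(U):
    """P_T = sum_{t>=2} ||u_{t-1} - u_t||_2 for a comparator sequence U (T x d)."""
    return np.linalg.norm(np.diff(U, axis=0), axis=1).sum()

def gradient_variation(grads, grid):
    """V_T = sum_{t>=2} sup_{x in X} ||grad f_t(x) - grad f_{t-1}(x)||_2^2.
    `grads` is a list of callables x -> gradient; the sup over X is
    approximated by a maximum over the finite `grid` (an approximation)."""
    total = 0.0
    for t in range(1, len(grads)):
        diffs = [np.linalg.norm(grads[t](x) - grads[t - 1](x)) ** 2 for x in grid]
        total += max(diffs)
    return total

def comparator_loss(losses, U):
    """F_T = sum_t f_t(u_t)."""
    return sum(f(u) for f, u in zip(losses, U))

# Example: drifting quadratics f_t(x) = (x - theta_t)^2 on X = [-1, 1].
rng = np.random.default_rng(0)
T = 200
theta = np.clip(np.cumsum(0.02 * rng.standard_normal(T)), -1.0, 1.0)
losses = [lambda x, th=th: float(np.sum((x - th) ** 2)) for th in theta]
grads = [lambda x, th=th: np.atleast_1d(2.0 * (x - th)) for th in theta]

grid = np.linspace(-1.0, 1.0, 201).reshape(-1, 1)   # finite proxy for X
U = theta.reshape(-1, 1)                            # comparators u_t = theta_t

print("P_T =", path_length(U))                      # non-stationarity of the comparators
print("V_T =", gradient_variation(grads, grid))     # variation of consecutive gradients
print("F_T =", comparator_loss(losses, U))          # zero for this choice of u_t
```

Under bounded gradients and bounded non-negative losses, each summand in (4) is bounded by a constant, which is why both quantities are at most $O(T)$; for slowly drifting instances such as this one they are far smaller.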
Our first algorithm, Sword, achieves the favorable problem-dependent guarantees under the multi-gradient feedback model, namely, the player can query gradient information multiple times at each round. This algorithm is conceptually simple, but its gradient query complexity is $O(\log T)$ at each round. Our second algorithm, Sword++, is an improved version that requires only one gradient per iteration, despite using a two-layer online ensemble structure.
As a result, Sword++ is not only computationally efficient but also more attractive due to its reduced feedback requirements: Sword++ can be applied to the more challenging one-gradient feedback model, where the player only receives the gradient $\nabla f_t(\mathbf{x}_t)$ as feedback after submitting the decision $\mathbf{x}_t$. Furthermore, it is worth mentioning that Sword++ has the potential to be extended to more constrained feedback models, such as two-point bandit convex optimization, by further leveraging the technique for gradient-variation static regret minimization presented in (Chiang et al., 2013).

Technical contributions. Note that there exist studies showing that the worst-case dynamic regret can benefit from smoothness (Yang et al., 2016; Zhang et al., 2017; Zhao and Zhang, 2021). However, their analyses do not apply to our universal dynamic regret, as we cannot exploit the optimality condition of the comparators $\mathbf{u}_1, \ldots, \mathbf{u}_T$, in stark contrast with the worst-case dynamic regret analysis. As a result, we propose an adaptive online ensemble method to hedge non-stationarity while extracting adaptivity. Specifically, we employ the meta-base two-layer ensemble to hedge the non-stationarity and utilize optimistic online learning to reuse the historical gradient information adaptively. Two crucial novel ingredients are designed in order to achieve favorable problem-dependent guarantees.

• We introduce optimistic mirror descent (OMD) as a unified building block for the algorithm design of dynamic regret minimization at both the meta and base levels. We present a generic and completely modular analysis for the dynamic regret of OMD, where the negative term is essential for achieving problem-dependent dynamic regret. (An illustrative sketch of the optimistic update is given at the end of this section.)

• We propose the collaborative online ensemble framework. In addition to employing optimistic online learning to attain adaptivity and the meta-base structure to hedge non-stationarity, we incorporate a novel decision-deviation correction term, which facilitates effective collaboration within the two layers and is crucial for achieving the desired problem-dependent bound while requiring only one gradient per iteration.

We emphasize that these ingredients are particularly important for achieving the gradient-variation dynamic regret, which we will demonstrate to be more fundamental than the small-loss bound. In particular, our overall solution (especially Sword++) effectively utilizes negative terms and introduces correction terms to ensure successful collaboration within the two-layer online ensemble. The overall framework of collaborative online ensemble is summarized in Section 5, and we believe that the proposed framework has the potential for broader online learning problems.

Organization. The rest of the paper is structured as follows. Section 2 briefly reviews the related work. In Section 3, we introduce the problem setup and the algorithmic framework, where a generic dynamic regret analysis of optimistic mirror descent is provided. Section 4 presents our main results, in which the gradient-variation dynamic regret bounds are established. Section 5 illustrates a generic framework called collaborative online ensemble that is highly useful for attaining problem-dependent dynamic regret. Section 6 provides some additional results. The major proofs are presented in Section 7. Furthermore, Section 8 reports experiments that empirically support our theoretical findings. Finally, we conclude the paper in Section 9. Some omitted details and proofs are provided in the appendix.
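As a concrete picture of the optimistic online learning building block mentioned in the first bullet above, the following sketch gives the Euclidean special case of optimistic mirror descent (optimistic online gradient descent) that reuses the most recent gradient as the hint $M_t$. It is only a schematic under our own toy setup: the paper's Sword and Sword++ algorithms build on this update with a meta-base ensemble and correction terms, which are not shown here.

```python
import numpy as np

def project(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

def optimistic_ogd(grad_fns, d, eta=0.1, radius=1.0):
    """Euclidean optimistic mirror descent (optimistic OGD).

    At round t the learner plays x_t obtained from the auxiliary point
    x_hat_t and an optimistic hint M_t (here: the previous gradient),
    then updates x_hat with the observed gradient g_t. Only one gradient
    is queried per round.
    """
    x_hat = np.zeros(d)
    hint = np.zeros(d)                      # M_1 = 0
    decisions = []
    for grad in grad_fns:
        x_t = project(x_hat - eta * hint)   # optimistic step with hint M_t
        g_t = grad(x_t)                     # single gradient query at x_t
        x_hat = project(x_hat - eta * g_t)  # standard mirror-descent update
        hint = g_t                          # reuse last gradient as next hint
        decisions.append(x_t)
    return decisions

# Usage on slowly drifting quadratics f_t(x) = ||x - c_t||^2 / 2 (toy example).
rng = np.random.default_rng(1)
T, d = 500, 5
centers = np.cumsum(0.01 * rng.standard_normal((T, d)), axis=0)
grad_fns = [lambda x, c=c: x - c for c in centers]
xs = optimistic_ogd(grad_fns, d)
```

When consecutive gradients are close, as in slowly changing environments, the hint is accurate and each step pays little, which is the intuition behind gradient-variation bounds.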
2. Related Work

In this section, we present a brief review of static regret and dynamic regret minimization for online convex optimization.

2.1 Static Regret

Static regret has been extensively studied in online convex optimization. Let $T$ be the time horizon and $d$ be the dimension; there exist online algorithms with static regret bounded by $O(\sqrt{T})$, $O(d \log T)$, and $O(\log T)$ for convex, exponentially concave, and strongly convex functions, respectively (Zinkevich, 2003; Hazan et al., 2007). These results are proved to be minimax optimal (Abernethy et al., 2008). More results can be found in the seminal books (Shalev-Shwartz, 2012; Hazan, 2016) and the references therein.

In addition to exploiting the convexity of functions, there are studies improving the static regret by incorporating smoothness, whose main proposal is to replace the dependence on $T$ by problem-dependent quantities. Such problem-dependent bounds enjoy many benign properties; in particular, they safeguard the worst-case minimax rate yet can be much tighter in easier problem instances. There are usually two kinds of such bounds: small-loss bounds (Srebro et al., 2010) and gradient-variation bounds (Chiang et al., 2012).

Small-loss bounds are first introduced in the context of prediction with expert advice (Littlestone and Warmuth, 1994; Freund and Schapire, 1997), which replace the dependence on $T$ by the cumulative loss of the best expert. Later, Srebro et al. (2010) show that in the online convex optimization setting, OGD with a certain step size scheme can achieve an $O(\sqrt{1 + F_T^*})$ small-loss regret bound when the online convex functions are smooth and non-negative, where $F_T^*$ is the cumulative loss of the best decision in hindsight, namely, $F_T^* = \sum_{t=1}^{T} f_t(\mathbf{x}^*)$ with $\mathbf{x}^*$ chosen as the offline minimizer. The key technical ingredient in the analysis is to exploit the self-bounding property of smooth functions.

Gradient-variation bounds are introduced by Chiang et al. (2012), rooted in the development of second-order bounds for prediction with expert advice (Cesa-Bianchi et al., 2005) and online convex optimization (Hazan and Kale, 2008). For convex and smooth functions, Chiang et al. (2012) establish an $O(\sqrt{1 + V_T})$ gradient-variation regret bound, where $V_T = \sum_{t=2}^{T} \sup_{\mathbf{x} \in \mathcal{X}} \|\nabla f_t(\mathbf{x}) - \nabla f_{t-1}(\mathbf{x})\|_2^2$ measures the cumulative gradient variation. Gradient-variation bounds are particularly favored in slowly changing environments where the online functions evolve gradually.

In addition, problem-dependent static regret bounds are also studied in the bandit online learning setting, including gradient-variation bounds for two-point bandit convex optimization (Chiang et al., 2013), as well as small-loss bounds for multi-armed bandits (Allenberg et al., 2006; Wei and Luo, 2018; Lee et al., 2020b), linear bandits (Lee et al., 2020b), semi-bandits (Neu, 2015), graph bandits (Lykouris et al., 2018; Lee et al., 2020a), and contextual bandits (Allen-Zhu et al., 2018; Foster and Krishnamurthy, 2021), etc.

2.2 Dynamic Regret

Dynamic regret enforces the player to compete with time-varying comparators and thus is favored in online learning in open and non-stationary environments (Sugiyama and Kawanabe, 2012; Zhao et al., 2021b; Zhou, 2022). The notion of dynamic regret is sometimes referred