Regret Decomposition For stochastic MAB,a natural characterization of the arms: (i)Suboptimality gap:Aa=u(a*)-u(a); (ii)Number of times arm a is pulled int rounds:n(a)=>1=a). Regret can be reformulated as no-立=oi-a T E[Regretr)max E =】 ∑(u(a*)-μ(at)nr(a)=∑△anr(a aE[K] a∈[K] Advanced Optimization(Fall 2023) Lecture 12.Stochastic Bandits 6
Advanced Optimization (Fall 2023) Lecture 12. Stochastic Bandits 6 Regret Decomposition • For stochastic MAB, a natural characterization of the arms: • Regret can be reformulated as
A Natural Solution Explore-then-Exploit(ETE): (1)Do explore for the first To round by pulling each arm for To/K times; (2)Do exploit for the rest T-To round by always pulling a=arg maxael]r(a). Theorem1.Suppose that∀t∈[T]anda∈[K],0≤rt(a)≤l,then ETE with exploration period To guarantees eets(段+Tep()a a∈[K1 Advanced Optimization(Fall 2023) Lecture 12.Stochastic Bandits 7
Advanced Optimization (Fall 2023) Lecture 12. Stochastic Bandits 7 A Natural Solution • Explore-then-Exploit (ETE):
Proof of ETE Regret Bound Proof. E[Regretr]=∑△anr(a) aE[K] Exploration Exploitation nT(a)=To/K+(T-To)Pra=a} ≤T/K+(T-To)Pr{m(a)≥(a*)} orm.(a)≤u@ta≤m.(a) 2 snK+r-xwr{ao≥@专aua.o)≤@2"a} 2 snK+T-万(r{.o之uoa}+r{≤专a》 Union bound Pr{XUY}<Pr{X}+Pr{Y} Advanced Optimization(Fall 2023) Lecture 12.Stochastic Bandits 8
Advanced Optimization (Fall 2023) Lecture 12. Stochastic Bandits 8 Proof of ETE Regret Bound Proof.Exploration Exploitation
Proof of ETE Regret Bound Pof间s五/x+T-(r{ma之ota}+pr{aa)sata}》 Hoeffding's inequality.forX&∈[0,l,i∈m,X-a∑1X,we have Pr{x-E[X]≥e}≤exp(-2me2): Pr{X-E[X]≤-c}≤exp(-2me2). →n{a≥o2a}-nmo≥n0+As 2T△8 K △a=4(a*)-(a) 今g4@s(爱+7m(x)》 a∈[K Advanced Optimization(Fall 2023) Lecture 12.Stochastic Bandits 9
Advanced Optimization (Fall 2023) Lecture 12. Stochastic Bandits 9 Proof of ETE Regret Bound Proof
Issue of ETE Theorem1.Suppose that∀t∈[T]anda∈[K],0≤rt(a)≤l,then ETE with explore period To guarantees R≤(假+Tem()a a∈[K ·Need to tune To Tune To with prior of suboptimality gap A:ERegret]=(T) Tune To without prior of suboptimality gap A:E[Regret=O(T2/3) >Solution:do explore and exploit adaptively Advanced Optimization(Fall 2023) Lecture 12.Stochastic Bandits 10
Advanced Optimization (Fall 2023) Lecture 12. Stochastic Bandits 10 Issue of ETE • Need to tune Solution: do explore and exploit adaptively