NJUAT 南京大学 人工智能学院 SCHOOL OF ARTFICIAL INTELUGENCE,NANJING UNFVERSITY Lecture 12.stochastic bandits Advanced Optimization(Fall 2023) Peng Zhao zhaop@lamda.nju.edu.cn Nanjing University
Lecture 12. Stochastic Bandits Peng Zhao zhaop@lamda.nju.edu.cn Nanjing University Advanced Optimization (Fall 2023)
Outline ·Multi-Armed Bandits Explore-Then-Exploit Upper Confidence Bound ·Linear Bandits ·LinUCB Algorithm Generalized Linear Bandits ·Advanced Topics Advanced Optimization(Fall 2023) Lecture 12.Stochastic Bandits 2
Advanced Optimization (Fall 2023) Lecture 12. Stochastic Bandits 2 Outline • Multi-Armed Bandits • Explore-Then-Exploit • Upper Confidence Bound • Linear Bandits • LinUCB Algorithm • Generalized Linear Bandits • Advanced Topics
Stochastic Multi-Armed Bandit (MAB) MAB:A player is facing K arms.At each time t,the player pulls one arm a∈[K]and then receives a reward rt(a)∈[O,l: Arm1 1(1) r2(1)】 0.6 r4(1) T5(1) Arm2 1 r2(2) r3(2) 0.2 r5(2) Arm3 r1(3) 0.7 r3(3) r4(3) 0.3 ●Stochastic: Each arm aEK]has an unknown distribution Da with mean u(a), such that rewards ri(a),r2(a),...,rT(a)are i.i.d samples from Da. Advanced Optimization(Fall 2023) Lecture 12.Stochastic Bandits 3
Advanced Optimization (Fall 2023) Lecture 12. Stochastic Bandits 3 Stochastic Multi-Armed Bandit (MAB) Arm 1 Arm 2 Arm 3 0.6 0.7 0.3 1 0.2
Stochastic MAB:Formulation At each round t=l,2,·… (1)the player first chooses an arm at E[K]; (2)and then environment reveals a reward rt(at)e[0,1]; (3)the player updates the model by the pair (at,rt(at)). ·The goal is to minimize the(pseudo小-regret: E[Regretr a)-rdan)) where a*=arg maxeK](a)is the best arm in the sense of expectation. Advanced Optimization(Fall 2023) Lecture 12.Stochastic Bandits 4
Advanced Optimization (Fall 2023) Lecture 12. Stochastic Bandits 4 Stochastic MAB: Formulation • The goal is to minimize the (pseudo)-regret:
Deploying Exp3 to Stochastic MAB Stochastic MAB is a special case of Adversarial MAB Directly deploying Exp3 for stochastic MAB achieves Theorem1.Suppose that∀t∈[T]andi∈[K],0≤Lt()≤l,then Exp3with learning raten=(In K)/(TK)guarantees EfRegretr=E) where the expectation is taken on the randomness of the algorithm. Not yet exploit benign stochastic assumption....instance-dependent analysis Advanced Optimization(Fall 2023) Lecture 12.Stochastic Bandits 5
Advanced Optimization (Fall 2023) Lecture 12. Stochastic Bandits 5 • Stochastic MAB is a special case of Adversarial MAB Deploying Exp3 to Stochastic MAB Directly deploying Exp3 for stochastic MAB achieves Not yet exploit benign stochastic assumption…. instance-dependent analysis