Learning to Act Optimally: Reinforcement Learning
Brian C. Williams
16.410, December 8th, 2003
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU

TD-Gammon [Tesauro, 1995]: Learning Through Reinforcement
Learns to play Backgammon.
States:
• Board configurations (~10^20)
Actions:
• Moves
Rewards:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself.
Currently roughly equal to the best human player.
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
• Summary
• Appendix: Nondeterministic MDPs

Reinforcement Learning Problem
Given: Repeatedly...
• Executed action
• Observed state
• Observed reward
Goal: Learn an action policy π: S → A that maximizes the lifetime reward
  r0 + γ r1 + γ² r2 + ...
from any start state, with discount 0 ≤ γ < 1.
Note:
• Unsupervised learning
• Delayed reward
• Model not known
[Diagram: the agent repeatedly observes state s_t and reward r_t from the environment and executes action a_t, generating the trajectory s0, a0, r0, s1, a1, r1, s2, a2, r2, s3, ...]
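
As a minimal sketch of the objective above (not part of the original slides; function and variable names are illustrative), the discounted lifetime reward for a finite sequence of observed rewards can be computed directly:

```python
# Discounted return r0 + gamma*r1 + gamma^2*r2 + ... for a finite, observed
# reward sequence. Names here are illustrative, not from the slides.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over the observed rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: the agent observes rewards 0, 0, 100 over three steps.
print(discounted_return([0, 0, 100], gamma=0.9))  # 0 + 0 + 0.9**2 * 100 ≈ 81.0
```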
How About Learning Value Functions?
1. Have the agent learn the value function V^π*, denoted V*.
2. The agent selects the optimal action by one-step lookahead on V*:
   π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
Problem:
• This works well if the agent knows the environment model:
  • δ: S × A → S
  • r: S × A → ℝ
• With no model, the agent can't choose an action from V*.

Eliminating the Model with Q Functions
π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
Key idea:
• Define a function like V* that encapsulates δ and r:
  Q(s, a) = r(s, a) + γ V*(δ(s, a))
• Then, if the agent learns Q, it can choose an optimal action without knowing δ or r:
  π*(s) = argmax_a Q(s, a)
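
To illustrate the point above, here is a minimal sketch (the Q-table contents and the state/action names are invented for illustration) of choosing π*(s) = argmax_a Q(s, a) directly from a learned Q table, with no access to δ or r:

```python
# Greedy action selection from a learned Q table: pi*(s) = argmax_a Q(s, a).
# The table layout and the numbers below are illustrative placeholders.

Q = {
    "s1": {"E": 72.9, "S": 81.0},
    "s2": {"E": 81.0, "W": 65.6},
}

def greedy_action(Q, s):
    """Return the action with the highest estimated Q value in state s."""
    return max(Q[s], key=Q[s].get)

print(greedy_action(Q, "s1"))  # 'S' -- chosen from Q alone, no model needed
```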
How Do We Learn Q?
  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
We need to eliminate V* from the update rule.
Note: Q and V* are closely related:
  V*(s) = max_a' Q(s, a')
Substituting Q for V*:
  Q(s_t, a_t) = r(s_t, a_t) + γ max_a' Q(s_{t+1}, a'),   where s_{t+1} = δ(s_t, a_t)

Example – Q-Learning Update (γ = 0.9)
  Q(s1, a_right) ← r(s1, a_right) + γ max_a' Q(s2, a')
                 ← 0 + 0.9 × max{63, 81, 100}
                 ← 90
[Grid diagrams, before and after the update: the arcs leaving s2 carry current estimates 63, 81, and 100; the arc from s1 to s2 is revised from 72 to 90.]
Note: if rewards are non-negative, then
• for all s, a, n:  Q_n(s, a) ≤ Q_{n+1}(s, a)
• for all s, a, n:  0 ≤ Q_n(s, a) ≤ Q(s, a)
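
The single backup above can be checked numerically; this sketch simply reproduces the arithmetic from the example (γ = 0.9, immediate reward 0, successor estimates 63, 81, 100):

```python
# One Q-learning backup for the example above.
gamma = 0.9
r = 0                          # immediate reward for moving right from s1
successor_q = [63, 81, 100]    # current estimates Q(s2, a') for each action a'

q_s1_right = r + gamma * max(successor_q)
print(q_s1_right)              # 90.0, matching the updated entry above
```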
Q-Learning Iterations
• Start at the top-left corner (s1) and move clockwise around the perimeter.
• Initially all values in the Q table are zero; γ = 0.8.
• Update rule: Q(s, a) ← r + γ max_a' Q(s', a')
[Grid diagram: six states, s1, s2, s3 across the top row and s6, s5, s4 across the bottom row; moves entering the goal state (s5, marked G) receive reward 10, all other moves receive reward 0.]
Updated table entries after each pass:
• Pass 1: Q(s1,E) = 0;  Q(s2,E) = 0;  Q(s3,S) = 0;  Q(s4,W) = r + γ max_a' {Q(s5,loop)} = 10 + 0.8 × 0 = 10
• Pass 2: Q(s1,E) = 0;  Q(s2,E) = 0;  Q(s3,S) = r + γ max_a' {Q(s4,W), Q(s4,N)} = 0 + 0.8 × max{10, 0} = 8;  Q(s4,W) = 10
• Pass 3: Q(s1,E) = 0;  Q(s2,E) = r + γ max_a' {Q(s3,W), Q(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4;  Q(s3,S) = 8;  Q(s4,W) = 10

Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially:
• For each s, a, initialize the table entry Q̂(s, a) ← 0
• Observe the current state s
Do forever:
• Select an action a and execute it
• Receive the immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) as follows:
    Q̂(s, a) ← r + γ max_a' Q̂(s', a')
• s ← s'
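
The crib sheet above translates almost line for line into code. The sketch below follows that pseudocode for a deterministic world; the tiny two-state environment at the bottom is invented for illustration (it is not the grid from the slides), and episodes are bounded rather than running "forever":

```python
# Tabular Q-learning for a deterministic world, following the crib sheet.
# The environment is defined by delta (transitions) and reward.
import random

def q_learning(states, actions, delta, reward, gamma=0.8, episodes=100):
    Q = {(s, a): 0.0 for s in states for a in actions(s)}   # Q_hat(s, a) <- 0
    for _ in range(episodes):
        s = random.choice(states)          # observe the current state s
        for _ in range(50):                # bounded stand-in for "do forever"
            a = random.choice(actions(s))  # select an action a and execute it
            r, s_next = reward(s, a), delta(s, a)  # receive r, observe s'
            # Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions(s_next))
            s = s_next                     # s <- s'
    return Q

# Illustrative two-state world: moving East from "start" enters the goal
# (reward 10); the goal only loops on itself with reward 0.
STATES = ["start", "goal"]
def actions(s): return ["E", "stay"] if s == "start" else ["loop"]
def delta(s, a): return "goal" if (s == "start" and a == "E") else s
def reward(s, a): return 10 if (s == "start" and a == "E") else 0

Q = q_learning(STATES, actions, delta, reward, gamma=0.8)
print(Q[("start", "E")])   # 10.0 once ("start", "E") has been tried:
                           # reward 10 plus gamma * 0 from the goal's loop
```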