Learning to Act Optimally: Reinforcement Learning
Brian C. Williams
16.410, December 8th, 2003
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU

TD-Gammon [Tesauro, 1995]: Learning Through Reinforcement
Learns to play Backgammon.
States:
• Board configurations (~10^20)
Actions:
• Moves
Rewards:
• +100 if win
• -100 if lose
• 0 for all other states
Trained by playing 1.5 million games against itself.
Currently roughly equal to the best human player.
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
• Summary
• Appendix: Nondeterministic MDPs

Reinforcement Learning Problem
Given: Repeatedly...
• Executed action
• Observed state
• Observed reward
Goal: Learn an action policy π: S → A that maximizes the lifetime reward
  r0 + γ r1 + γ² r2 + ...
from any start state, with discount 0 ≤ γ < 1.
Note:
• Unsupervised learning
• Delayed reward
• Model not known
[Diagram: the agent repeatedly observes state s_t and reward r_t from the environment and executes action a_t, generating the trajectory s0, a0, r0, s1, a1, r1, s2, a2, r2, s3, ...]
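
As a minimal sketch of the objective above (not part of the original slides; function and variable names are illustrative), the discounted lifetime reward for a finite sequence of observed rewards can be computed directly:

```python
# Discounted return r0 + gamma*r1 + gamma^2*r2 + ... for a finite, observed
# reward sequence. Names here are illustrative, not from the slides.

def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t over the observed rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# Example: the agent observes rewards 0, 0, 100 over three steps.
print(discounted_return([0, 0, 100], gamma=0.9))  # 0 + 0 + 0.9**2 * 100 ≈ 81.0
```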
How About Learning Value Functions?
1. Have the agent learn the value function V^π*, denoted V*.
2. The agent selects the optimal action by one-step lookahead on V*:
   π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
Problem:
• This works well if the agent knows the environment model:
  • δ: S × A → S
  • r: S × A → ℝ
• With no model, the agent can't choose an action from V*.

Eliminating the Model with Q Functions
π*(s) = argmax_a [r(s, a) + γ V*(δ(s, a))]
Key idea:
• Define a function like V* that encapsulates δ and r:
  Q(s, a) = r(s, a) + γ V*(δ(s, a))
• Then, if the agent learns Q, it can choose an optimal action without knowing δ or r:
  π*(s) = argmax_a Q(s, a)
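
To illustrate the point above, here is a minimal sketch (the Q-table contents and the state/action names are invented for illustration) of choosing π*(s) = argmax_a Q(s, a) directly from a learned Q table, with no access to δ or r:

```python
# Greedy action selection from a learned Q table: pi*(s) = argmax_a Q(s, a).
# The table layout and the numbers below are illustrative placeholders.

Q = {
    "s1": {"E": 72.9, "S": 81.0},
    "s2": {"E": 81.0, "W": 65.6},
}

def greedy_action(Q, s):
    """Return the action with the highest estimated Q value in state s."""
    return max(Q[s], key=Q[s].get)

print(greedy_action(Q, "s1"))  # 'S' -- chosen from Q alone, no model needed
```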
How Do We Learn Q?
  Q(s_t, a_t) = r(s_t, a_t) + γ V*(δ(s_t, a_t))
We need to eliminate V* from the update rule.
Note: Q and V* are closely related:
  V*(s) = max_a' Q(s, a')
Substituting Q for V*:
  Q(s_t, a_t) = r(s_t, a_t) + γ max_a' Q(s_{t+1}, a'),   where s_{t+1} = δ(s_t, a_t)

Example – Q-Learning Update (γ = 0.9)
  Q(s1, a_right) ← r(s1, a_right) + γ max_a' Q(s2, a')
                 ← 0 + 0.9 × max{63, 81, 100}
                 ← 90
[Grid diagrams, before and after the update: the arcs leaving s2 carry current estimates 63, 81, and 100; the arc from s1 to s2 is revised from 72 to 90.]
Note: if rewards are non-negative, then
• for all s, a, n:  Q_n(s, a) ≤ Q_{n+1}(s, a)
• for all s, a, n:  0 ≤ Q_n(s, a) ≤ Q(s, a)
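
The single backup above can be checked numerically; this sketch simply reproduces the arithmetic from the example (γ = 0.9, immediate reward 0, successor estimates 63, 81, 100):

```python
# One Q-learning backup for the example above.
gamma = 0.9
r = 0                          # immediate reward for moving right from s1
successor_q = [63, 81, 100]    # current estimates Q(s2, a') for each action a'

q_s1_right = r + gamma * max(successor_q)
print(q_s1_right)              # 90.0, matching the updated entry above
```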
Q-Learning Iterations
• Start at the top-left corner (s1) and move clockwise around the perimeter.
• Initially all values in the Q table are zero; γ = 0.8.
• Update rule: Q(s, a) ← r + γ max_a' Q(s', a')
[Grid diagram: six states, s1, s2, s3 across the top row and s6, s5, s4 across the bottom row; moves entering the goal state (s5, marked G) receive reward 10, all other moves receive reward 0.]
Updated table entries after each pass:
• Pass 1: Q(s1,E) = 0;  Q(s2,E) = 0;  Q(s3,S) = 0;  Q(s4,W) = r + γ max_a' {Q(s5,loop)} = 10 + 0.8 × 0 = 10
• Pass 2: Q(s1,E) = 0;  Q(s2,E) = 0;  Q(s3,S) = r + γ max_a' {Q(s4,W), Q(s4,N)} = 0 + 0.8 × max{10, 0} = 8;  Q(s4,W) = 10
• Pass 3: Q(s1,E) = 0;  Q(s2,E) = r + γ max_a' {Q(s3,W), Q(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4;  Q(s3,S) = 8;  Q(s4,W) = 10

Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially:
• For each s, a, initialize the table entry Q̂(s, a) ← 0
• Observe the current state s
Do forever:
• Select an action a and execute it
• Receive the immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) as follows:
    Q̂(s, a) ← r + γ max_a' Q̂(s', a')
• s ← s'
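
The crib sheet above translates almost line for line into code. The sketch below follows that pseudocode for a deterministic world; the tiny two-state environment at the bottom is invented for illustration (it is not the grid from the slides), and episodes are bounded rather than running "forever":

```python
# Tabular Q-learning for a deterministic world, following the crib sheet.
# The environment is defined by delta (transitions) and reward.
import random

def q_learning(states, actions, delta, reward, gamma=0.8, episodes=100):
    Q = {(s, a): 0.0 for s in states for a in actions(s)}   # Q_hat(s, a) <- 0
    for _ in range(episodes):
        s = random.choice(states)          # observe the current state s
        for _ in range(50):                # bounded stand-in for "do forever"
            a = random.choice(actions(s))  # select an action a and execute it
            r, s_next = reward(s, a), delta(s, a)  # receive r, observe s'
            # Q_hat(s, a) <- r + gamma * max_a' Q_hat(s', a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions(s_next))
            s = s_next                     # s <- s'
    return Q

# Illustrative two-state world: moving East from "start" enters the goal
# (reward 10); the goal only loops on itself with reward 0.
STATES = ["start", "goal"]
def actions(s): return ["E", "stay"] if s == "start" else ["loop"]
def delta(s, a): return "goal" if (s == "start" and a == "E") else s
def reward(s, a): return 10 if (s == "start" and a == "E") else 0

Q = q_learning(STATES, actions, delta, reward, gamma=0.8)
print(Q[("start", "E")])   # 10.0 once ("start", "E") has been tried:
                           # reward 10 plus gamma * 0 from the goal's loop
```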