11 MDP Environment Assumptions
• Markov assumption: the next state and reward are functions only of the current state and action:
  • s_{t+1} = δ(s_t, a_t)
  • r_t = r(s_t, a_t)
• Uncertain and unknown environment: δ and r may be nondeterministic and unknown.
• Today we only consider the deterministic case.

12 MDP Nondeterministic Example
• States: S1 Unemployed, S2 Industry, S3 Grad School, S4 Academia.
• Actions: R (Research), D (Development).
[Figure: state-transition diagram; each action edge is labeled with a transition probability of 0.9, 0.1, or 1.0.]
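In the deterministic case, δ and r can be written as plain lookup tables keyed by (state, action). A minimal Python sketch using the grad-school example's states and actions; the specific transitions and reward values here are illustrative assumptions, not taken from the slide's diagram:

```python
# Hypothetical deterministic MDP: delta and r are lookup tables.
# State names follow the example; rewards are made up for illustration.
delta = {
    ("Unemployed", "D"): "Industry",
    ("Unemployed", "R"): "Grad School",
    ("Grad School", "R"): "Academia",
    ("Grad School", "D"): "Industry",
}
r = {
    ("Unemployed", "D"): 0.0,
    ("Unemployed", "R"): 0.0,
    ("Grad School", "R"): 1.0,
    ("Grad School", "D"): 1.0,
}

def step(state, action):
    """Markov assumption: the next state and reward depend only on
    the current (state, action) pair, nothing earlier."""
    return delta[(state, action)], r[(state, action)]

print(step("Unemployed", "R"))  # -> ('Grad School', 0.0)
```

A nondeterministic environment would replace each table entry with a distribution over next states, which is why δ is harder to learn in general.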
13 MDP Problem: Model
• Agent–environment loop: the agent observes state s_t, takes action a_t, and receives reward r_t, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, …
• Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

14 MDP Problem: Lifetime Reward
• Same agent–environment loop: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, …
• Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
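The agent–environment loop above can be sketched as a rollout that records the trajectory s_0, a_0, r_0, s_1, a_1, r_1, …; the toy counter environment and policy below are assumptions for illustration only:

```python
def rollout(step_fn, policy, s0, horizon):
    """Run the agent-environment loop for `horizon` steps, recording
    the trajectory s0, a0, r0, s1, a1, r1, ..."""
    states, actions, rewards = [s0], [], []
    s = s0
    for _ in range(horizon):
        a = policy(s)            # agent picks an action from the state
        s, rew = step_fn(s, a)   # environment returns next state, reward
        actions.append(a)
        rewards.append(rew)
        states.append(s)
    return states, actions, rewards

def step(s, a):
    """Toy deterministic environment: integer state, action adds +/- 1;
    the reward equals the new state value (purely illustrative)."""
    ns = s + (1 if a == "inc" else -1)
    return ns, float(ns)

states, actions, rewards = rollout(step, lambda s: "inc", 0, 3)
print(states)   # [0, 1, 2, 3]
print(rewards)  # [1.0, 2.0, 3.0]
```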
15 Lifetime Reward
• Finite horizon: rewards accumulate for a fixed period, e.g. $100K + $100K + $100K = $300K.
• Infinite horizon: assume reward accumulates forever: $100K + $100K + … = infinity.
• Discounting: future rewards are not worth as much (a bird in hand …). Introduce a discount factor γ: $100K + γ $100K + γ² $100K + … converges. Will make the math work.

16 MDP Problem: Lifetime Reward
• Agent–environment loop: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, …
• Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r_0 + γ r_1 + γ² r_2 + …
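The discounted lifetime reward V = r_0 + γ r_1 + γ² r_2 + … is just a geometric-weighted sum, and for a constant reward c it converges to c / (1 − γ). A short sketch checking both the finite-horizon sum and convergence under discounting:

```python
def lifetime_reward(rewards, gamma):
    """Discounted sum V = r0 + gamma*r1 + gamma^2*r2 + ...
    over a (finite prefix of a) reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Finite horizon, no discounting: $100K + $100K + $100K = $300K.
print(lifetime_reward([100, 100, 100], 1.0))  # 300

# With gamma = 0.9, a constant $100K stream converges
# toward 100 / (1 - 0.9) = 1000 rather than diverging.
print(lifetime_reward([100] * 1000, 0.9))
```

This is the "will make the math work" point: for 0 ≤ γ < 1 and bounded rewards, the infinite sum is finite, so lifetime reward is well defined even over an infinite horizon.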
17 MDP Problem: Policy
• Agent–environment loop: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, …
• Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r_0 + γ r_1 + γ² r_2 + …

18 MDP Problem: Policy (cont.)
• Assume a deterministic world.
• Policy π : S → A — selects an action for each state.
• Optimal policy π* : S → A — selects the action for each state that maximizes lifetime reward.
[Figure: grid world with goal state G; transitions into G earn reward 10. One panel shows an arbitrary policy π, the other the optimal policy π*.]
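For a tiny deterministic world, the optimal lifetime reward can be found by brute force over all action sequences, which makes the definition of π* concrete. The 1-D world below (states 0–3, absorbing goal at 3, reward 10 for entering it) is an illustrative assumption, not the slide's exact grid:

```python
from itertools import product

def step(s, a):
    """Toy deterministic 1-D world: states 0..3, goal state 3 absorbing.
    Moving into the goal yields reward 10; everything else yields 0."""
    if s == 3:
        return 3, 0.0
    ns = min(3, s + 1) if a == "R" else max(0, s - 1)
    return ns, 10.0 if ns == 3 else 0.0

def best_value(step_fn, actions, s0, gamma, horizon):
    """Brute-force the best discounted lifetime reward over all
    length-`horizon` action sequences (feasible only for tiny MDPs)."""
    best = float("-inf")
    for seq in product(actions, repeat=horizon):
        s, v, g = s0, 0.0, 1.0
        for a in seq:
            s, rew = step_fn(s, a)
            v += g * rew
            g *= gamma
        best = max(best, v)
    return best

# From state 0 the goal is three "R" moves away, so the optimal
# sequence earns the reward at the third step: gamma^2 * 10 = 8.1.
print(best_value(step, ["L", "R"], 0, 0.9, 3))
```

An optimal policy π* achieves this value from every state; later lectures replace this exponential enumeration with value iteration, which reuses the lifetime reward of successor states instead of re-searching them.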