11 MDP Environment Assumptions
• Markov assumption: the next state and reward are functions only of the current state and action:
  • s_{t+1} = δ(s_t, a_t)
  • r_t = r(s_t, a_t)
• Uncertain and unknown environment: δ and r may be nondeterministic and unknown.
• Today we only consider the deterministic case.

12 MDP Nondeterministic Example
• States: S1 Unemployed, S2 Industry, S3 Grad School, S4 Academia.
• Actions: R (Research), D (Development).
[Figure: state-transition diagram; each action edge is labeled with a transition probability of 0.9, 0.1, or 1.0.]
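In the deterministic case, δ and r can be written as plain lookup tables keyed by (state, action). A minimal Python sketch using the grad-school example's states and actions; the specific transitions and reward values here are illustrative assumptions, not taken from the slide's diagram:

```python
# Hypothetical deterministic MDP: delta and r are lookup tables.
# State names follow the example; rewards are made up for illustration.
delta = {
    ("Unemployed", "D"): "Industry",
    ("Unemployed", "R"): "Grad School",
    ("Grad School", "R"): "Academia",
    ("Grad School", "D"): "Industry",
}
r = {
    ("Unemployed", "D"): 0.0,
    ("Unemployed", "R"): 0.0,
    ("Grad School", "R"): 1.0,
    ("Grad School", "D"): 1.0,
}

def step(state, action):
    """Markov assumption: the next state and reward depend only on
    the current (state, action) pair, nothing earlier."""
    return delta[(state, action)], r[(state, action)]

print(step("Unemployed", "R"))  # -> ('Grad School', 0.0)
```

A nondeterministic environment would replace each table entry with a distribution over next states, which is why δ is harder to learn in general.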
13 MDP Problem: Model
• Agent–environment loop: the agent observes state s_t, takes action a_t, and receives reward r_t, producing the sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, …
• Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.

14 MDP Problem: Lifetime Reward
• Same agent–environment loop: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, …
• Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward.
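The agent–environment loop above can be sketched as a rollout that records the trajectory s_0, a_0, r_0, s_1, a_1, r_1, …; the toy counter environment and policy below are assumptions for illustration only:

```python
def rollout(step_fn, policy, s0, horizon):
    """Run the agent-environment loop for `horizon` steps, recording
    the trajectory s0, a0, r0, s1, a1, r1, ..."""
    states, actions, rewards = [s0], [], []
    s = s0
    for _ in range(horizon):
        a = policy(s)            # agent picks an action from the state
        s, rew = step_fn(s, a)   # environment returns next state, reward
        actions.append(a)
        rewards.append(rew)
        states.append(s)
    return states, actions, rewards

def step(s, a):
    """Toy deterministic environment: integer state, action adds +/- 1;
    the reward equals the new state value (purely illustrative)."""
    ns = s + (1 if a == "inc" else -1)
    return ns, float(ns)

states, actions, rewards = rollout(step, lambda s: "inc", 0, 3)
print(states)   # [0, 1, 2, 3]
print(rewards)  # [1.0, 2.0, 3.0]
```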
15 Lifetime Reward
• Finite horizon: rewards accumulate for a fixed period, e.g. $100K + $100K + $100K = $300K.
• Infinite horizon: assume reward accumulates forever: $100K + $100K + … = infinity.
• Discounting: future rewards are not worth as much (a bird in hand …). Introduce a discount factor γ: $100K + γ $100K + γ² $100K + … converges. Will make the math work.

16 MDP Problem: Lifetime Reward
• Agent–environment loop: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, …
• Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r_0 + γ r_1 + γ² r_2 + …
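The discounted lifetime reward V = r_0 + γ r_1 + γ² r_2 + … is just a geometric-weighted sum, and for a constant reward c it converges to c / (1 − γ). A short sketch checking both the finite-horizon sum and convergence under discounting:

```python
def lifetime_reward(rewards, gamma):
    """Discounted sum V = r0 + gamma*r1 + gamma^2*r2 + ...
    over a (finite prefix of a) reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Finite horizon, no discounting: $100K + $100K + $100K = $300K.
print(lifetime_reward([100, 100, 100], 1.0))  # 300

# With gamma = 0.9, a constant $100K stream converges
# toward 100 / (1 - 0.9) = 1000 rather than diverging.
print(lifetime_reward([100] * 1000, 0.9))
```

This is the "will make the math work" point: for 0 ≤ γ < 1 and bounded rewards, the infinite sum is finite, so lifetime reward is well defined even over an infinite horizon.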
17 MDP Problem: Policy
• Agent–environment loop: s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, s_3, …
• Given an environment model as an MDP, create a policy for acting that maximizes lifetime reward V = r_0 + γ r_1 + γ² r_2 + …

18 MDP Problem: Policy (cont.)
• Assume a deterministic world.
• Policy π : S → A — selects an action for each state.
• Optimal policy π* : S → A — selects the action for each state that maximizes lifetime reward.
[Figure: grid world with goal state G; transitions into G earn reward 10. One panel shows an arbitrary policy π, the other the optimal policy π*.]
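For a tiny deterministic world, the optimal lifetime reward can be found by brute force over all action sequences, which makes the definition of π* concrete. The 1-D world below (states 0–3, absorbing goal at 3, reward 10 for entering it) is an illustrative assumption, not the slide's exact grid:

```python
from itertools import product

def step(s, a):
    """Toy deterministic 1-D world: states 0..3, goal state 3 absorbing.
    Moving into the goal yields reward 10; everything else yields 0."""
    if s == 3:
        return 3, 0.0
    ns = min(3, s + 1) if a == "R" else max(0, s - 1)
    return ns, 10.0 if ns == 3 else 0.0

def best_value(step_fn, actions, s0, gamma, horizon):
    """Brute-force the best discounted lifetime reward over all
    length-`horizon` action sequences (feasible only for tiny MDPs)."""
    best = float("-inf")
    for seq in product(actions, repeat=horizon):
        s, v, g = s0, 0.0, 1.0
        for a in seq:
            s, rew = step_fn(s, a)
            v += g * rew
            g *= gamma
        best = max(best, v)
    return best

# From state 0 the goal is three "R" moves away, so the optimal
# sequence earns the reward at the third step: gamma^2 * 10 = 8.1.
print(best_value(step, ["L", "R"], 0, 0.9, 3))
```

An optimal policy π* achieves this value from every state; later lectures replace this exponential enumeration with value iteration, which reuses the lifetime reward of successor states instead of re-searching them.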