Partially Observable Markov Decision Processes

• Additional reading:
  – Leslie Pack Kaelbling, Michael L. Littman and Anthony R. Cassandra, "Planning and Acting in Partially Observable Stochastic Domains," Artificial Intelligence, Vol. 101, 1998.
  – J. Pineau, G. Gordon and S. Thrun, "Point-based value iteration: An anytime algorithm for POMDPs," International Joint Conference on Artificial Intelligence (IJCAI), Acapulco, Mexico, Aug. 2003.

Issues

• Problem statement:
  – If we do not know the current state of the world, what should we do to act optimally?
• Inputs (see the sketch after this list):
  – Model of states, actions, observations, transition and emission functions, reward function, initial distribution, and discount factor
• Outputs from different algorithms:
  – Complete/partial, exact/approximate value function
  – Complete/partial, exact/approximate finite state machine
• Choices:
  – Policy vs. plan
  – Exact vs. approximate
  – Representation
  – Discrete vs. continuous states, actions, observations
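As a concrete illustration of the inputs listed above, the sketch below bundles a toy discrete POMDP in Python. The sizes and numbers are invented for illustration, and the array layout (T[a, s, s'], O[a, s', z], R[s, a]) is an assumption made here, not anything prescribed by the slides.

import numpy as np

# Toy POMDP with 2 states, 2 actions, 2 observations (numbers are made up).
# Layout assumptions: T[a, s, s'] = P(s' | s, a), O[a, s', z] = P(z | s', a),
# R[s, a] = immediate reward.
T = np.array([[[0.9, 0.1],
               [0.1, 0.9]],      # action 0: state mostly stays put
              [[0.5, 0.5],
               [0.5, 0.5]]])     # action 1: state becomes uncertain

O = np.array([[[0.8, 0.2],
               [0.2, 0.8]],      # action 0: informative observations
              [[0.5, 0.5],
               [0.5, 0.5]]])     # action 1: uninformative observations

R = np.array([[ 1.0, -0.5],
              [-1.0,  2.0]])

b0 = np.array([0.5, 0.5])        # initial state distribution
gamma = 0.95                     # discount factor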
Graphical Models (Sondik, 1971)

(Figure: two-slice graphical model. Hidden: states s1, s2 with transition model T(sj | ai, si) and rewards R1, R2. Observable: actions a1, observations z1, z2 with observation model O(zj | si), and beliefs b1, b2.)

Reliable Navigation

• Conventional trajectories may not be robust to localization error
(Figure: map showing estimated robot position, true robot position, and goal position.)
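The transition model T(sj | ai, si) and observation model O(zj | si) in the graphical model above are exactly what is needed to maintain the belief over the hidden state. The sketch below is a standard Bayes-filter belief update, written against the array layout assumed in the earlier toy model (T[a, s, s'], O[a, s', z]); it is an illustration, not code from the slides.

import numpy as np

def belief_update(b, a, z, T, O):
    # Prediction step: P(s' | b, a) = sum_s T(s' | a, s) b(s)
    predicted = T[a].T @ b
    # Correction step: weight by the likelihood O(z | s', a), then normalize.
    unnormalized = O[a][:, z] * predicted
    return unnormalized / unnormalized.sum()

# Example with the toy arrays defined earlier: after taking action 0 and
# seeing observation 1, the belief shifts toward state s2.
# b1 = belief_update(b0, a=0, z=1, T=T, O=O)   # -> approx [0.2, 0.8]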
Control Models

• Markov Decision Processes
(Figure: world state → probabilistic perception model P(x) → control; the controller acts on argmax P(x).)
• Partially Observable Markov Decision Processes
(Figure: world state → probabilistic perception model P(x) → control; the controller acts on the full distribution P(x).)

Navigation as a POMDP

• State space: the state is hidden from the controller
• Controller and environment interact only through actions and observations
• Controller chooses actions based on probability distributions (see the sketch below)
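To make the distinction concrete, the sketch below contrasts the two controllers suggested by the diagrams: an MDP-style controller that collapses P(x) to its most likely state via argmax, and a controller that scores actions against the whole distribution (here using a simple belief-weighted, QMDP-style heuristic as one illustration). All names and numbers are invented for the example.

import numpy as np

def mdp_style_action(belief, state_policy):
    # Collapse P(x) to its most likely state, then act as if that state were certain.
    return state_policy[int(np.argmax(belief))]

def belief_weighted_action(belief, Q):
    # Score each action against the whole distribution: value(a) = sum_s b(s) * Q[s, a].
    return int(np.argmax(belief @ Q))

# Hypothetical per-state action values: action 0 is great in state 0 but
# disastrous in state 1; action 1 is safe everywhere.
Q = np.array([[ 1.0, 0.9],
              [-5.0, 0.9]])
state_policy = [0, 1]            # best action if the state were known exactly
belief = np.array([0.6, 0.4])

print(mdp_style_action(belief, state_policy))   # -> 0 (ignores the 40% risk)
print(belief_weighted_action(belief, Q))        # -> 1 (hedges against state 1)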
                  Passive        Controlled
Fully Observable  Markov Model   MDP
Hidden State      HMM            POMDP

• Stationary policy: Best action is fixed
• Non-stationary policy: Best action depends on time
• States can be discrete, continuous, or hybrid

Tradeoffs

• MDP
  + Tractable to solve
  + Relatively easy to specify
  – Assumes perfect knowledge of state
• POMDP
  + Treats all sources of uncertainty (action, sensing, environment) in a uniform framework
  + Allows for taking actions that gain information
  – Difficult to specify all the conditional probabilities
  – Hugely intractable to solve optimally

POMDP Advantages

• Models information gathering
• Computes trade-off between getting reward and being uncertain (see the sketch below)

"Am I here, or over there?"
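One way to see the reward-versus-uncertainty trade-off is to score each candidate action by two numbers: its expected immediate reward under the current belief, and the expected entropy of the belief after acting and observing. The sketch below does this for the toy model defined earlier (same assumed array layout); it is illustrative only. In that toy model, action 0 earns little reward but yields an informative observation (lower expected entropy), while action 1 earns more expected reward but leaves the belief flat.

import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def reward_vs_uncertainty(b, a, T, O, R):
    # Expected immediate reward under the current belief.
    exp_reward = float(b @ R[:, a])
    # Expected entropy of the posterior belief, averaged over observations.
    predicted = T[a].T @ b                       # P(s' | b, a)
    exp_entropy = 0.0
    for z in range(O.shape[2]):
        p_z = float(O[a][:, z] @ predicted)      # P(z | b, a)
        if p_z > 0:
            posterior = (O[a][:, z] * predicted) / p_z
            exp_entropy += p_z * entropy(posterior)
    return exp_reward, exp_entropy

# With the toy T, O, R, b0 defined earlier:
# reward_vs_uncertainty(b0, 0, T, O, R)  -> (0.0,  ~0.72 bits)
# reward_vs_uncertainty(b0, 1, T, O, R)  -> (0.75,  1.0 bits)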
A Simple POMDP

(Figure: a probability distribution p(s) over three states s1, s2, s3.)
• State is hidden

POMDP Policies

(Figure: a policy maps the current belief p(s), a point in belief space, to an optimal action.)
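A standard way to represent such a belief-to-action mapping (and the representation behind the point-based value iteration paper listed under additional reading) is a set of alpha-vectors: linear value functions over states, each tagged with an action, whose upper envelope over the belief simplex is the value function. The sketch below is illustrative only; the vectors, actions, and beliefs are made up.

import numpy as np

# Hypothetical alpha-vectors for a 2-state problem, each tagged with an action.
alpha_vectors = np.array([[ 2.0, -1.0],    # dominates when the belief favors s1
                          [-1.0,  2.0],    # dominates when the belief favors s2
                          [ 0.8,  0.8]])   # a "safe" action for uncertain beliefs
alpha_actions = [0, 1, 2]

def value(belief):
    # V(b) = max_k alpha_k . b : piecewise-linear and convex over belief space.
    return float(np.max(alpha_vectors @ belief))

def policy(belief):
    # Optimal action = the action attached to the maximizing alpha-vector.
    return alpha_actions[int(np.argmax(alpha_vectors @ belief))]

print(policy(np.array([0.9, 0.1])))   # -> 0
print(policy(np.array([0.5, 0.5])))   # -> 2
print(policy(np.array([0.1, 0.9])))   # -> 1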