1.3 Elements of Reinforcement Learning

the long run. In decision-making and planning, the derived quantity called value is the one with which we are most concerned. Unfortunately, it is much harder to determine values than it is to determine rewards. Rewards are basically given directly by the environment, but values must be estimated and reestimated from the sequences of observations an agent makes over its entire lifetime. In fact, the most important component of almost all reinforcement learning algorithms is a method for efficiently estimating values. The central role of value estimation is perhaps the most important thing we have learned about reinforcement learning in the last two decades.

Although all the reinforcement learning methods we consider in this book are structured around estimating value functions, it is not strictly necessary to do this to solve reinforcement learning problems. For example, search methods such as genetic algorithms, genetic programming, simulated annealing, and other function optimization methods have been used to solve reinforcement learning problems. These methods search directly in the space of policies without ever appealing to value functions. We call these evolutionary methods because their operation is analogous to the way biological evolution produces organisms with skilled behavior even though they do not themselves learn during their individual lifetimes. If the space of policies is sufficiently small, or can be structured so that good policies are common or easy to find, then evolutionary methods are often effective. In addition, evolutionary methods have advantages on problems in which the learning agent cannot accurately sense the state of its environment.

Nevertheless, what we mean by reinforcement learning involves learning while interacting with the environment, which evolutionary methods do not do. It is our belief that methods able to take advantage of the details of individual behavioral interactions can be much more efficient than evolutionary methods in a great many cases. Evolutionary methods ignore much of the useful structure of the reinforcement learning problem: they do not use the fact that the policy they are searching for is a function from states to actions; they do not notice which states an individual passes through during its lifetime, or which actions it selects. In some cases this information can be misleading (e.g., when states are misperceived), but more often it should enable more efficient search. Although evolution and learning share many features and can naturally work together, as they do in nature, we do not consider evolutionary methods by themselves to be especially well suited to reinforcement learning problems. For simplicity, when we use the term "reinforcement learning" in this book we do not include evolutionary methods.

The fourth and final element of some reinforcement learning systems is a model of the environment. This is something that mimics the behavior of the environment. For example, given a state and action, the model might predict the resultant next state and next reward. Models are used for planning, by which we mean any way of deciding on a course of action by considering possible future situations before they are actually experienced. The incorporation of models and planning into reinforcement learning systems is a relatively new development.
Early reinforcement learning systems were explicitly trial-and-error learners; what they did was viewed as almost the opposite of planning. Nevertheless, it gradually became clear that reinforcement learning methods are closely related to dynamic programming methods, which do use models, and that they in turn are closely related to state-space planning methods. In Chapter 9 we explore reinforcement learning systems that simultaneously learn by trial and error, learn a model of the environment, and use the model for planning. Modern reinforcement learning spans the spectrum from
low-level, trial-and-error learning to high-level, deliberative planning.
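To make this final element a little more concrete, here is a minimal sketch, in Python, of the kind of interface a learned environment model might expose for planning; the class and method names (TableModel, update, predict) and the sampling scheme are illustrative assumptions, not anything specified in the text.

```python
import random
from collections import defaultdict

class TableModel:
    """A minimal sketch of an environment model: it records what the
    environment did after each (state, action) pair and can replay a
    predicted (next_state, reward) for planning. Names are illustrative."""

    def __init__(self):
        # For each (state, action) pair, remember the outcomes observed so far.
        self.outcomes = defaultdict(list)

    def update(self, state, action, next_state, reward):
        # Learn the model from real experience.
        self.outcomes[(state, action)].append((next_state, reward))

    def predict(self, state, action):
        # Predict a next state and reward by sampling from past experience;
        # returns None if this (state, action) pair has never been seen.
        seen = self.outcomes.get((state, action))
        return random.choice(seen) if seen else None
```

Under this sketch, planning amounts to querying predict on hypothetical actions and comparing the imagined outcomes before committing to an action in the real environment.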
1.4 An Extended Example: Tic-Tac-Toe

To illustrate the general idea of reinforcement learning and contrast it with other approaches, we next consider a single example in more detail.

Consider the familiar child's game of Tic-Tac-Toe. Two players take turns playing on a three-by-three board. One player plays Xs and the other Os until one player wins by placing three of his marks in a row, horizontally, vertically, or diagonally, as the X player has in this game:

If the board fills up with neither player getting three in a row, the game is a draw. Because a skilled player can play so that he never loses, let us assume that we are playing against an imperfect player, one whose play is sometimes incorrect and allows us to win. For the moment, in fact, let us consider draws and losses to be equally bad for us. How might we construct a player that will find the imperfections in its opponent's play and learn to maximize its chances of winning?

Although this is a very simple problem, it cannot readily be solved in a satisfactory way by classical techniques. For example, the classical "minimax" solution from game theory is not correct here because it assumes a particular way of playing by the opponent. A minimax player would never reach a game state from which it could lose, even if in fact it always won from that state because of incorrect play by the opponent. Classical optimization methods for sequential decision problems, such as dynamic programming, can compute an optimal solution for any opponent, but require as input a complete specification of that opponent, including the probabilities with which the opponent makes each move in each board state. Let us assume that this information is not available a priori for this problem, as it is not for the vast majority of problems of practical interest. On the other hand, such information can be estimated from experience, in this case by playing many games against the opponent. About the best one can do on this problem is to first learn a model of the opponent's behavior, up to some level of confidence, and then apply dynamic programming to compute an optimal solution given the approximate opponent model. In the end, this is not that different from some of the reinforcement learning methods we examine later in this book.

An evolutionary approach to this problem would directly search the space of possible policies for one with a high probability of winning against the opponent. Here, a policy is a rule that tells the player what move to make for every state of the game, that is, every possible configuration of Xs and Os on the three-by-three board. For each policy considered, an estimate of its winning probability would be obtained by playing some number of games against the opponent. This evaluation would then direct which policy or
policies were next considered. A typical evolutionary method would hill-climb in policy space, successively generating and evaluating policies in an attempt to obtain incremental improvements. Or, perhaps, a genetic-style algorithm could be used that would maintain and evaluate a population of policies. Literally hundreds of different optimization algorithms could be applied. By directly searching the policy space we mean that entire policies are proposed and compared on the basis of scalar evaluations.

Here is how the Tic-Tac-Toe problem would be approached using reinforcement learning and approximate value functions. First we set up a table of numbers, one for each possible state of the game. Each number will be the latest estimate of the probability of our winning from that state. We treat this estimate as the state's current value, and the whole table is the learned value function. State A has higher value than state B, or is considered "better" than state B, if the current estimate of the probability of our winning from A is higher than it is from B. Assuming we always play Xs, then for all states with three Xs in a row the probability of winning is 1, because we have already won. Similarly, for all states with three Os in a row, or that are "filled up", the correct probability is 0, as we cannot win from them. We set the initial values of all the other states, the nonterminals, to 0.5, representing an informed guess that we have a 50% chance of winning.

Now we play many games against the opponent. To select our moves we examine the states that would result from each of our possible moves (one for each blank space on the board) and look up their current values in the table. Most of the time we move greedily, selecting the move that leads to the state with greatest value, that is, with the highest estimated probability of winning. Occasionally, however, we select randomly from among the other moves instead; these are called exploratory moves because they cause us to experience states that we might otherwise never see. A sequence of moves made and considered during a game can be diagrammed as in Figure 1.1.

Figure 1.1: A sequence of Tic-Tac-Toe moves. The solid lines represent the moves taken during a game; the dashed lines represent moves that we (our algorithm) considered but did not make. Our second move was an exploratory move, meaning that it was taken even though another sibling move, the one leading to e, was ranked higher. Exploratory moves do not result in any learning, but each of our other moves does, causing backups as suggested by the curved arrows and detailed in the text.
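As a rough illustration of the tabular setup and move-selection rule just described, here is a minimal Python sketch; the board encoding (a 9-tuple of 'X', 'O', or ' '), the exploration probability EPSILON, and all function names are assumptions made for the example, not details from the text.

```python
import random

EPSILON = 0.1  # probability of an exploratory move (illustrative value)

values = {}    # state -> estimated probability that we (playing X) win

def wins(state, player):
    """True if `player` has three in a row; the board is a 9-tuple, row-major."""
    lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]
    return any(all(state[i] == player for i in line) for line in lines)

def value(state):
    """Look up a state's value, initializing unseen states as in the text:
    1 for three Xs in a row, 0 for three Os or a full board, 0.5 otherwise."""
    if state not in values:
        if wins(state, 'X'):
            values[state] = 1.0
        elif wins(state, 'O') or ' ' not in state:
            values[state] = 0.0
        else:
            values[state] = 0.5
    return values[state]

def select_move(state):
    """Mostly greedy selection among successor states, with occasional exploration."""
    blanks = [i for i, cell in enumerate(state) if cell == ' ']
    successors = [tuple('X' if j == i else c for j, c in enumerate(state))
                  for i in blanks]
    if random.random() < EPSILON:
        return random.choice(successors), True   # exploratory move
    return max(successors, key=value), False     # greedy move
```

In this sketch a game state is simply a 9-tuple such as ('X', ' ', 'O', ...), and select_move returns both the chosen successor state and a flag marking whether the move was exploratory, which matters for the learning rule discussed next.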
While we are playing, we change the values of the states in which we find ourselves during the game. We attempt to make them more accurate estimates of the probabilities of winning from those states. To do this, we "back up" the value of the state after each greedy move to the state before the move, as suggested by the arrows in Figure 1.1. More precisely, the current value of the earlier state is adjusted to be closer to the value of the later state. This can be done by moving the earlier state's value a fraction of the way toward the value of the later state. If we let s denote the state before the greedy move, and s' the state after, then the update to the estimated value of s, denoted V(s), can be written

V(s) ← V(s) + α[V(s') − V(s)],

where α is a small positive fraction called the step-size parameter, which influences the rate of learning. This update rule is an example of a temporal-difference learning method, so called because its changes are based on a difference, V(s') − V(s), between estimates at two different times.

The method described above performs quite well on this task. For example, if the step-size parameter is reduced properly over time, this method converges, for any fixed opponent, to the true probabilities of winning from each state given optimal play by our player. Furthermore, the moves then taken (except on exploratory moves) are in fact the optimal moves against the opponent. In other words, the method converges to an optimal policy for playing the game. If the step-size parameter is not reduced all the way to zero over time, then this player also plays well against opponents that change their way of playing slowly over time.

This example illustrates the differences between evolutionary methods and methods that learn value functions. To evaluate a policy, an evolutionary method must hold the policy fixed and play many games against the opponent, or simulate many games using a model of the opponent. The frequency of wins gives an unbiased estimate of the probability of winning with that policy, and can be used to direct the next policy selection. But each policy change is made only after many games, and only the final outcome of each game is used: what happens during the games is ignored. For example, if the player wins, then all of its behavior in the game is given credit, independently of how specific moves might have been critical to the win. Credit is even given to moves that never occurred! Value function methods, in contrast, allow individual states to be evaluated. In the end, both evolutionary and value function methods search the space of policies, but learning a value function takes advantage of information available during the course of play.

This example is very simple, but it illustrates some of the key features of reinforcement learning
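Continuing the illustrative sketch from above, the temporal-difference update might look like the following in code; the name td_update and the step-size value ALPHA = 0.1 are again assumptions made for the example.

```python
ALPHA = 0.1  # step-size parameter alpha (illustrative value)

def td_update(state_before, state_after):
    """Back up the later state's value toward the earlier one:
    V(s) <- V(s) + alpha * [V(s') - V(s)]."""
    values[state_before] = value(state_before) + ALPHA * (
        value(state_after) - value(state_before))
```

In a full game loop this update would be applied only after greedy moves, consistent with the statement in the text that exploratory moves do not result in learning; the exploratory flag returned by select_move in the earlier sketch could be used to gate it.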