1.4 An Extended Example: Tic-Tac-Toe

This example illustrates some of the key features of reinforcement learning methods. First, there is the emphasis on learning while interacting with an environment, in this case with an opponent player. Second, there is a clear goal, and correct behavior requires planning or foresight that takes into account delayed effects of one's choices. For example, the simple reinforcement learning player would learn to set up multi-move traps for a short-sighted opponent. It is a striking feature of the reinforcement learning solution that it can achieve the effects of planning and lookahead without using a model of the opponent and without carrying out an explicit search over possible sequences of future states and actions.

While this example illustrates some of the key features of reinforcement learning, it is so simple that it might give the impression that reinforcement learning is more limited than it really is. Although Tic-Tac-Toe is a two-person game, reinforcement learning also applies when there is no explicit external adversary, that is, in the case of a "game against nature." Reinforcement learning is also not restricted to problems in which behavior breaks down into separate episodes, like the separate games of Tic-Tac-Toe, with reward only at the end of each episode. It is just as applicable when behavior continues indefinitely and when rewards of various magnitudes can be received at any time.

Tic-Tac-Toe has a relatively small, finite state set, whereas reinforcement learning can be applied to very large, or even infinite, state sets. For example, Gerry Tesauro (1992, 1995) combined the algorithm described above with an artificial neural network to learn to play the game of backgammon, which has approximately 10^20 states. Notice that with this many states it is impossible to ever experience more than a small fraction of them. Tesauro's program learned to play far better than any previous program, and now plays at the level of the world's best human players (see Chapter 11). The neural network provides the program with the ability to generalize from its experience, so that in new states it selects moves based on information saved from similar states faced in the past, as determined by its network. How well a reinforcement learning agent can work in problems with such large state sets is intimately tied to how appropriately it can generalize from past experience. It is in this role that we have the greatest need for combining supervised learning methods with reinforcement learning. Neural networks are not the only, or necessarily the best, way to do this.

In this Tic-Tac-Toe example, learning started with no prior knowledge beyond the rules of the game, but reinforcement learning by no means entails a tabula rasa view of learning and intelligence. On the contrary, prior information can be incorporated into reinforcement learning in a variety of ways that can be critical for efficient learning. We also had access to the true state in the Tic-Tac-Toe example, whereas reinforcement learning can also be applied when part of the state is hidden, or when different states appear to the learner to be the same. That case, however, is substantially more difficult, and we do not cover it significantly in this book.

Finally, the Tic-Tac-Toe player was able to look ahead and know the states that would result from each of its possible moves. To do this, it had to have a model of the game that allows it to "think about" how its environment would change in response to moves that it may never make.
Many problems are like this, but in others even a short-term model of the effects of actions is lacking. Reinforcement learning can be applied in either case. No model is required, but models can easily be used if they are available or can be learned.
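To make the value-updating idea concrete, the following is a minimal sketch, in Python, of the kind of table-based temporal-difference update described earlier in the chapter: after a greedy move, the value of the earlier state is nudged toward the value of the state that follows it. The board encoding, the helper names, and the constants below are illustrative assumptions, not the book's own code.

    import random

    values = {}      # state (here, a tuple encoding the board) -> estimated probability of winning
    ALPHA = 0.1      # step-size parameter (illustrative value)
    EPSILON = 0.1    # fraction of moves made exploratorily (illustrative value)

    def value(state):
        # States not yet seen start at 0.5, the initial guess used in the chapter's example;
        # terminal states are assumed to be set to 1 (win) or 0 (loss or draw) elsewhere.
        return values.get(state, 0.5)

    def choose_move(afterstates):
        # afterstates: the states reachable by one of our legal moves; the player can
        # enumerate them because it knows the rules of the game.
        if random.random() < EPSILON:
            return random.choice(afterstates)   # exploratory move
        return max(afterstates, key=value)      # greedy move

    def td_update(previous_state, next_state):
        # Move the earlier state's value a fraction of the way toward the later one:
        #   V(s) <- V(s) + alpha * [V(s') - V(s)]
        values[previous_state] = value(previous_state) + ALPHA * (
            value(next_state) - value(previous_state))

In the scheme described in the chapter, this update is applied only after greedy moves; Exercise 1.4 below asks what would change if it were applied after exploratory moves as well.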
Exercise 1.1: Self-Play. Suppose, instead of playing against a random opponent, the reinforcement learning algorithm described above played against itself. What do you think would happen in this case? Would it learn a different way of playing?

Exercise 1.2: Symmetries. Many Tic-Tac-Toe positions appear different but are really the same because of symmetries. How might we amend the reinforcement learning algorithm described above to take advantage of this (one possible direction is sketched after these exercises)? In what ways would this improve it? Now think again. Suppose the opponent did not take advantage of symmetries. In that case, should we? Is it true, then, that symmetrically equivalent positions should necessarily have the same value?

Exercise 1.3: Greedy Play. Suppose the reinforcement learning player was greedy, that is, it always played the move that brought it to the position that it rated the best. Would it learn to play better, or worse, than a non-greedy player? What problems might occur?

Exercise 1.4: Learning from Exploration. Suppose learning updates occurred after all moves, including exploratory moves. If the step-size parameter is reduced over time appropriately, then the state values would converge to a set of probabilities. What are the two sets of probabilities computed when we do, and when we do not, learn from exploratory moves? Assuming that we do continue to make exploratory moves, which set of probabilities might be better to learn? Which would result in more wins?

Exercise 1.5: Other Improvements. Can you think of other ways to improve the reinforcement learning player? Can you think of any better way to solve the Tic-Tac-Toe problem as posed?
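As a companion to Exercise 1.2, here is one possible sketch (not the only answer, and not an endorsement of it) of what taking advantage of symmetries could look like in code: map every board to a canonical representative under the eight rotations and reflections of the 3x3 grid, and key the value table on that representative. The 9-element tuple encoding and the function names are assumptions made for illustration; board cells are assumed to be comparable values such as small integers.

    def rotate(board):
        # board: tuple of 9 cells in row-major order; rotate the 3x3 grid 90 degrees clockwise.
        return tuple(board[6 - 3 * (i % 3) + i // 3] for i in range(9))

    def reflect(board):
        # Mirror the 3x3 grid left-to-right.
        return tuple(board[3 * (i // 3) + (2 - i % 3)] for i in range(9))

    def canonical(board):
        # Return one representative of the board's symmetry class: the lexicographically
        # smallest board among the eight rotations and reflections.
        variants = []
        b = board
        for _ in range(4):
            variants.append(b)
            variants.append(reflect(b))
            b = rotate(b)
        return min(variants)

The value table would then be indexed by canonical(board) rather than by the board itself, so that experience with any one member of a symmetry class updates all of them. Whether this is actually desirable against a particular opponent is exactly the question the exercise raises.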
1.5 Summary

Reinforcement learning is a computational approach to understanding and automating goal-directed learning and decision-making. It is distinguished from other computational approaches by its emphasis on learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment. In our opinion, reinforcement learning is the first field to seriously address the computational issues that arise when learning from interaction with an environment in order to achieve long-term goals.

Reinforcement learning uses a formal framework defining the interaction between agent and environment in terms of states, actions, and rewards. This framework is intended to be a simple way of representing essential features of the artificial intelligence problem. These features include a sense of cause and effect, of uncertainty and nondeterminism, and the existence of explicit goals. Most relevant is the formalism of Markov decision processes, which provides a precise, and relatively neutral, way of including the key features.

The concepts of value and value functions are the key features of the reinforcement learning methods that we consider in this book. We take the position that value functions are essential for efficient search in the space of policies. The use of value functions distinguishes reinforcement learning methods from evolutionary methods that search directly in policy space guided by scalar evaluations of entire policies.
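As a preview of where this framework is headed, the following display gives the standard definition of the state-value function of a policy in the MDP formalism. The notation (a policy pi, a discount rate gamma, and rewards r) anticipates later chapters rather than anything defined so far; it is shown here only to indicate the kind of object a value function is.

    % State-value function of a policy \pi (standard MDP definition; the discount rate
    % \gamma and the reward notation are introduced in later chapters):
    V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1} \;\middle|\; s_t = s \right]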
1.6 History of Reinforcement Learning

The history of reinforcement learning has two main threads, both long and rich, which were pursued independently before intertwining in modern reinforcement learning. One thread concerns learning by trial and error and started in the psychology of animal learning. This thread runs through some of the earliest work in artificial intelligence and led to the revival of reinforcement learning in the early 1980s. The other thread concerns the problem of optimal control and its solution using value functions and dynamic programming. For the most part, this thread did not involve learning. Although the two threads were largely independent, the exceptions revolve around a third, less distinct thread concerning temporal-difference methods such as the one used in the Tic-Tac-Toe example in this chapter. All three threads came together in the late 1980s to produce the modern field of reinforcement learning as we present it in this book.

The thread focusing on trial-and-error learning is the one with which we are most familiar and about which we have the most to say in this brief history. Before doing that, however, we briefly discuss the optimal control thread.

The term "optimal control" came into use in the late 1950s to describe the problem of designing a controller to minimize a measure of a dynamical system's behavior over time. One of the approaches to this problem was developed in the mid-1950s by Richard Bellman and colleagues by extending a 19th-century theory of Hamilton and Jacobi. This approach uses the concept of a dynamical system's state and of a value function, or "optimal return function," to define a functional equation, now often called the Bellman equation. The class of methods for solving optimal control problems by solving this equation came to be known as dynamic programming (Bellman, 1957a). Bellman (1957b) also introduced the discrete stochastic version of the optimal control problem known as Markovian decision processes (MDPs), and Ron Howard (1960) devised the policy iteration method for MDPs. All of these are essential elements underlying the theory and algorithms of modern reinforcement learning.

Dynamic programming is widely considered the only feasible way of solving general stochastic optimal control problems. It suffers from what Bellman called "the curse of dimensionality," meaning that its computational requirements grow exponentially with the number of state variables, but it is still far more efficient and more widely applicable than any other method. Dynamic programming has been extensively developed in the last four decades, including extensions to partially observable MDPs (surveyed by Lovejoy, 1991), many applications (surveyed by White, 1985, 1988, 1993), approximation methods (surveyed by Rust, 1996), and asynchronous methods (Bertsekas, 1982, 1983). Many excellent modern treatments of dynamic programming are available (e.g., Bertsekas, 1995; Puterman, 1994; Ross, 1983; and Whittle, 1982, 1983). Bryson (1996) provides a detailed authoritative history of optimal control.

In this book, we consider all of the work on optimal control to also be work in reinforcement learning.
We define reinforcement learning as any effective way of solving reinforcement learning problems, and it is now clear that these problems are very closely related to optimal control problems, particularly those formulated as MDPs. Accordingly, we must consider the solution methods of optimal control, such as dynamic programming, to also be reinforcement learning methods. Of course, almost all of these methods require complete knowledge of the system to be controlled, and for this reason it feels a little unnatural to say that they are part of reinforcement learning. On the other hand, many dynamic programming methods are incremental and iterative. Like true learning methods, they gradually reach the correct answer through successive approximations. As we show in the rest of this book, these similarities are far more than superficial. The theories and solution methods for the cases of complete and incomplete knowledge are so closely related that we feel they must be considered together as part of the same subject matter.

Let us return now to the other major thread leading to the modern field of reinforcement learning, that centered around the idea of trial-and-error learning. This thread began in psychology, where "reinforcement" theories of learning are common. Perhaps the first to succinctly express the essence of trial-and-error learning was Edward Thorndike. We take this essence to be the idea that actions followed by good or bad outcomes have their tendency to be re-selected altered accordingly. In Thorndike's words:

    Of several responses made to the same situation, those which are accompanied or closely followed by satisfaction to the animal will, other things being equal, be more firmly connected with the situation, so that, when it recurs, they will be more likely to recur; those which are accompanied or closely followed by discomfort to the animal will, other things being equal, have their connections with that situation weakened, so that, when it recurs, they will be less likely to occur. The greater the satisfaction or discomfort, the greater the strengthening or weakening of the bond. (Thorndike, 1911, p. 244)

Thorndike called this the "Law of Effect" because it describes the effect of reinforcing events on the tendency to select actions. Although sometimes controversial (e.g., see Kimble, 1961, 1967; Mazur, 1994), the Law of Effect is widely regarded as an obvious basic principle underlying much behavior (e.g., Hilgard and Bower, 1975; Dennett, 1978; Campbell, 1958; Cziko, 1995).

The Law of Effect includes the two most important aspects of what we mean by trial-and-error learning. First, it is selectional rather than instructional in the sense of Monod, meaning that it involves trying alternatives and selecting among them by comparing their consequences. Second, it is associative, meaning that the alternatives found by selection are associated with particular situations. Natural selection in evolution is a prime example of a selectional process, but it is not associative. Supervised learning is associative, but not selectional. It is the combination of these two that is essential to the Law of Effect and to trial-and-error learning. Another way of saying this is that the Law of Effect is an elementary way of combining search and memory: search in the form of trying and selecting among many actions in each situation, and memory in the form of remembering what actions worked best, associating them with the situations in which they were best. Combining search and memory in this way is essential to reinforcement learning.