Figures for Sutton & Barto Book: Reinforcement Learning: An Introduction

● Page 118 Small GPI Diagram
● Figure 5.5 Blackjack Solution
● Figure 5.8 Two Racetracks
● Figure 6.2 TD(0) Backup Diagram
● Figure 6.3 Monte Carlo Driving Example
● Figure 6.4 TD Driving Example
● Figure 6.5 5-State Random-Walk Process
● Figure 6.6 Values Learned in a Sample Run of Walks
● Figure 6.7 Learning of TD and MC Methods on Walks
● Figure 6.8 Batch Performance of TD and MC Methods
● Page 143 You Are the Predictor Example
● Page 145 Sequence of States and Actions
● Figure 6.10 Windy Gridworld
● Figure 6.11 Performance of Sarsa on Windy Gridworld
● Figure 6.13 Q-learning Backup Diagram
● Figure 6.14 Cliff-Walking Task
● Figure 6.15 The Actor-Critic Architecture
● Figure 6.17 Solution to Access-Control Queuing Task
● Page 156 Tic-Tac-Toe After States
● Figure 7.1 N-Step Backups
● Figure 7.2 N-Step Results
● Page 169 Mixed Backup
● Figure 7.3 Backup Diagram for TD(lambda)
● Figure 7.4 Weighting of Returns in lambda-return
● Figure 7.5 The Forward View
● Figure 7.6 lambda-return Algorithm Performance
● Page 173 Accumulating Traces
● Figure 7.8 The Backward View
● Figure 7.9 Performance of TD(lambda)
● Figure 7.10 Sarsa(lambda)'s Backup Diagram
● Figure 7.12 Tabular Sarsa(lambda)
● Figure 7.13 Backup Diagram for Watkins's Q(lambda)
● Figure 7.15 Backup Diagram for Peng's Q(lambda)
● Figure 7.16 Accumulating and Replacing Traces
● Figure 7.17 Error as a Function of Lambda
● Figure 7.18 The Right-Action Task
● Figure 8.2 Coarse Coding
● Figure 8.3 Generalization via Coarse Coding
● Figure 8.4 Tile Width Affects Generalization, Not Acuity
● Page 206 2D Grid Tiling
● Figure 8.5 Multiple, Overlapping Grid Tilings
● Figure 8.6 Tilings
● Page 207 One Hash-Coded Tile
● Figure 8.7 Radial Basis Functions
● Figure 8.10 Mountain-Car Value Functions
● Figure 8.11 Mountain-Car Results
● Figure 8.12 Baird's Counterexample
● Figure 8.13 Blowup of Baird's Counterexample
● Figure 8.14 Tsitsiklis and Van Roy's Counterexample
● Figure 8.15 Summary Effect of Lambda
● Figure 9.2 Circle of Learning, Planning, and Acting
● Figure 9.3 The General Dyna Architecture
● Figure 9.5 Dyna Results
● Figure 9.6 Snapshot of Dyna Policies
● Figure 9.7 Results on Blocking Task
● Figure 9.8 Results on Shortcut Task
● Figure 9.10 Peng and Williams Figure
● Figure 9.11 Moore and Atkeson Figure
● Figure 9.12 The One-Step Backups
● Figure 9.13 Full vs. Sample Backups
● Figure 9.14 Uniform vs. On-Policy Backups
● Figure 9.15 Heuristic Search as One-Step Backups
● Figure 10.1 The Space of Backups
● Figure 11.1 A Backgammon Position
● Figure 11.2 TD-Gammon Network
● Figure 11.3 Backup Diagram for Samuel's Checker Player
● Figure 11.4 The Acrobot
● Figure 11.6 Performance on Acrobot Task
● Figure 11.7 Learned Behavior of Acrobot
● Figure 11.8 Four Elevators
● Figure 11.9 Elevator Results
● Page 280 Channel Assignment Example
● Figure 11.10 Performance of Channel Allocation Methods
● Figure 11.11 Comparison of Schedule Repairs
● Figure 11.12 Comparison of CPU Time
Errata and Notes for: Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto

Errata:

● p. xviii, Ben Van Roy should be acknowledged only once in the list. (Ben Van Roy)
● p. 155, the parameter alpha was 0.01, not 0.1 as stated. (Abinav Garg)
● p. 233, last line of caption: "ne-step" should be "one-step". (Michael Naish)
● p. 309, the reference for Tsitsiklis and Van Roy (1997b) should be to Technical Report LIDS-P-2390, Massachusetts Institute of Technology. (Ben Van Roy)
● p. 146, the windy gridworld example may have used alpha=0.5 rather than alpha=0.1 as stated. Can you confirm this?
● p. 322, in the index entry for TD error, the range listed as "174-165" should be "174-175". (Jette Randlov)
● p. 197, in the bottom formula, the last theta_t(2) should be theta_t(n). (Dan Bernstein)
● p. 151, in the second line of the equation, pi(s_t, a_t) should be pi(s_{t+1}, a_t). (Dan Bernstein)
● p. 174, 181, 184, 200, 212, 213: in the boxed algorithms on all these pages, the setting of the eligibility traces to zero should appear not in the first line, but as a new first line inside the first loop (just after the "Repeat..."). (Jim Reggia)
● p. 215, Figure 8.11, the y-axis label "first 20 trials" should be "first 20 episodes".
● p. 215. The data shown in Figure 8.11 was apparently not generated exactly as described in the text, as its details (but not its overall shape) have defied replication. In particular, several researchers have reported best "steps per episode" in the 200-300 range.
● p. 78. In the 2nd max equation for V*(h), at the end of the first line, "V*(h)" should be "V*(l)". (Christian Schulz)
● p. 29. In the upper graph, the third line is unlabeled, but should be labeled "epsilon=0 (greedy)".
● p. 212-213. In these two algorithms, a line is missing that is recommended, though perhaps not required. A new next-to-last line should be added, just before the end of the loop, that recomputes Q_a. That line would be Q_a <- \sum_{i \in F_a} theta(i).
● p. 127, Figure 5.7. The first two lines of step (c) refer to pairs s,a and times t at or later than time \tau. In fact, they should only treat times later than \tau, not equal to it. (Thorsten Buchheim)
● p. 267, Table 11.1. The number of hidden units for TD-Gammon 3.0 is given as 80, but should be 160. (Michael Naish)
● p. 98, Figure 4.3. Stuart Reynolds points out that for some MDPs the given policy iteration algorithm never terminates. The problem is that there may be small changes in the values computed in step 2 that cause the policy to forever be changing in step 3. The solution is to terminate step 3 not when the policy is stable, but as soon as the largest change in state value due to a policy change is less than some epsilon. (A minimal sketch of this modified test appears after this list.)
● p. 259, the reference to McCallum, 1992 should be to Chrisman, 1992. And in the references section, on p. 302, the (incorrect) listing for McCallum, 1992 should not be there. (Paul Crook)
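For readers who want to apply the p. 98 fix in their own code, here is a minimal NumPy sketch of policy iteration with the modified termination test. This is not the book's pseudocode: the array layout (P[a][s, s'] for transition probabilities, R[a][s] for expected rewards) and all names are assumptions made for illustration only.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-6, epsilon=1e-6):
    """Policy iteration with the modified stopping test: stop when the
    largest change in state value caused by a policy change is below
    epsilon, rather than when the policy itself is stable.

    P: list of (n_states, n_states) arrays, P[a][s, s'] (assumed layout)
    R: list of (n_states,) arrays, R[a][s]               (assumed layout)
    """
    n_states, n_actions = P[0].shape[0], len(P)
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)

    while True:
        # Step 2: iterative policy evaluation of the current policy
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                a = policy[s]
                V[s] = R[a][s] + gamma * P[a][s] @ V
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break

        # Step 3: greedy policy improvement, tracking how much each
        # policy change actually moves the state's value
        max_value_change = 0.0
        for s in range(n_states):
            q = np.array([R[a][s] + gamma * P[a][s] @ V
                          for a in range(n_actions)])
            best_a = int(np.argmax(q))
            if best_a != policy[s]:
                max_value_change = max(max_value_change,
                                       abs(q[best_a] - V[s]))
                policy[s] = best_a

        # Modified termination: even if the policy keeps switching among
        # equally good actions, stop once no switch improves any state's
        # value by more than epsilon
        if max_value_change < epsilon:
            return policy, V
```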
Notes:

● p. 212-213. In these two algorithms, it is implicit that the set of features for the terminal state (and all actions) is the empty set.
● p. 28. The description of the 10-armed testbed could be clearer. Basically, there are 2000 randomly generated 10-armed bandits. The Q*(a) values of each were selected from a normal distribution with mean 0 and variance 1. Then, on each play with each bandit, the reward was determined by adding to Q*(a) another normally distributed random number with mean 0 and variance 1. (A minimal sketch of this generation process appears after this list.)
● p. 127, Figure 5.7. This algorithm is only valid if all policies are proper, meaning that they produce episodes that always eventually terminate (this assumption is made on the first page of the chapter). This restriction on environments can be lifted if the algorithm is modified to use epsilon-soft policies, which are proper for all environments. Such a modification is a good exercise for the reader! Alternative ideas for off-policy Monte Carlo learning are discussed in this recent research paper.
● John Tsitsiklis has obtained some new results which come very close to solving "one of the most important open theoretical questions in reinforcement learning" -- the convergence of Monte Carlo ES. See here.
● The last equation on page 214 can be a little confusing. The minus sign there is meant to be grouped with the 0.0025 (as the spacing suggests). Thus the consecutive plus and minus signs have the same effect as a single minus sign. (Chris Hobbs)
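As a concrete illustration of the p. 28 note, here is a minimal NumPy sketch of one way such a testbed could be generated. The function and variable names are invented for this example and are not from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks, n_arms = 2000, 10

# True action values Q*(a): one row per bandit task, each value drawn
# from a normal distribution with mean 0 and variance 1
q_star = rng.normal(loc=0.0, scale=1.0, size=(n_tasks, n_arms))

def reward(task, action):
    """Reward for selecting `action` on bandit `task`: Q*(a) plus
    another N(0, 1) sample, as described in the note above."""
    return q_star[task, action] + rng.normal(0.0, 1.0)

# Example: one play of arm 3 on the first of the 2000 bandit tasks
r = reward(0, 3)
```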