Figures for Sutton & Barto Book: Reinforcement Learning: An Introduction

● Page 118 Small GPI Diagram
● Figure 5.5 Blackjack Solution
● Figure 5.8 Two Racetracks
● Figure 6.2 TD(0) Backup Diagram
● Figure 6.3 Monte Carlo Driving Example
● Figure 6.4 TD Driving Example
● Figure 6.5 5-State Random-Walk Process
● Figure 6.6 Values Learned in a Sample Run of Walks
● Figure 6.7 Learning of TD and MC Methods on Walks
● Figure 6.8 Batch Performance of TD and MC Methods
● Page 143 You Are the Predictor Example
● Page 145 Sequence of States and Actions
● Figure 6.10 Windy Gridworld
● Figure 6.11 Performance of Sarsa on Windy Gridworld
● Figure 6.13 Q-learning Backup Diagram
● Figure 6.14 Cliff-Walking Task
● Figure 6.15 The Actor-Critic Architecture
● Figure 6.17 Solution to Access-Control Queuing Task
● Page 156 Tic-Tac-Toe After States
● Figure 7.1 N-Step Backups
● Figure 7.2 N-Step Results
● Page 169 Mixed Backup
● Figure 7.3 Backup Diagram for TD(lambda)
● Figure 7.4 Weighting of Returns in lambda-return
● Figure 7.5 The Forward View
● Figure 7.6 lambda-return Algorithm Performance
● Page 173 Accumulating Traces
● Figure 7.8 The Backward View
● Figure 7.9 Performance of TD(lambda)
● Figure 7.10 Sarsa(lambda)'s Backup Diagram
● Figure 7.12 Tabular Sarsa(lambda)
● Figure 7.13 Backup Diagram for Watkins's Q(lambda)
● Figure 7.15 Backup Diagram for Peng's Q(lambda)
● Figure 7.16 Accumulating and Replacing Traces
● Figure 7.17 Error as a Function of Lambda
● Figure 7.18 The Right-Action Task
● Figure 8.2 Coarse Coding
● Figure 8.3 Generalization via Coarse Coding
● Figure 8.4 Tile Width Affects Generalization, Not Acuity
● Page 206 2D Grid Tiling
● Figure 8.5 Multiple, Overlapping Grid Tilings
● Figure 8.6 Tilings
● Page 207 One Hash-Coded Tile
● Figure 8.7 Radial Basis Functions
● Figure 8.10 Mountain-Car Value Functions
● Figure 8.11 Mountain-Car Results
● Figure 8.12 Baird's Counterexample
● Figure 8.13 Blowup of Baird's Counterexample
● Figure 8.14 Tsitsiklis and Van Roy's Counterexample
● Figure 8.15 Summary Effect of Lambda
● Figure 9.2 Circle of Learning, Planning, and Acting
● Figure 9.3 The General Dyna Architecture
● Figure 9.5 Dyna Results
● Figure 9.6 Snapshot of Dyna Policies
● Figure 9.7 Results on Blocking Task
● Figure 9.8 Results on Shortcut Task
● Figure 9.10 Peng and Williams Figure
● Figure 9.11 Moore and Atkeson Figure
● Figure 9.12 The One-Step Backups
● Figure 9.13 Full vs. Sample Backups
● Figure 9.14 Uniform vs. On-Policy Backups
● Figure 9.15 Heuristic Search as One-Step Backups
● Figure 10.1 The Space of Backups
● Figure 11.1 A Backgammon Position
● Figure 11.2 TD-Gammon Network
● Figure 11.3 Backup Diagram for Samuel's Checker Player
● Figure 11.4 The Acrobot
● Figure 11.6 Performance on Acrobot Task
● Figure 11.7 Learned Behavior of Acrobot
● Figure 11.8 Four Elevators
● Figure 11.9 Elevator Results
● Page 280 Channel Assignment Example
● Figure 11.10 Performance of Channel Allocation Methods
● Figure 11.11 Comparison of Schedule Repairs
● Figure 11.12 Comparison of CPU Time
Errata and Notes for: Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto

Errata:

● p. xviii, Ben Van Roy should be acknowledged only once in the list. (Ben Van Roy)
● p. 155, the parameter alpha was 0.01, not 0.1 as stated. (Abinav Garg)
● p. 233, last line of caption: "ne-step" should be "one-step". (Michael Naish)
● p. 309, the reference for Tsitsiklis and Van Roy (1997b) should be to Technical Report LIDS-P-2390, Massachusetts Institute of Technology. (Ben Van Roy)
● p. 146, the windy gridworld example may have used alpha=0.5 rather than alpha=0.1 as stated. Can you confirm this?
● p. 322, in the index entry for TD error, the range listed as "174-165" should be "174-175". (Jette Randlov)
● p. 197, in the bottom formula, the last theta_t(2) should be theta_t(n). (Dan Bernstein)
● p. 151, in the second line of the equation, pi(s_t, a_t) should be pi(s_{t+1}, a_t). (Dan Bernstein)
● p. 174, 181, 184, 200, 212, 213: in the boxed algorithms on all these pages, the setting of the eligibility traces to zero should appear not in the first line, but as a new first line inside the first loop (just after the "Repeat..."). (Jim Reggia)
● p. 215, Figure 8.11, the y-axis label "first 20 trials" should be "first 20 episodes".
● p. 215. The data shown in Figure 8.11 was apparently not generated exactly as described in the text, as its details (but not its overall shape) have defied replication. In particular, several researchers have reported best "steps per episode" in the 200-300 range.
● p. 78. In the 2nd max equation for V*(h), at the end of the first line, "V*(h)" should be "V*(l)". (Christian Schulz)
● p. 29. In the upper graph, the third line is unlabeled, but should be labeled "epsilon=0 (greedy)".
● p. 212-213. In these two algorithms, a line is missing that is recommended, though perhaps not required. A new next-to-last line should be added, just before the end of the loop, that recomputes Q_a. That line would be Q_a <- \sum_{i \in F_a} theta(i).
● p. 127, Figure 5.7. The first two lines of step (c) refer to pairs s,a and times t at or later than time \tau. In fact, they should only treat times later than \tau, not equal to it. (Thorsten Buchheim)
● p. 267, Table 11.1. The number of hidden units for TD-Gammon 3.0 is given as 80, but should be 160. (Michael Naish)
● p. 98, Figure 4.3. Stuart Reynolds points out that for some MDPs the given policy iteration algorithm never terminates. The problem is that there may be small changes in the values computed in step 2 that cause the policy to forever be changing in step 3. The solution is to terminate step 3 not when the policy is stable, but as soon as the largest change in state value due to a policy change is less than some epsilon. (A minimal sketch of this modified test appears after this list.)
● p. 259, the reference to McCallum, 1992 should be to Chrisman, 1992. And in the references section, on p. 302, the (incorrect) listing for McCallum, 1992 should not be there. (Paul Crook)
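For readers who want to apply the p. 98 fix in their own code, here is a minimal NumPy sketch of policy iteration with the modified termination test. This is not the book's pseudocode: the array layout (P[a][s, s'] for transition probabilities, R[a][s] for expected rewards) and all names are assumptions made for illustration only.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-6, epsilon=1e-6):
    """Policy iteration with the modified stopping test: stop when the
    largest change in state value caused by a policy change is below
    epsilon, rather than when the policy itself is stable.

    P: list of (n_states, n_states) arrays, P[a][s, s'] (assumed layout)
    R: list of (n_states,) arrays, R[a][s]               (assumed layout)
    """
    n_states, n_actions = P[0].shape[0], len(P)
    policy = np.zeros(n_states, dtype=int)
    V = np.zeros(n_states)

    while True:
        # Step 2: iterative policy evaluation of the current policy
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                a = policy[s]
                V[s] = R[a][s] + gamma * P[a][s] @ V
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break

        # Step 3: greedy policy improvement, tracking how much each
        # policy change actually moves the state's value
        max_value_change = 0.0
        for s in range(n_states):
            q = np.array([R[a][s] + gamma * P[a][s] @ V
                          for a in range(n_actions)])
            best_a = int(np.argmax(q))
            if best_a != policy[s]:
                max_value_change = max(max_value_change,
                                       abs(q[best_a] - V[s]))
                policy[s] = best_a

        # Modified termination: even if the policy keeps switching among
        # equally good actions, stop once no switch improves any state's
        # value by more than epsilon
        if max_value_change < epsilon:
            return policy, V
```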
Notes:

● p. 212-213. In these two algorithms, it is implicit that the set of features for the terminal state (and all actions) is the empty set.
● p. 28. The description of the 10-armed testbed could be clearer. Basically, there are 2000 randomly generated 10-armed bandits. The Q*(a) values of each were selected from a normal distribution with mean 0 and variance 1. Then, on each play with each bandit, the reward was determined by adding to Q*(a) another normally distributed random number with mean 0 and variance 1. (A minimal sketch of this generation process appears after this list.)
● p. 127, Figure 5.7. This algorithm is only valid if all policies are proper, meaning that they produce episodes that always eventually terminate (this assumption is made on the first page of the chapter). This restriction on environments can be lifted if the algorithm is modified to use epsilon-soft policies, which are proper for all environments. Such a modification is a good exercise for the reader! Alternative ideas for off-policy Monte Carlo learning are discussed in this recent research paper.
● John Tsitsiklis has obtained some new results which come very close to solving "one of the most important open theoretical questions in reinforcement learning" -- the convergence of Monte Carlo ES. See here.
● The last equation on page 214 can be a little confusing. The minus sign there is meant to be grouped with the 0.0025 (as the spacing suggests). Thus the consecutive plus and minus signs have the same effect as a single minus sign. (Chris Hobbs)
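As a concrete illustration of the p. 28 note, here is a minimal NumPy sketch of one way such a testbed could be generated. The function and variable names are invented for this example and are not from the book.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tasks, n_arms = 2000, 10

# True action values Q*(a): one row per bandit task, each value drawn
# from a normal distribution with mean 0 and variance 1
q_star = rng.normal(loc=0.0, scale=1.0, size=(n_tasks, n_arms))

def reward(task, action):
    """Reward for selecting `action` on bandit `task`: Q*(a) plus
    another N(0, 1) sample, as described in the note above."""
    return q_star[task, action] + rng.normal(0.0, 1.0)

# Example: one play of arm 3 on the first of the 2000 bandit tasks
r = reward(0, 3)
```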