Review: Gradient Descent

Start at position $\theta^0$.
Compute the gradient at $\theta^0$; move to $\theta^1 = \theta^0 - \eta \nabla L(\theta^0)$.
Compute the gradient at $\theta^1$; move to $\theta^2 = \theta^1 - \eta \nabla L(\theta^1)$.
......

[Figure: contour plot of the loss showing the path $\theta^0 \to \theta^1 \to \theta^2 \to \theta^3$; at each point the gradient $\nabla L(\theta^t)$ points across the contours and the movement $-\eta \nabla L(\theta^t)$ points the opposite way.]

Gradient: the normal direction to the contour lines of the Loss.

– Machine Learning http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17_2.html
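A minimal sketch of this update rule, assuming a toy quadratic loss $L(\theta) = (\theta_1 - 3)^2 + (\theta_2 + 1)^2$; the loss, learning rate, and iteration count are illustrative choices, not values from the slide:

```python
import numpy as np

def grad_L(theta):
    """Gradient of the toy loss L(theta) = (theta[0]-3)^2 + (theta[1]+1)^2 (illustrative only)."""
    return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

eta = 0.1                      # learning rate η
theta = np.array([0.0, 0.0])   # start at position θ^0
for t in range(100):
    theta = theta - eta * grad_L(theta)   # θ^{t+1} = θ^t − η ∇L(θ^t)
print(theta)                   # approaches the minimum at (3, -1)
```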
Linear Regression: Gradient Descent

Hypothesis and loss function:

$h_\theta(x_0, x_1, \dots, x_n) = \sum_{i=0}^{n} \theta_i x_i$

$J(\theta_0, \theta_1, \dots, \theta_n) = \frac{1}{2} \sum_{j=0}^{m} \big(h_\theta(x_0^{(j)}, x_1^{(j)}, \dots, x_n^{(j)}) - y_j\big)^2$

Algorithm:

1. Compute the gradient of the loss function at the current position. For $\theta_i$, the partial derivative is
   $\frac{\partial}{\partial \theta_i} J(\theta_0, \theta_1, \dots, \theta_n) = \sum_{j=0}^{m} \big(h_\theta(x_0^{(j)}, x_1^{(j)}, \dots, x_n^{(j)}) - y_j\big)\, x_i^{(j)}$
2. Multiply the step size (learning rate) $\alpha$ by the gradient to get the descent distance at the current position, $\alpha \frac{\partial}{\partial \theta_i} J(\theta_0, \theta_1, \dots, \theta_n)$; this corresponds to one step in the figure above.
3. Check whether the descent distance is smaller than $\varepsilon$ for every $\theta_i$. If so, the algorithm terminates and the current $\theta_i$ ($i = 0, 1, \dots, n$) are the final result; otherwise go to step 4.
4. Update all $\theta_i$. For $\theta_i$, the update is
   $\theta_i := \theta_i - \alpha \frac{\partial}{\partial \theta_i} J(\theta_0, \theta_1, \dots, \theta_n) = \theta_i - \alpha \sum_{j=0}^{m} \big(h_\theta(x_0^{(j)}, x_1^{(j)}, \dots, x_n^{(j)}) - y_j\big)\, x_i^{(j)}$
   After updating, return to step 1. (A runnable sketch of these steps follows.)
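A minimal NumPy sketch of steps 1–4 above on synthetic data; the data, learning rate $\alpha$, and tolerance $\varepsilon$ are illustrative assumptions, not values from the text:

```python
import numpy as np

# Synthetic data (illustrative): 100 samples, a bias column x_0 = 1 plus two features
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)), rng.normal(size=(100, 2))])
true_theta = np.array([1.0, 2.0, -3.0])
y = X @ true_theta + 0.01 * rng.normal(size=100)

alpha, eps = 0.005, 1e-6        # step size (learning rate) and tolerance ε
theta = np.zeros(X.shape[1])    # θ_0, ..., θ_n

while True:
    grad = X.T @ (X @ theta - y)    # step 1: ∂J/∂θ_i = Σ_j (h_θ(x^(j)) − y_j) x_i^(j)
    step = alpha * grad             # step 2: descent distance for each θ_i
    if np.all(np.abs(step) < eps):  # step 3: stop when every distance is below ε
        break
    theta = theta - step            # step 4: update all θ_i, then repeat from step 1
print(theta)                        # close to true_theta
```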
Learning Rate

$\theta^i = \theta^{i-1} - \eta \nabla L(\theta^{i-1})$

Set the learning rate η carefully.

[Figure: loss versus number of parameter updates for different learning rates — "very large", "large", "small", and "just make" (just right).]

If there are more than three parameters, you cannot visualize the loss surface itself. But you can always visualize the loss as a function of the number of parameter updates.

– Machine Learning http://speech.ee.ntu.edu.tw/~tlkagk/courses_ML17_2.html
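A small sketch of that "loss versus number of updates" picture, reusing the toy quadratic loss from the earlier example with a few illustrative learning rates (the specific values are assumptions, not from the slide):

```python
import numpy as np

def loss(theta):
    return float((theta[0] - 3) ** 2 + (theta[1] + 1) ** 2)

def grad(theta):
    return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

# Illustrative learning rates: too small (slow), about right, too large (diverges)
for eta in (0.01, 0.1, 1.1):
    theta = np.array([0.0, 0.0])
    history = []                              # loss after each update: the curve you can always plot
    for _ in range(30):
        theta = theta - eta * grad(theta)
        history.append(loss(theta))
    print(f"eta={eta}: final loss = {history[-1]:.3g}")
```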
Other Optimization Methods

Momentum
Nesterov
Adagrad
Adadelta – an extension of Adagrad
RMSprop – a special case of Adadelta
Adam – RMSprop with Momentum (the most widely used; see the sketch after this list)
Adamax – a variant of Adam
Nadam – Adam with a Nesterov momentum term
AMSGRAD – 2018 ICLR Best Paper
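To make the "Adam – RMSprop with Momentum" relation concrete, here is a minimal NumPy sketch of the Adam update; the hyperparameter defaults (lr, β₁, β₂, ε) are the commonly used ones, not values given in this text, and the toy loss is an illustrative assumption:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: a Momentum-style first moment (m) combined with an
    RMSprop-style second moment (v), both bias-corrected."""
    m = beta1 * m + (1 - beta1) * grad        # Momentum part: accumulated gradient direction
    v = beta2 * v + (1 - beta2) * grad ** 2   # RMSprop part: accumulated squared gradients
    m_hat = m / (1 - beta1 ** t)              # bias correction (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage on the quadratic loss from the earlier examples (illustrative only)
theta, m, v = np.array([0.0, 0.0]), np.zeros(2), np.zeros(2)
for t in range(1, 501):
    g = np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.1)
print(theta)   # approaches the minimum at (3, -1)
```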
Momentum, Nesterov

Momentum update: the actual step is the sum of the momentum step (the accumulated previous direction) and the gradient step at the current point.
Nesterov momentum update: first take the momentum step, then evaluate a "lookahead" gradient at that point (slightly different from the original momentum); the actual step combines the two.

The descent direction at time t is determined not only by the gradient at the current point but also by the previously accumulated descent direction. A typical value for β₁ is 0.9, which means the descent direction is mostly the accumulated one, tilted only slightly toward the current gradient. Think of a car turning on a highway: moving forward at high speed, it can only veer slightly; a sharp turn would cause an accident.

Since the main descent direction at time t is decided by the accumulated momentum and the current gradient alone has little say, rather than looking at the current gradient it is better to first take a step along the accumulated momentum and then decide how to move from there. NAG therefore computes the descent direction at the point reached after a step along the accumulated momentum. (A sketch of both updates follows.)
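A minimal sketch of the two updates described above, in one common formulation; the learning rate, β, and the idea of passing in a gradient function are illustrative assumptions, not notation from this text:

```python
import numpy as np

def momentum_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """Classical momentum: accumulate past descent directions, then step."""
    v = beta * v + grad_fn(theta)            # accumulated descent direction
    return theta - lr * v, v

def nesterov_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """Nesterov (NAG): look ahead along the accumulated momentum first,
    then evaluate the gradient at that lookahead point."""
    lookahead = theta - lr * beta * v        # where the accumulated momentum would take us
    v = beta * v + grad_fn(lookahead)
    return theta - lr * v, v
```

With β = 0.9 the accumulated term dominates each update, which matches the "steer only slightly at high speed" intuition above; the only difference in NAG is where the gradient is evaluated.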