where
$$R_{1:T'}(w) \;=\; \sum_{t=1}^{T'} f_t\big(w^{(t)}\big) - \sum_{t=1}^{T'} f_t(w)
\qquad\text{and}\qquad
R_{(T'+1):T}(w) \;=\; \sum_{t=T'+1}^{T} f_t\big(w^{(t)}\big) - \sum_{t=T'+1}^{T} f_t(w).$$
The regret from the first $T'$ iterations can be bounded by Theorem 6, and the bound on $R_{(T'+1):T}(w)$ can be derived by a proof similar to that of Theorem 5.

Theorem 6 Under the conditions of Theorem 5, letting $\alpha_0 = 0$, $\mu' = \max_{1\le t\le T'}\{\mu_t\}$, $\sigma^*$ be the smallest nonzero singular value of $\sum_{t=1}^{T'} g^{(t)}(g^{(t)})^{\top}$, and $T'$ satisfy (12), the first $T'$ iterations of Algorithm 3 have the following regret:
$$R_{1:T'}(w) \;\le\; 2(CL)^2 \sum_{t=1}^{T'} \eta_t + \frac{m-1}{2(\eta_1+\mu')} + \frac{m(m-1)}{2(\eta_1+\mu')}\,
\ln\!\left(1 + \frac{2\sum_{t=1}^{T'} \|g^{(t)}\|_2^2}{(1+r)\,r\,\sigma^*}\right). \tag{13}$$

Combining the above results, we can conclude the regret bound for the hyperparameter-free algorithm in Theorem 7. In practice, we often set $m$ to be much smaller than $d$ and $T$, which implies that $T'$ is much smaller than $T$. Hence, the regret bound of Theorem 7 is similar to that of Theorem 5 when $\alpha_0$ is close to 0. We can therefore use RFD-SON with $\alpha_0 = 0$ and $\eta_t = 1/t$ to obtain a hyperparameter-free sketched online Newton algorithm (a code sketch of the resulting update follows Theorem 7). Luo et al. (2016) have proposed a hyperparameter-free online Newton algorithm without sketching, but their regret contains a term with coefficient $d$.

Theorem 7 Consider Algorithm 3 with $\alpha_0 = 0$. Let $T'$ satisfy (12), $\mu = \min_{1\le t\le T}\{\mu_t\}$, $\mu' = \max_{1\le t\le T'}\{\mu_t\}$, $\mathcal{K} = \bigcap_{t=1}^{T}\mathcal{K}_t$, let $\sigma^*$ be defined as in Theorem 6, and let $\alpha_0' = \det\big(H^{(T')}\big)$. Then under Assumptions 1 and 2, for any $w \in \mathcal{K}$, we have
$$R_T(w) \;\le\; 2(CL)^2 \sum_{t=1}^{T} \eta_t + \frac{m-1}{2(\eta_1+\mu')} + \frac{m(m-1)}{2(\eta_1+\mu')}\,
\ln\!\left(1 + \frac{2\sum_{t=1}^{T'} \|g^{(t)}\|_2^2}{(1+r)\,r\,\sigma^*}\right)
+ \frac{1}{2}\,\big\|w^{(T')}\big\|_{H^{(T')}}^2
+ \frac{m}{2(\mu+\eta_T)}\,
\ln\frac{\operatorname{tr}\big((B^{(T)})^{\top}B^{(T)}\big) + m\,\alpha^{(T)}}{m\,\alpha_0'}
+ \Omega_{\mathrm{RFD}}' \tag{14}$$
where
$$\Omega_{\mathrm{RFD}}' \;=\; \frac{d-m}{2(\mu+\eta_T)}\,\ln\frac{\alpha^{(T)}}{\alpha_0'}
+ \frac{m}{4(\mu+\eta_T)} \sum_{t=T'+1}^{T} \frac{\big(\sigma_m^{(t)}\big)^2}{\alpha^{(t)}}
+ C^2 \sum_{t=T'+1}^{T} \big(\sigma_m^{(t-1)}\big)^2.$$
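To make the hyperparameter-free update concrete, the following is a minimal Python/NumPy sketch (not the paper's Matlab implementation) of a single sketched Newton step computed from the RFD output $(B^{(t)}, \alpha^{(t)})$ with $\alpha_0 = 0$ and $\eta_t = 1/t$. The function name and interface are our own illustration; the point is that the Woodbury identity reduces the $d \times d$ solve to an $m \times m$ one.

```python
import numpy as np

def sketched_newton_direction(B, alpha, t, g):
    """Solve (B^T B + (alpha + eta_t) I_d) p = g with eta_t = 1/t.

    B     : (m, d) RFD sketch matrix B^(t)
    alpha : scalar RFD correction alpha^(t) (alpha_0 = 0, so nothing to tune)
    g     : (d,) gradient g^(t)

    Woodbury identity:
    (lam I_d + B^T B)^{-1} g = (g - B^T (lam I_m + B B^T)^{-1} B g) / lam,
    so only an m x m system is solved, costing O(m^2 d) per step.
    """
    m = B.shape[0]
    lam = alpha + 1.0 / t            # hyperparameter-free regularizer
    K = lam * np.eye(m) + B @ B.T    # small m x m matrix
    return (g - B.T @ np.linalg.solve(K, B @ g)) / lam
```

The step $w^{(t+1)} = w^{(t)} - \texttt{sketched\_newton\_direction}(B, \alpha, t, g)$, followed by the projection onto $\mathcal{K}_t$, corresponds to the sketched Newton update discussed above; FULL-ON would instead solve the full $d \times d$ system.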
6. Experiments

In this section, we evaluate the performance of robust frequent directions (RFD) and of the online Newton step by RFD (RFD-SON) on six real-world data sets, "a9a," "gisette," "sido0," "farm-ads," "rcv1" and "real-sim," whose details are listed in Table 1. The data sets "sido0" and "farm-ads" can be found on the Causality Workbench (https://www.causality.inf.ethz.ch/data/SIDO.html) and the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Farm+Ads); the others can be downloaded from the LIBSVM repository (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The experiments are conducted in Matlab and run on a server with an Intel(R) Core(TM) i7-3770 CPU (3.40GHz × 2), 8GB RAM and a 64-bit Windows Server 2012 system.

data sets    n        d       source
a9a          32,561   123     (Platt, 1999)
gisette      6,000    5,000   (Guyon et al., 2004)
sido0        12,678   4,932   (Guyon et al., 2008)
farm-ads     4,143    54,877  (Mesterharm and Pazzani, 2011)
rcv1         20,242   47,236  (Lewis et al., 2004)
real-sim     72,309   20,958  (McCallum)

Table 1: Summary of data sets used in our experiments

6.1. Matrix Approximation

We evaluate the approximation errors of the deterministic sketching algorithms, including frequent directions (FD) (Liberty, 2013; Ghashami et al., 2016), parameterized frequent directions (PFD), compensative frequent directions (CFD) (Desai et al., 2016) and robust frequent directions (RFD). For a given data set $A \in \mathbb{R}^{n \times d}$ of $n$ samples with $d$ features, we use the accelerated algorithms (see details in Appendix A) to approximate the covariance matrix $A^{\top}A$ by $B^{\top}B$ for FD, PFD and CFD, and by $B^{\top}B + \alpha I_d$ for RFD. We measure the performance by the relative spectral norm error, which we report while varying the sketch size $m$ (a minimal implementation sketch is given after the figure discussion below).

Figure 1 shows the performance of FD, CFD and RFD. These three methods have no extra hyperparameter, and their outputs rely only on the sketch size. The relative error of RFD is always smaller than that of FD and CFD. Moreover, the error of RFD is nearly half of the error of FD in most cases, which matches our theoretical analysis in Theorem 3 very well.

[Figure 1: Comparison of relative spectral error of FD, CFD and RFD with proportion of sketching $m/d$, on (a) a9a, (b) gisette, (c) sido0, (d) farm-ads, (e) rcv1 and (f) real-sim.]

Figure 2 compares the performance of RFD and PFD with different choices of the hyperparameter. We use PFD-$\beta$ to refer to the PFD algorithm in which $\lfloor \beta m \rfloor$ singular values are affected by the shrinkage steps; the extra hyperparameter $\beta$ is tuned over $\{0.2, 0.4, 0.6, 0.8\}$. The results show that RFD is better than PFD in most cases. PFD can sometimes achieve a lower approximation error with a good choice of $\beta$, but selecting this hyperparameter requires additional computation.

[Figure 2: Comparison of relative spectral error of PFD-$\beta$ ($\beta \in \{0.2, 0.4, 0.6, 0.8\}$) and RFD with proportion of sketching $m/d$, on the same six data sets.]
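For concreteness, the following is a minimal, unaccelerated Python/NumPy sketch of RFD and of the error measure above, intended only to fix ideas; the experiments themselves use the accelerated variants of Appendix A. We assume the RFD rule of adding half of each shrinkage value to $\alpha$ (setting the `alpha` update to zero recovers plain FD); the function names are illustrative.

```python
import numpy as np

def rfd(A, m):
    """Robust frequent directions, one row at a time (unaccelerated).
    Returns (B, alpha) such that A^T A is approximated by B^T B + alpha * I_d."""
    n, d = A.shape
    B = np.zeros((m, d))
    alpha = 0.0
    for i in range(n):
        B[-1] = A[i]                    # the last row is zero after every shrinkage
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        delta = s[-1] ** 2              # squared m-th (smallest) singular value
        B = np.sqrt(np.maximum(s**2 - delta, 0.0))[:, None] * Vt
        alpha += delta / 2.0            # RFD keeps half the shrinkage; FD discards it
    return B, alpha

def relative_spectral_error(A, B, alpha=0.0):
    """||A^T A - (B^T B + alpha I)||_2 / ||A^T A||_2; use alpha = 0 for FD/PFD/CFD."""
    M = A.T @ A
    E = M - (B.T @ B + alpha * np.eye(A.shape[1]))
    return np.linalg.norm(E, 2) / np.linalg.norm(M, 2)
```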
6.2. Online Learning

We now evaluate the performance of RFD-SON. We use the least squares loss $f_t(w) = (w^{\top}x^{(t)} - y^{(t)})^2$ and set $\mathcal{K}_t = \{w : |w^{\top}x^{(t)}| \le 1\}$. In the experiments, we use the doubling space strategy (Algorithm 5 in Appendix A). We use 70% of each data set for training and the rest for testing. The algorithms in the experiments include ADAGRAD, the standard online Newton step with the full Hessian (Duchi et al., 2011) (FULL-ON), the sketched online Newton steps with frequent directions (FD-SON), parameterized frequent directions (PFD-SON), random projections (RP-SON) and Oja's algorithm (Oja-SON) (Luo et al., 2016; Desai et al., 2016), and our proposed sketched online Newton step with RFD (RFD-SON).

The hyperparameter $\alpha_0$ is tuned from $\{10^{-3}, 10^{-2}, \dots, 10^{5}, 10^{6}\}$ for all methods, and we let $\eta_t = 1/t$ for the SON algorithms. FULL-ON is too expensive and impractical for large $d$, so we exclude it from the experiments on "farm-ads," "rcv1" and "real-sim." For PFD-SON, we heuristically set $\beta = 0.2$ because it usually achieves good performance in approximating the covariance matrix. Additionally, RFD-SON includes the result with $\alpha_0 = 0$ (RFD0-SON). The sketch size of the sketched online Newton methods is chosen from $\{5, 10, 20\}$ for "a9a," "gisette" and "sido0," and from $\{20, 30, 50\}$ for "farm-ads," "rcv1" and "real-sim." We measure performance according to two metrics (Duchi et al., 2011): the online error rate, and the test set performance of the predictor at the end of one pass through the training data (a sketch of this evaluation protocol follows).
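To make the two metrics precise, the sketch below (Python/NumPy, illustrative names) performs one predict-then-update pass over the training data, counting mistakes before each update (the online error rate), and then scores the final predictor on the held-out test set. For brevity, the learner here is diagonal ADAGRAD on the least squares loss; the SON updates and the projection onto $\mathcal{K}_t$ are omitted.

```python
import numpy as np

def one_pass(X_tr, y_tr, X_te, y_te, lr=1.0, eps=1e-8):
    """One pass over the training data with labels in {-1, +1}.
    Returns (online error rate, test accuracy of the final predictor)."""
    d = X_tr.shape[1]
    w, G, mistakes = np.zeros(d), np.zeros(d), 0
    for x, y in zip(X_tr, y_tr):
        if np.sign(x @ w) != y:           # predict before seeing the label
            mistakes += 1
        g = 2.0 * (x @ w - y) * x         # gradient of the least squares loss
        G += g * g                        # accumulated squared gradients
        w -= lr * g / (np.sqrt(G) + eps)  # ADAGRAD diagonal step
    online_error = mistakes / len(y_tr)
    test_accuracy = np.mean(np.sign(X_te @ w) == y_te)
    return online_error, test_accuracy
```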
We are interested in how the hyperparameter $\alpha_0$ affects the performance of the algorithms. We display the test set performance in Figures 3 and 4, compare the online error rate of RFD0-SON with that of FULL-ON in Figure 5, and show the comparison between RFD0-SON and the other SON methods with different choices of $\alpha_0$ in Figures 6-11.

We also report the accuracy on the test sets for all algorithms after one pass with the best $\alpha_0$ in Table 2, and the corresponding running times in Table 3. All SON algorithms can perform well with the best choice of $\alpha_0$. However, only RFD0-SON performs well without tuning the hyperparameter, while all the baseline methods (ADAGRAD, FD-SON, PFD-SON, RP-SON and Oja-SON) are very sensitive to the value of $\alpha_0$. Sub-figures (j)-(l) of Figures 5-11 show that RFD-SON usually has good performance with small $\alpha_0$, which validates our theoretical analysis in Theorem 5. The choice of this hyperparameter has almost no effect on RFD-SON on the data sets "a9a," "gisette," "sido0" and "farm-ads." These results verify that RFD-SON is a very stable algorithm in practice.

[Figure 3: Comparison of the test error at the end of one pass against $\alpha_0$ on "a9a", "gisette" and "sido0", with panels (a)-(c) a9a for m = 5, 10, 20, (d)-(f) gisette for m = 5, 10, 20, and (g)-(i) sido0 for m = 5, 10, 20.]