where
$$R_{1:T'}(w) \;=\; \sum_{t=1}^{T'} f_t\big(w^{(t)}\big) - \sum_{t=1}^{T'} f_t(w)
\qquad\text{and}\qquad
R_{(T'+1):T}(w) \;=\; \sum_{t=T'+1}^{T} f_t\big(w^{(t)}\big) - \sum_{t=T'+1}^{T} f_t(w).$$
The regret from the first $T'$ iterations can be bounded by Theorem 6, and the bound on $R_{(T'+1):T}(w)$ can be derived by a proof similar to that of Theorem 5.

Theorem 6 Under the conditions of Theorem 5, letting $\alpha_0 = 0$, $\mu' = \max_{1\le t\le T'}\{\mu_t\}$, $\sigma^*$ be the smallest nonzero singular value of $\sum_{t=1}^{T'} g^{(t)}(g^{(t)})^{\top}$, and $T'$ satisfy (12), the first $T'$ iterations of Algorithm 3 have the following regret:
$$R_{1:T'}(w) \;\le\; 2(CL)^2 \sum_{t=1}^{T'} \eta_t + \frac{m-1}{2(\eta_1+\mu')} + \frac{m(m-1)}{2(\eta_1+\mu')}\,
\ln\!\left(1 + \frac{2\sum_{t=1}^{T'} \|g^{(t)}\|_2^2}{(1+r)\,r\,\sigma^*}\right). \tag{13}$$

Combining the above results, we can conclude the regret bound for the hyperparameter-free algorithm in Theorem 7. In practice, we often set $m$ to be much smaller than $d$ and $T$, which implies that $T'$ is much smaller than $T$. Hence, the regret bound of Theorem 7 is similar to that of Theorem 5 when $\alpha_0$ is close to 0. We can therefore use RFD-SON with $\alpha_0 = 0$ and $\eta_t = 1/t$ to obtain a hyperparameter-free sketched online Newton algorithm (a code sketch of the resulting update follows Theorem 7). Luo et al. (2016) have proposed a hyperparameter-free online Newton algorithm without sketching, but their regret contains a term with coefficient $d$.

Theorem 7 Consider Algorithm 3 with $\alpha_0 = 0$. Let $T'$ satisfy (12), $\mu = \min_{1\le t\le T}\{\mu_t\}$, $\mu' = \max_{1\le t\le T'}\{\mu_t\}$, $\mathcal{K} = \bigcap_{t=1}^{T}\mathcal{K}_t$, let $\sigma^*$ be defined as in Theorem 6, and let $\alpha_0' = \det\big(H^{(T')}\big)$. Then under Assumptions 1 and 2, for any $w \in \mathcal{K}$, we have
$$R_T(w) \;\le\; 2(CL)^2 \sum_{t=1}^{T} \eta_t + \frac{m-1}{2(\eta_1+\mu')} + \frac{m(m-1)}{2(\eta_1+\mu')}\,
\ln\!\left(1 + \frac{2\sum_{t=1}^{T'} \|g^{(t)}\|_2^2}{(1+r)\,r\,\sigma^*}\right)
+ \frac{1}{2}\,\big\|w^{(T')}\big\|_{H^{(T')}}^2
+ \frac{m}{2(\mu+\eta_T)}\,
\ln\frac{\operatorname{tr}\big((B^{(T)})^{\top}B^{(T)}\big) + m\,\alpha^{(T)}}{m\,\alpha_0'}
+ \Omega_{\mathrm{RFD}}' \tag{14}$$
where
$$\Omega_{\mathrm{RFD}}' \;=\; \frac{d-m}{2(\mu+\eta_T)}\,\ln\frac{\alpha^{(T)}}{\alpha_0'}
+ \frac{m}{4(\mu+\eta_T)} \sum_{t=T'+1}^{T} \frac{\big(\sigma_m^{(t)}\big)^2}{\alpha^{(t)}}
+ C^2 \sum_{t=T'+1}^{T} \big(\sigma_m^{(t-1)}\big)^2.$$
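To make the hyperparameter-free update concrete, the following is a minimal Python/NumPy sketch (not the paper's Matlab implementation) of a single sketched Newton step computed from the RFD output $(B^{(t)}, \alpha^{(t)})$ with $\alpha_0 = 0$ and $\eta_t = 1/t$. The function name and interface are our own illustration; the point is that the Woodbury identity reduces the $d \times d$ solve to an $m \times m$ one.

```python
import numpy as np

def sketched_newton_direction(B, alpha, t, g):
    """Solve (B^T B + (alpha + eta_t) I_d) p = g with eta_t = 1/t.

    B     : (m, d) RFD sketch matrix B^(t)
    alpha : scalar RFD correction alpha^(t) (alpha_0 = 0, so nothing to tune)
    g     : (d,) gradient g^(t)

    Woodbury identity:
    (lam I_d + B^T B)^{-1} g = (g - B^T (lam I_m + B B^T)^{-1} B g) / lam,
    so only an m x m system is solved, costing O(m^2 d) per step.
    """
    m = B.shape[0]
    lam = alpha + 1.0 / t            # hyperparameter-free regularizer
    K = lam * np.eye(m) + B @ B.T    # small m x m matrix
    return (g - B.T @ np.linalg.solve(K, B @ g)) / lam
```

The step $w^{(t+1)} = w^{(t)} - \texttt{sketched\_newton\_direction}(B, \alpha, t, g)$, followed by the projection onto $\mathcal{K}_t$, corresponds to the sketched Newton update discussed above; FULL-ON would instead solve the full $d \times d$ system.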
6. Experiments

In this section, we evaluate the performance of robust frequent directions (RFD) and of the online Newton step by RFD (RFD-SON) on six real-world data sets, "a9a," "gisette," "sido0," "farm-ads," "rcv1" and "real-sim," whose details are listed in Table 1. The data sets "sido0" and "farm-ads" can be found on the Causality Workbench (https://www.causality.inf.ethz.ch/data/SIDO.html) and the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Farm+Ads); the others can be downloaded from the LIBSVM repository (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The experiments are conducted in Matlab and run on a server with an Intel(R) Core(TM) i7-3770 CPU (3.40GHz × 2), 8GB RAM and a 64-bit Windows Server 2012 system.

data sets    n        d       source
a9a          32,561   123     (Platt, 1999)
gisette      6,000    5,000   (Guyon et al., 2004)
sido0        12,678   4,932   (Guyon et al., 2008)
farm-ads     4,143    54,877  (Mesterharm and Pazzani, 2011)
rcv1         20,242   47,236  (Lewis et al., 2004)
real-sim     72,309   20,958  (McCallum)

Table 1: Summary of data sets used in our experiments

6.1. Matrix Approximation

We evaluate the approximation errors of the deterministic sketching algorithms, including frequent directions (FD) (Liberty, 2013; Ghashami et al., 2016), parameterized frequent directions (PFD), compensative frequent directions (CFD) (Desai et al., 2016) and robust frequent directions (RFD). For a given data set $A \in \mathbb{R}^{n \times d}$ of $n$ samples with $d$ features, we use the accelerated algorithms (see details in Appendix A) to approximate the covariance matrix $A^{\top}A$ by $B^{\top}B$ for FD, PFD and CFD, and by $B^{\top}B + \alpha I_d$ for RFD. We measure the performance by the relative spectral norm error, which we report while varying the sketch size $m$ (a minimal implementation sketch is given after the figure discussion below).

Figure 1 shows the performance of FD, CFD and RFD. These three methods have no extra hyperparameter, and their outputs rely only on the sketch size. The relative error of RFD is always smaller than that of FD and CFD. Moreover, the error of RFD is nearly half of the error of FD in most cases, which matches our theoretical analysis in Theorem 3 very well.

[Figure 1: Comparison of relative spectral error of FD, CFD and RFD with proportion of sketching $m/d$, on (a) a9a, (b) gisette, (c) sido0, (d) farm-ads, (e) rcv1 and (f) real-sim.]

Figure 2 compares the performance of RFD and PFD with different choices of the hyperparameter. We use PFD-$\beta$ to refer to the PFD algorithm in which $\lfloor \beta m \rfloor$ singular values are affected by the shrinkage steps; the extra hyperparameter $\beta$ is tuned over $\{0.2, 0.4, 0.6, 0.8\}$. The results show that RFD is better than PFD in most cases. PFD can sometimes achieve a lower approximation error with a good choice of $\beta$, but selecting this hyperparameter requires additional computation.

[Figure 2: Comparison of relative spectral error of PFD-$\beta$ ($\beta \in \{0.2, 0.4, 0.6, 0.8\}$) and RFD with proportion of sketching $m/d$, on the same six data sets.]
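For concreteness, the following is a minimal, unaccelerated Python/NumPy sketch of RFD and of the error measure above, intended only to fix ideas; the experiments themselves use the accelerated variants of Appendix A. We assume the RFD rule of adding half of each shrinkage value to $\alpha$ (setting the `alpha` update to zero recovers plain FD); the function names are illustrative.

```python
import numpy as np

def rfd(A, m):
    """Robust frequent directions, one row at a time (unaccelerated).
    Returns (B, alpha) such that A^T A is approximated by B^T B + alpha * I_d."""
    n, d = A.shape
    B = np.zeros((m, d))
    alpha = 0.0
    for i in range(n):
        B[-1] = A[i]                    # the last row is zero after every shrinkage
        _, s, Vt = np.linalg.svd(B, full_matrices=False)
        delta = s[-1] ** 2              # squared m-th (smallest) singular value
        B = np.sqrt(np.maximum(s**2 - delta, 0.0))[:, None] * Vt
        alpha += delta / 2.0            # RFD keeps half the shrinkage; FD discards it
    return B, alpha

def relative_spectral_error(A, B, alpha=0.0):
    """||A^T A - (B^T B + alpha I)||_2 / ||A^T A||_2; use alpha = 0 for FD/PFD/CFD."""
    M = A.T @ A
    E = M - (B.T @ B + alpha * np.eye(A.shape[1]))
    return np.linalg.norm(E, 2) / np.linalg.norm(M, 2)
```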
6.2. Online Learning

We now evaluate the performance of RFD-SON. We use the least squares loss $f_t(w) = (w^{\top}x^{(t)} - y^{(t)})^2$ and set $\mathcal{K}_t = \{w : |w^{\top}x^{(t)}| \le 1\}$. In the experiments, we use the doubling space strategy (Algorithm 5 in Appendix A). We use 70% of each data set for training and the rest for testing. The algorithms in the experiments include ADAGRAD, the standard online Newton step with the full Hessian (Duchi et al., 2011) (FULL-ON), the sketched online Newton steps with frequent directions (FD-SON), parameterized frequent directions (PFD-SON), random projections (RP-SON) and Oja's algorithm (Oja-SON) (Luo et al., 2016; Desai et al., 2016), and our proposed sketched online Newton step with RFD (RFD-SON).

The hyperparameter $\alpha_0$ is tuned from $\{10^{-3}, 10^{-2}, \dots, 10^{5}, 10^{6}\}$ for all methods, and we let $\eta_t = 1/t$ for the SON algorithms. FULL-ON is too expensive and impractical for large $d$, so we exclude it from the experiments on "farm-ads," "rcv1" and "real-sim." For PFD-SON, we heuristically set $\beta = 0.2$ because it usually achieves good performance in approximating the covariance matrix. Additionally, RFD-SON includes the result with $\alpha_0 = 0$ (RFD0-SON). The sketch size of the sketched online Newton methods is chosen from $\{5, 10, 20\}$ for "a9a," "gisette" and "sido0," and from $\{20, 30, 50\}$ for "farm-ads," "rcv1" and "real-sim." We measure performance according to two metrics (Duchi et al., 2011): the online error rate, and the test set performance of the predictor at the end of one pass through the training data (a sketch of this evaluation protocol follows).
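To make the two metrics precise, the sketch below (Python/NumPy, illustrative names) performs one predict-then-update pass over the training data, counting mistakes before each update (the online error rate), and then scores the final predictor on the held-out test set. For brevity, the learner here is diagonal ADAGRAD on the least squares loss; the SON updates and the projection onto $\mathcal{K}_t$ are omitted.

```python
import numpy as np

def one_pass(X_tr, y_tr, X_te, y_te, lr=1.0, eps=1e-8):
    """One pass over the training data with labels in {-1, +1}.
    Returns (online error rate, test accuracy of the final predictor)."""
    d = X_tr.shape[1]
    w, G, mistakes = np.zeros(d), np.zeros(d), 0
    for x, y in zip(X_tr, y_tr):
        if np.sign(x @ w) != y:           # predict before seeing the label
            mistakes += 1
        g = 2.0 * (x @ w - y) * x         # gradient of the least squares loss
        G += g * g                        # accumulated squared gradients
        w -= lr * g / (np.sqrt(G) + eps)  # ADAGRAD diagonal step
    online_error = mistakes / len(y_tr)
    test_accuracy = np.mean(np.sign(X_te @ w) == y_te)
    return online_error, test_accuracy
```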
We are interested in how the hyperparameter $\alpha_0$ affects the performance of the algorithms. We display the test set performance in Figures 3 and 4, compare the online error rate of RFD0-SON with that of FULL-ON in Figure 5, and show the comparison between RFD0-SON and the other SON methods with different choices of $\alpha_0$ in Figures 6-11.

We also report the accuracy on the test sets for all algorithms after one pass with the best $\alpha_0$ in Table 2, and the corresponding running times in Table 3. All SON algorithms can perform well with the best choice of $\alpha_0$. However, only RFD0-SON performs well without tuning the hyperparameter, while all the baseline methods (ADAGRAD, FD-SON, PFD-SON, RP-SON and Oja-SON) are very sensitive to the value of $\alpha_0$. Sub-figures (j)-(l) of Figures 5-11 show that RFD-SON usually has good performance with small $\alpha_0$, which validates our theoretical analysis in Theorem 5. The choice of this hyperparameter has almost no effect on RFD-SON on the data sets "a9a," "gisette," "sido0" and "farm-ads." These results verify that RFD-SON is a very stable algorithm in practice.

[Figure 3: Comparison of the test error at the end of one pass against $\alpha_0$ on "a9a", "gisette" and "sido0", with panels (a)-(c) a9a for m = 5, 10, 20, (d)-(f) gisette for m = 5, 10, 20, and (g)-(i) sido0 for m = 5, 10, 20.]