We first make the following assumptions, which have been widely used in stochastic optimization.

Assumption 1 (Lower bound). The global loss function $F(w)$ is bounded below: $\exists F^* \in \mathbb{R}$, $F(w) \geq F^*$, $\forall w \in \mathbb{R}^d$.

Assumption 2 (Bounded bias). Any loyal worker can use its locally stored training instances to estimate the global gradient with bounded bias $\kappa$: $\|\mathbb{E}[\nabla f(w; z_i)] - \nabla F(w)\| \leq \kappa$, $\forall w \in \mathbb{R}^d$.

Assumption 3 (Bounded gradient). $\nabla F(w)$ is bounded: $\exists D \in \mathbb{R}^+$, $\|\nabla F(w)\| \leq D$, $\forall w \in \mathbb{R}^d$.

Assumption 4 (Bounded variance). $\mathbb{E}[\|\nabla f(w; z_i) - \mathbb{E}[\nabla f(w; z_i) \mid w]\|^2 \mid w] \leq \sigma^2$, $\forall w \in \mathbb{R}^d$.

Assumption 5 (L-smoothness). The global loss function $F(w)$ is differentiable and $L$-smooth: $\|\nabla F(w) - \nabla F(w')\| \leq L\|w - w'\|$, $\forall w, w' \in \mathbb{R}^d$.

Let $N^{(t)}$ be the $(q+1)$-th smallest value in $\{N_b^t\}_{b \in [B]}$, where $N_b^t$ is the number of gradients stored in buffer $b$ at the $t$-th iteration. We define the constant
$$\Lambda_{B,q,r} = \frac{(B-r)\sqrt{B-r+1}}{\sqrt{(B-q-1)(q-r+1)}},$$
which will appear in Lemma 1 and Lemma 2.

Lemma 1. If Aggr(·) is q-BR, and there are at most $r$ Byzantine workers ($r \leq q$), we have:
$$\mathbb{E}[\|G^t\|^2 \mid w^t] \leq \Lambda_{B,q,r}\, d \cdot (D^2 + \sigma^2 / N^{(t)}).$$

Lemma 2. If Aggr(·) is q-BR, and the total number of heavily delayed workers and Byzantine workers is not larger than $r$ ($r \leq q$), we have:
$$\|\mathbb{E}[G^t - \nabla F(w^t) \mid w^t]\| \leq \Lambda_{B,q,r}\, d \left( \tau_{\max} L \cdot \big[\Lambda_{B,q,r}\, d\, (D^2 + \sigma^2 / N^{(t)})\big]^{\frac{1}{2}} + \sigma + \kappa \right).$$

Theorem 1. Let $\tilde{D} = \frac{1}{T}\sum_{t=0}^{T-1} (D^2 + \sigma^2 / N^{(t)})^{\frac{1}{2}}$. If Aggr(·) is q-BR, $B = O(r)$, and the total number of heavily delayed workers and Byzantine workers is not larger than $r$ ($r \leq q$), with learning rate $\eta = O(\frac{1}{L\sqrt{T}})$, we have:
$$\frac{\sum_{t=0}^{T-1} \mathbb{E}[\|\nabla F(w^t)\|^2]}{T} \leq O\left(\frac{L[F(w^0) - F^*]}{T^{\frac{1}{2}}}\right) + O\left(\frac{r d \tilde{D}}{T^{\frac{1}{2}} (q-r+1)^{\frac{1}{2}}}\right) + O\left(\frac{r D d \sigma}{(q-r+1)^{\frac{1}{2}}}\right) + O\left(\frac{r D d \kappa}{(q-r+1)^{\frac{1}{2}}}\right) + O\left(\frac{r^{\frac{3}{2}} L D \tilde{D} d^{\frac{3}{2}} \tau_{\max}}{(q-r+1)^{\frac{3}{4}}}\right).$$

Please note that the convergence rate of vanilla ASGD is $O(T^{-\frac{1}{2}})$. Hence, Theorem 1 indicates that BASGD has a theoretical convergence rate as fast as that of vanilla ASGD, with an extra constant variance. The term $O(rDd\sigma (q-r+1)^{-\frac{1}{2}})$ is caused by the aggregation function, which can be deemed as a sacrifice for Byzantine resilience. The term $O(rDd\kappa (q-r+1)^{-\frac{1}{2}})$ is caused by the differences of training instances among different workers. In independent and identically distributed (i.i.d.) cases, $\kappa = 0$ and the term vanishes. The term $O(r^{\frac{3}{2}} L D \tilde{D} d^{\frac{3}{2}} \tau_{\max} (q-r+1)^{-\frac{3}{4}})$ is caused by the delay, and is related to the parameter $\tau_{\max}$. This term is also related to the buffer size: when $N_b^t$ increases, $N^{(t)}$ may increase, and thus $\tilde{D}$ will decrease. Namely, a larger buffer size results in a smaller $\tilde{D}$. Besides, the factor $(q-r+1)^{-\frac{1}{2}}$ or $(q-r+1)^{-\frac{3}{4}}$ decreases as $q$ increases, and increases as $r$ increases.

Although general, the bound presented in Theorem 1 is relatively loose in high-dimensional cases, since $d$ appears in all three extra terms. To obtain a tighter bound, we introduce Theorem 2 for BASGD with an $(A_1, A_2)$-effective aggregation function (Definition 5).

Theorem 2. Assume that the total number of heavily delayed workers and Byzantine workers is not larger than $r$, $B = O(r)$, and Aggr(·) is an $(A_1, A_2)$-effective aggregation function in this case. With learning rate $\eta = O(\frac{1}{\sqrt{LT}})$, in general asynchronous cases we have:
$$\frac{\sum_{t=0}^{T-1} \mathbb{E}[\|\nabla F(w^t)\|^2]}{T} \leq O\left(\frac{L^{\frac{1}{2}}[F(w^0) - F^*]}{T^{\frac{1}{2}}}\right) + O\left(\frac{L^{\frac{1}{2}} \tau_{\max} D A_2 r^{\frac{1}{2}}}{T^{\frac{1}{2}}}\right) + O\left(\frac{L^{\frac{1}{2}} (A_2)^2}{T^{\frac{1}{2}}}\right) + O\left(\frac{L^{\frac{5}{2}} (A_2)^2 \tau_{\max}^2 r}{T^{\frac{3}{2}}}\right) + A_1.$$
Theorem 2 indicates that if Aggr(·) makes a synchronous BL method converge (i.e., satisfies Definition 5), BASGD converges when using Aggr(·) as the aggregation function. Hence, BASGD can also be seen as a technique of asynchronization. That is to say, new asynchronous methods can be obtained from synchronous ones when using BASGD. The extra constant term $A_1$ is caused by gradient bias. When there are no Byzantine workers ($r = 0$) and instances are i.i.d. across workers, letting $B = 1$ and $\mathrm{Aggr}(h_1, \ldots, h_B) = \mathrm{Aggr}(h_1) = h_1$, BASGD degenerates to vanilla ASGD. In this case, there is no gradient bias ($A_1 = 0$), and BASGD has a convergence rate of $O(1/\sqrt{T})$, which is the same as that of vanilla ASGD (Liu & Zhang, 2021).
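To make the role of Aggr(·) concrete, the following is a minimal PyTorch sketch of coordinate-wise median and q-trimmed-mean aggregation, the two rules used in the experiments of Section 5, applied to the $B$ buffer values. The function names, the tensor layout, and the assumption that each buffer contributes its averaged gradient are illustrative choices and not the exact implementation.

```python
import torch

def coordinate_wise_median(buffer_means):
    """Aggregate a list of B buffer averages (one d-dimensional tensor per buffer)
    by taking the median of each coordinate (for even B, torch.median returns
    the lower of the two middle values)."""
    stacked = torch.stack(buffer_means, dim=0)   # shape: (B, d)
    return stacked.median(dim=0).values

def coordinate_wise_trimmed_mean(buffer_means, q):
    """One common form of q-trimmed-mean: in each coordinate, discard the q largest
    and q smallest values among the B buffer averages, then average the rest."""
    stacked = torch.stack(buffer_means, dim=0)   # shape: (B, d)
    B = stacked.shape[0]
    sorted_vals, _ = torch.sort(stacked, dim=0)  # sort each coordinate independently
    return sorted_vals[q:B - q].mean(dim=0)      # mean of the middle B - 2q entries
```

With $B = 1$ and the identity aggregation, this reduces to the vanilla ASGD update discussed above.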
Meanwhile, it remains uncertain whether the dependence on the staleness parameter $\tau_{\max}$ is tight. Theorem 2 illustrates that BASGD has a convergence rate of $O(\tau_{\max}/T^{\frac{1}{2}})$, while the convergence rate of vanilla ASGD can reach $O(\tau_{\max}/T)$. To the best of our knowledge, there exist few works revealing the tightness of $\tau_{\max}$ in asynchronous BL, and we leave this for future work.

In general cases, Theorem 2 guarantees that BASGD finds, in expectation, a point whose squared $L_2$-norm of gradient is not larger than $A_1$ (though not necessarily a point near a stationary point). Please note that Assumption 3 already guarantees that the gradient's squared $L_2$-norm is not larger than $D^2$. We introduce Proposition 2 to show that $A_1$ is guaranteed to be smaller than $D^2$ under a mild condition.

Proposition 2. Suppose Aggr(·) is an $(A_1, A_2)$-effective aggregation function, and $G^t_{syn}$ is aggregated by Aggr(·) in the synchronous setting. If $\mathbb{E}[\|G^t_{syn} - \nabla F(w^t)\| \mid w^t] \leq D$, $\forall w^t \in \mathbb{R}^d$, then we have $A_1 \leq D^2$.

$G^t_{syn}$ is the aggregated result of Aggr(·), and is a robust estimator of $\nabla F(w^t)$ used for updating. Since $\|\nabla F(w^t)\| \leq D$, $\nabla F(w^t)$ locates in a ball with radius $D$. The condition $\mathbb{E}[\|G^t_{syn} - \nabla F(w^t)\| \mid w^t] \leq D$ means that the bias of $G^t_{syn}$ is not larger than the radius $D$, which is a mild condition for Aggr(·).

As many existing works have shown (Assran et al., 2020; Nokleby et al., 2020), speed-up is also an important aspect of distributed learning methods. In BASGD, different workers can compute gradients concurrently, making each buffer be filled more quickly and thus speeding up the model updating. However, we mainly focus on Byzantine resilience in this work. Speed-up will be thoroughly studied in future work.

5. Experiment

In this section, we empirically evaluate the performance of BASGD and baselines in both image classification (IC) and natural language processing (NLP) applications. Our experiments are conducted on a distributed platform with dockers. Each docker is bound to an NVIDIA Tesla V100 (32G) GPU (in IC) or an NVIDIA Tesla K80 GPU (in NLP). Please note that the different GPU cards do not affect the reported metrics in the experiment. We choose 30 dockers as workers in IC, and 8 dockers in NLP. An extra docker is chosen as the server. All algorithms are implemented with PyTorch 1.3.

5.1. Experimental Setting

We compare the performance of different methods under two types of attack: negative gradient attack (NG-attack) and random disturbance attack (RD-attack). Byzantine workers with NG-attack send $\tilde{g}_{NG} = -k_{atk} \cdot g$ to the server, where $g$ is the true gradient and $k_{atk} \in \mathbb{R}^+$ is a parameter. Byzantine workers with RD-attack send $\tilde{g}_{RD} = g + g_{rnd}$ to the server, where $g_{rnd}$ is a random vector sampled from the normal distribution $\mathcal{N}(0, \|\sigma_{atk} g\|^2 \cdot I)$. Here, $\sigma_{atk}$ is a parameter and $I$ is an identity matrix. NG-attack is a typical kind of malicious attack, while RD-attack can be seen as an accidental failure with expectation 0 (a code sketch of both attacks is given at the end of this subsection). Besides, each worker is manually set to have a delay, which is $k_{del}$ times the computing time.

The training set is randomly and equally distributed to different workers. We use the average top-1 test accuracy (in IC) or average perplexity (in NLP) on all workers w.r.t. epochs as the final metrics. For BASGD, we use median and trimmed-mean as aggregation functions. The reassignment interval is set to 1 second. Top-1 test accuracy (in IC) w.r.t. wall-clock time of BASGD with more aggregation functions is reported in Appendix C of the supplementary material.

Because BASGD is an ABL method, SBL methods cannot be directly compared with BASGD. The ABL method Zeno++ cannot be directly compared with BASGD either, because Zeno++ needs to store some instances on the server, and the number of instances stored on the server will affect the performance of Zeno++ (Xie et al., 2020). Hence, we compare BASGD with ASGD and Kardam in our experiments. We set the dampening function $\Lambda(\tau) = \frac{1}{1+\tau}$ for Kardam, as suggested in (Damaskinos et al., 2018).
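As referenced above, the following is a minimal sketch of how the two attacks can be simulated by a Byzantine worker, assuming the worker has already computed its true gradient $g$ as a tensor; the helper names ng_attack and rd_attack are illustrative and not taken from the actual implementation.

```python
import torch

def ng_attack(g, k_atk=10.0):
    # Negative gradient attack: send -k_atk times the true gradient.
    return -k_atk * g

def rd_attack(g, sigma_atk=0.2):
    # Random disturbance attack: add Gaussian noise g_rnd ~ N(0, ||sigma_atk * g||^2 * I),
    # i.e., every coordinate is perturbed with standard deviation ||sigma_atk * g||.
    std = torch.norm(sigma_atk * g)       # scalar standard deviation
    g_rnd = std * torch.randn_like(g)     # sample from N(0, std^2 * I)
    return g + g_rnd
```

Note that the RD-attack noise has zero mean, which is why it is treated as an accidental failure rather than a malicious attack.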
5.2. Image Classification Experiment

In the IC experiment, algorithms are evaluated on CIFAR-10 (Krizhevsky et al., 2009) with the deep learning model ResNet-20 (He et al., 2016). Cross-entropy is used as the loss function. We set $k_{atk} = 10$ for NG-attack, and $\sigma_{atk} = 0.2$ for RD-attack. $k_{del}$ is randomly sampled from the standard normal distribution truncated to $[0, +\infty)$. As suggested in (He et al., 2016), the learning rate $\eta$ is set to 0.1 initially for each algorithm, and multiplied by 0.1 at the 80-th epoch and the 120-th epoch, respectively. The weight decay is set to $10^{-4}$. We run each algorithm for 160 epochs. The batch size is set to 25.
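For concreteness, here is a minimal sketch of the optimizer, learning-rate schedule, and delay sampling described above, assuming standard PyTorch components; the ResNet-18 backbone stands in for ResNet-20 (whose CIFAR-style definition is omitted), momentum is left at its default of 0, and the snippet is illustrative rather than the exact training script.

```python
import torch
import torchvision

# Stand-in backbone: the experiments use ResNet-20 for CIFAR-10; a torchvision
# ResNet-18 with 10 output classes is used here only as a placeholder.
model = torchvision.models.resnet18(num_classes=10)

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=1e-4)
# Multiply the learning rate by 0.1 at the 80-th and the 120-th epoch
# (call scheduler.step() once per epoch in the training loop).
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)

# Worker delay factor: a standard normal truncated to [0, +inf), which has the
# same distribution as the absolute value of a standard normal sample.
k_del = torch.randn(1).abs().item()
```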
Firstly, we compare the performance of different methods when there are no Byzantine workers. Experimental results with the median and trmean aggregation functions are illustrated in Figure 3(a) and Figure 3(d), respectively. ASGD achieves the best performance. BASGD ($B > 1$) and Kardam have a convergence rate similar to that of ASGD, but both sacrifice a little accuracy. Besides, the performance of BASGD gets worse when the buffer number $B$ increases, which is consistent with the theoretical results. Please note that ASGD is a degenerated case of BASGD when $B = 1$ and $\mathrm{Aggr}(h_1) = h_1$. Hence, BASGD can achieve the same performance as ASGD when there is no failure or attack.

Then, for each type of attack, we conduct two experiments in which there are 3 and 6 Byzantine workers, respectively. We set 10 and 15 buffers for BASGD in these two experiments, respectively. To save space, we only present the average top-1 test accuracy, in Figure 3(b) and Figure 3(e) (3 Byzantine workers) and in Figure 3(c) and Figure 3(f) (6 Byzantine workers). Results for the training loss are in Appendix C. We can find that BASGD significantly outperforms ASGD and Kardam under both RD-attack (accidental failure) and NG-attack (malicious attack). Under the less harmful RD-attack, although ASGD and Kardam still converge, they both suffer a significant loss in accuracy. Under NG-attack, both ASGD and Kardam cannot converge, even though we have tried different values of the assumed Byzantine worker number for Kardam, which is denoted by the hyper-parameter $\gamma$ in this paper. Hence, neither ASGD nor Kardam can resist malicious attack. On the contrary, BASGD still has a relatively good performance under both types of attack.

Figure 3. Average top-1 test accuracy w.r.t. epochs when there are no Byzantine workers (left column), 3 Byzantine workers (middle column) and 6 Byzantine workers (right column), respectively. Subfigures (b) and (c) are for RD-attack, while (e) and (f) are for NG-attack.

Moreover, we count the ratio of filtered gradients in Kardam, which is shown in Table 1. We can find that in order to filter Byzantine gradients, Kardam also filters an approximately equal ratio of loyal gradients. This explains why Kardam performs poorly under malicious attack.

Table 1. Filtered ratio of received gradients in Kardam under NG-attack in the IC task (3 Byzantine workers).

Term | By Frequency Filter | By Lipschitz Filter | In Total
Loyal grads (γ = 3) | 10.15% (31202/307530) | 40.97% (126000/307530) | 51.12%
Byzantine grads (γ = 3) | 10.77% (3681/34170) | 40.31% (13773/34170) | 51.08%
Loyal grads (γ = 8) | 28.28% (86957/307530) | 28.26% (86893/307530) | 56.53%
Byzantine grads (γ = 8) | 28.38% (9699/34170) | 28.06% (9588/34170) | 56.44%
Loyal grads (γ = 14) | 85.13% (261789/307530) | 3.94% (12117/307530) | 89.07%
Byzantine grads (γ = 14) | 84.83% (28985/34170) | 4.26% (1455/34170) | 89.08%

5.3. Natural Language Processing Experiment

In the NLP experiment, the algorithms are evaluated on the WikiText-2 dataset with LSTM (Hochreiter & Schmidhuber, 1997) networks. We only use the training set and test set, while the validation set is not used in our experiment. For the LSTM, we adopt 2 layers with 100 units in each. The word embedding size is set to 100, and the sequence length is set to 35. The gradient clipping size is set to 0.25.
Cross-entropy is used as the loss function. We run each algorithm for 40 epochs. The initial learning rate $\eta$ is chosen from $\{1, 2, 5, 10, 20\}$, and is divided by 4 every 10 epochs. The best test result is adopted as the final one.
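To make the language-model setup concrete, here is a minimal sketch of a 2-layer LSTM with 100 hidden units and 100-dimensional word embeddings, together with gradient clipping at 0.25; the class name, vocabulary size, and training-step wrapper are illustrative placeholders and not taken from the actual implementation.

```python
import torch
import torch.nn as nn

class WordLSTM(nn.Module):
    # Placeholder class name; layer sizes follow the setup described above.
    def __init__(self, vocab_size, emb_size=100, hidden_size=100, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.lstm = nn.LSTM(emb_size, hidden_size, num_layers=num_layers)
        self.decoder = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, hidden=None):
        # tokens: (seq_len, batch) integer tensor, with seq_len = 35 in the experiments
        emb = self.embedding(tokens)
        output, hidden = self.lstm(emb, hidden)
        return self.decoder(output), hidden

model = WordLSTM(vocab_size=10000)  # placeholder vocabulary size
criterion = nn.CrossEntropyLoss()

def train_step(batch_tokens, batch_targets, optimizer):
    # batch_tokens, batch_targets: (seq_len, batch) integer tensors
    optimizer.zero_grad()
    logits, _ = model(batch_tokens)
    loss = criterion(logits.reshape(-1, logits.size(-1)), batch_targets.reshape(-1))
    loss.backward()
    # Clip gradients to the size used in the experiments (0.25).
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)
    optimizer.step()
    return loss.item()
```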
The performance of ASGD under no attack is used as the gold standard. We set $k_{atk} = 10$ and $\sigma_{atk} = 0.1$. One of the eight workers is Byzantine. $k_{del}$ is randomly sampled from an exponential distribution with parameter $\lambda = 1$. Each experiment is carried out 3 times, and the average perplexity is reported in Figure 4. We can find that BASGD converges under each kind of attack, with only a little loss in perplexity compared to the gold standard (ASGD without attack). On the other hand, ASGD and Kardam both fail, even though we have set the largest $\gamma$ ($\gamma = 3$) for Kardam.

Figure 4. Average perplexity w.r.t. epochs with 1 Byzantine worker. Subfigures (a) and (b) are for RD-attack, while Subfigures (c) and (d) are for NG-attack. Due to the differences in magnitude of perplexity, the y-axes of Subfigures (a) and (c) are in log-scale. In addition, Subfigures (b) and (d) illustrate that BASGD converges with only a little loss in perplexity compared to the gold standard.

6. Conclusion

In this paper, we propose a novel method called BASGD for asynchronous Byzantine learning. To the best of our knowledge, BASGD is the first ABL method that can resist malicious attack without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD has a wider scope of application. BASGD is proved to be convergent and able to resist failure or attack. Empirical results show that BASGD significantly outperforms vanilla ASGD and other ABL baselines when there exists failure or attack on workers.

Acknowledgements

This work is supported by National Key R&D Program of China (No. 2020YFA0713900), NSFC-NRF Joint Research Project (No. 61861146001) and NSFC Project (No. 61921006).

References

Alistarh, D., Allen-Zhu, Z., and Li, J. Byzantine stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 4613-4623, 2018.

Assran, B. M., Aytekin, A., Feyzmahdavian, H. R., Johansson, M., and Rabbat, M. G. Advances in asynchronous parallel and distributed optimization. Proceedings of the IEEE, 108(11):2013-2031, 2020.

Baruch, G., Baruch, M., and Goldberg, Y. A little is enough: Circumventing defenses for distributed learning. In Advances in Neural Information Processing Systems, pp. 8635-8645, 2019.

Bernstein, J., Zhao, J., Azizzadenesheli, K., and Anandkumar, A. signSGD with majority vote is communication efficient and fault tolerant. In Proceedings of the International Conference on Learning Representations, 2019.

Blanchard, P., Guerraoui, R., Stainer, J., et al. Machine learning with adversaries: Byzantine tolerant gradient