BASGD: Buffered Asynchronous SGD for Byzantine Learning

Yi-Rui Yang 1   Wu-Jun Li 1

1 National Key Laboratory for Novel Software Technology, Department of Computer Science and Technology, Nanjing University, China. Correspondence to: Wu-Jun Li <liwujun@nju.edu.cn>.

Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

Abstract

Distributed learning has become a hot research topic due to its wide application in cluster-based large-scale learning, federated learning, edge computing and so on. Most traditional distributed learning methods typically assume no failure or attack. However, many unexpected cases, such as communication failure and even malicious attack, may happen in real applications. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has recently attracted much attention. Most existing BL methods are synchronous, which are impractical in some applications due to heterogeneous or offline workers. In these cases, asynchronous BL (ABL) is usually preferred. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. To the best of our knowledge, BASGD is the first ABL method that can resist malicious attack without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD has a wider scope of application. BASGD is proved to be convergent and able to resist failure or attack. Empirical results show that BASGD significantly outperforms vanilla asynchronous stochastic gradient descent (ASGD) and other ABL baselines when there exists failure or attack on workers.

1. Introduction

Due to its wide application in cluster-based large-scale learning, federated learning (Konečný et al., 2016; Kairouz et al., 2019), edge computing (Shi et al., 2016) and so on, distributed learning has recently become a hot research topic (Zinkevich et al., 2010; Yang, 2013; Jaggi et al., 2014; Shamir et al., 2014; Zhang & Kwok, 2014; Ma et al., 2015; Lee et al., 2017; Lian et al., 2017; Zhao et al., 2017; Sun et al., 2018; Wangni et al., 2018; Zhao et al., 2018; Zhou et al., 2018; Yu et al., 2019a;b; Haddadpour et al., 2019). Most traditional distributed learning methods are based on stochastic gradient descent (SGD) and its variants (Bottou, 2010; Xiao, 2010; Duchi et al., 2011; Johnson & Zhang, 2013; Shalev-Shwartz & Zhang, 2013; Zhang et al., 2013; Lin et al., 2014; Schmidt et al., 2017; Zheng et al., 2017; Zhao et al., 2018), and typically assume no failure or attack. However, in real distributed learning applications with multiple networked machines (nodes), different kinds of hardware or software failure may happen. Representative failures include bit-flipping in the communication media and in the memory of some workers (Xie et al., 2019). In this case, a small failure on some machines (workers) might cause a distributed learning method to fail. In addition, malicious attack should not be neglected in an open network where the manager (or server) generally has little control over the workers, such as in edge computing and federated learning. Malicious workers may behave arbitrarily or even adversarially. Hence, Byzantine learning (BL), which refers to distributed learning with failure or attack, has attracted much attention (Diakonikolas et al., 2017; Chen et al., 2017; Blanchard et al., 2017; Damaskinos et al., 2018; Baruch et al., 2019; Diakonikolas & Kane, 2019).
Existing BL methods can be divided into two main categories: synchronous BL (SBL) methods and asynchronous BL (ABL) methods. In SBL methods, the learning information, such as the gradient in SGD, of all workers is aggregated in a synchronous way. In contrast, in ABL methods the learning information of workers is aggregated in an asynchronous way. Existing SBL methods mainly take two different approaches to achieve resilience against Byzantine workers, which refer to workers with failure or attack. One approach is to replace the simple averaging aggregation operation with a more robust aggregation operation, such as median and trimmed-mean (Yin et al., 2018). Krum (Blanchard et al., 2017) and ByzantinePGD (Yin et al., 2019) take this approach. The other approach is to filter suspicious learning information (gradients) before averaging. Representative examples include ByzantineSGD (Alistarh et al., 2018) and Zeno (Xie et al., 2019). The advantage of SBL methods is that they are relatively simple and easy to implement. But SBL methods will result in slow
convergence when there exist heterogeneous workers. Furthermore, in some applications like federated learning and edge computing, synchronization cannot even be performed most of the time due to offline workers (clients or edge servers). Hence, ABL is preferred in these cases.

To the best of our knowledge, there exist only two ABL methods: Kardam (Damaskinos et al., 2018) and Zeno++ (Xie et al., 2020). Kardam introduces two filters to drop suspicious learning information (gradients), and can still achieve good performance when the communication delay is heavy. However, in the face of malicious attack, Kardam has been found (Xie et al., 2020) to also drop most correct gradients in order to filter out all faulty (failure) gradients. Hence, Kardam cannot resist malicious attack. Zeno++ needs to store some training instances on the server for scoring. In some practical applications like federated learning (Kairouz et al., 2019), storing data on the server will increase the risk of privacy leakage or even face legal risk. Therefore, under the general setting where the server has no access to any training instances, there exists no ABL method that can resist malicious attack.

In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for ABL. The main contributions are listed as follows:

• To the best of our knowledge, BASGD is the first ABL method that can resist malicious attack without storing any instances on the server. Compared with those methods which need to store instances on the server, BASGD has a wider scope of application.

• BASGD is theoretically proved to be convergent and able to resist failure or attack.

• Empirical results show that BASGD significantly outperforms vanilla asynchronous stochastic gradient descent (ASGD) and other ABL baselines when there exist failures or malicious attacks on workers. In particular, BASGD can still converge under malicious attack, when ASGD and other ABL methods fail.

2. Preliminary

In this section, we present the preliminaries of this paper, including the distributed learning framework used in this paper and the definition of Byzantine worker.

2.1. Distributed Learning Framework

Many machine learning models, such as logistic regression and deep neural networks, can be formulated as the following finite sum optimization problem:

$$\min_{w \in \mathbb{R}^d} F(w) = \frac{1}{n}\sum_{i=1}^{n} f(w; z_i), \qquad (1)$$

where $w$ is the parameter to learn, $d$ is the dimension of the parameter, $n$ is the number of training instances, and $f(w; z_i)$ is the empirical loss on instance $z_i$. The goal of distributed learning is to solve the problem in (1) by designing learning algorithms based on multiple networked machines.

Although many distributed learning frameworks have appeared, in this paper we focus on the widely used Parameter Server (PS) framework (Li et al., 2014). In a PS framework, there are several workers and one or more servers. Each worker can only communicate with the server(s). There may exist more than one server in a PS framework, but for the problem studied in this paper the servers can be logically viewed as a single unit. Without loss of generality, we will assume there is only one server in this paper. Training instances are disjointly distributed across $m$ workers. Let $D_k$ denote the index set of training instances on worker $k$; we have $\cup_{k=1}^{m} D_k = \{1, 2, \ldots, n\}$ and $D_k \cap D_{k'} = \emptyset$ if $k \neq k'$. In this paper, we assume that the server has no access to any training instances. If two instances have the same value, they are still deemed as two distinct instances. Namely, $z_i$ may equal $z_{i'}$ ($i \neq i'$).
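To make the data-partition setting concrete, the following minimal Python sketch (our own illustration, not part of the paper) builds one possible disjoint partition of the instance indices across $m$ workers; the function name partition_indices and the round-robin scheme are assumptions for illustration only.

```python
import numpy as np

def partition_indices(n, m, seed=0):
    """Disjointly partition instance indices {0, ..., n-1} across m workers,
    so that the union of the index sets D_k covers every instance and the
    sets are pairwise disjoint."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)               # shuffle the indices once
    return [perm[k::m] for k in range(m)]   # round-robin split

# Example: n = 10 instances, m = 3 workers.
D = partition_indices(10, 3)
assert sum(len(D_k) for D_k in D) == 10     # disjoint sets covering all instances
```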
One popular asynchronous method to solve the problem in (1) under the PS framework is ASGD (Dean et al., 2012) (see Appendix A of the supplementary materials for details). In this paper, we assume each worker samples one instance for gradient computation each time. The analysis of the mini-batch case is similar.

In PS-based ASGD, the server is responsible for updating and maintaining the latest parameter. The number of iterations that the server has already executed is used as the global logical clock of the server. At the beginning, the iteration number $t = 0$. Each time a SGD step is executed, $t$ will increase by 1 immediately. The parameter after $t$ iterations is denoted as $w^t$. If the server sends parameters to worker $k$ at iteration $t'$, some SGD steps may have been executed before the server next receives a gradient from worker $k$ at iteration $t$. Thus, we define the delay of worker $k$ at iteration $t$ as $\tau_k^t = t - t'$. Worker $k$ is heavily delayed at iteration $t$ if $\tau_k^t > \tau_{\max}$, where $\tau_{\max}$ is a pre-defined non-negative constant.

2.2. Byzantine Worker

For workers that have sent gradients (one or more) to the server at iteration $t$, we call worker $k$ a loyal worker if it has finished all the tasks without any fault and each sent gradient is correctly received by the server. Otherwise, worker $k$ is called a Byzantine worker. If worker $k$ is a Byzantine worker, the received gradient from worker $k$ is not credible and can be an arbitrary value. In ASGD, there is one received gradient at a time. Formally, we denote the gradient received from worker $k$ at iteration $t$ as $g_k^t$. Then, we have:

$$g_k^t = \begin{cases} \nabla f(w^{t'}; z_i), & \text{if worker } k \text{ is loyal at iteration } t; \\ \ast, & \text{if worker } k \text{ is Byzantine at iteration } t, \end{cases}$$

where $0 \leq t' \leq t$, and $i$ is randomly sampled from $D_k$. '$\ast$' represents an arbitrary value. Our definition of Byzantine worker is consistent with most previous works (Blanchard et al., 2017; Xie et al., 2019; 2020). Either accidental failure or malicious attack will result in Byzantine workers.
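To make the above notation concrete, the sketch below (our own illustrative assumption, not the paper's implementation) shows what the server may receive from a loyal worker versus a Byzantine worker, and how the delay $\tau_k^t = t - t'$ is computed from the iteration at which the parameter was sent.

```python
import numpy as np

def loyal_gradient(grad_fn, w_sent, data_k, rng):
    """A loyal worker computes the true stochastic gradient at the (possibly
    stale) parameter w_sent, on an instance sampled from its local data D_k."""
    z_i = data_k[rng.integers(len(data_k))]
    return grad_fn(w_sent, z_i)

def byzantine_gradient(d, rng):
    """A Byzantine worker may send an arbitrary d-dimensional vector;
    here, for example, a large random vector (the '*' in the notation above)."""
    return rng.normal(scale=100.0, size=d)

def delay(t, t_sent):
    """Delay tau_k^t = t - t', where t' is the iteration at which the server
    sent the parameter to worker k and t is the current iteration."""
    return t - t_sent

# Example: a Byzantine worker's message for a 5-dimensional model.
rng = np.random.default_rng(0)
g = byzantine_gradient(d=5, rng=rng)
```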
3. Buffered Asynchronous SGD

In synchronous BL, gradients from all workers are received at each iteration. We can compare the gradients with each other and then filter suspicious ones, or use more robust aggregation rules such as median and trimmed-mean for updating. However, in asynchronous BL, only one gradient is received at a time. Without any training instances stored on the server, it is difficult for the server to identify whether a received gradient is credible or not.

In order to deal with this problem in asynchronous BL, we propose a novel method called buffered asynchronous SGD (BASGD). BASGD introduces $B$ buffers ($0 < B \leq m$) on the server, and the gradient used for updating parameters will be aggregated from these buffers. The details of the learning procedure of BASGD are presented in Algorithm 1. In this section, we will introduce the three key components of BASGD: buffer, aggregation function, and mapping table.

Algorithm 1 Buffered Asynchronous SGD (BASGD)

Server:
  Input: learning rate $\eta$, reassignment interval $\Delta$, buffer number $B$, aggregation function Aggr(·);
  Initialization: initial parameter $w^0$;
  Set buffers: $h_b \leftarrow 0$, $N_b^t \leftarrow 0$ ($b = 1, \ldots, B$);
  Initialize mapping table $\beta_s \leftarrow s$ ($s = 0, 1, \ldots, m-1$);
  Send initial $w^0$ to all workers;
  Set $t \leftarrow 0$, and start the timer;
  repeat
    Wait until receiving a gradient $g$ from some worker $s$;
    Choose buffer: $b \leftarrow \beta_s \bmod B$;
    Let $N_b^t \leftarrow N_b^t + 1$, and $h_b \leftarrow \frac{(N_b^t - 1) h_b + g}{N_b^t}$;
    if $N_b^t > 0$ for each $b \in [B]$ then
      Aggregate: $G^t = \mathrm{Aggr}([h_1, \ldots, h_B])$;
      Execute SGD step: $w^{t+1} \leftarrow w^t - \eta \cdot G^t$;
      Zero out buffers: $h_b \leftarrow 0$, $N_b^t \leftarrow 0$ ($b = 1, \ldots, B$);
      Set $t \leftarrow t + 1$, and restart the timer;
    end if
    if the timer has exceeded $\Delta$ seconds then
      Zero out buffers: $h_b \leftarrow 0$, $N_b^t \leftarrow 0$ ($b = 1, \ldots, B$);
      Modify the mapping table $\{\beta_s\}_{s=0}^{m-1}$ for buffer reassignment, and restart the timer;
    end if
    Send the latest parameters back to worker $s$, no matter whether a SGD step is executed or not;
  until stop criterion is satisfied
  Notify all workers to stop;

Worker $k$: ($k = 0, 1, \ldots, m-1$)
  repeat
    Wait until receiving the latest parameter $w$ from the server;
    Randomly sample an index $i$ from $D_k$;
    Compute $\nabla f(w; z_i)$;
    Send $\nabla f(w; z_i)$ to the server;
  until receiving the server's notification to stop

3.1. Buffer

In BASGD, the $m$ workers do the same job as that in ASGD, while the updating rule on the server is modified. More specifically, there are $B$ buffers ($0 < B \leq m$) on the server. When a gradient $g$ from worker $s$ is received, it will be temporarily stored in buffer $b$, where $b = s \bmod B$, as illustrated in Figure 1. Only when each buffer has stored at least one gradient will a new SGD step be executed. Please note that no matter whether a SGD step is executed or not, the server will immediately send the latest parameters back to the worker after receiving a gradient. Hence, BASGD introduces no barrier, and is an asynchronous algorithm.

Figure 1. An example of buffers. Each circle represents a worker, and the number is the worker ID. There are 15 workers and 5 buffers. The gradient received from worker $s$ is stored in buffer $\{s \bmod 5\}$.

For each buffer $b$, more than one gradient may have been received at iteration $t$. We will store the average of these gradients (denoted by $h_b$) in buffer $b$.
Assume that there are already $(N-1)$ gradients $g_1, g_2, \ldots, g_{N-1}$ which should be stored in buffer $b$, and $h_b^{(old)} = \frac{1}{N-1}\sum_{i=1}^{N-1} g_i$. When the $N$-th gradient $g_N$ is received, the new average value is:

$$h_b^{(new)} = \frac{1}{N}\sum_{i=1}^{N} g_i = \frac{N-1}{N} \cdot h_b^{(old)} + \frac{1}{N} \cdot g_N.$$

This is the updating rule for each buffer $b$ when a gradient is received. We use $N_b^t$ to denote the total number of gradients stored in buffer $b$ at the $t$-th iteration. After the parameter $w$ is updated, all buffers will be zeroed out at once. With the benefit of buffers, the server has access to $B$ candidate gradients when updating the parameter.
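The running average in each buffer can be maintained incrementally, without keeping the individual gradients. The following minimal Python sketch (the class name Buffer is our own, not from the paper) implements the update rule above together with the zeroing-out step.

```python
import numpy as np

class Buffer:
    """One server-side buffer: stores the running average h_b of the gradients
    assigned to it, and the count N_b of gradients received since the last reset."""
    def __init__(self, d):
        self.h = np.zeros(d)    # current average of the received gradients
        self.N = 0              # number of gradients received

    def push(self, g):
        """Incremental form of h_new = ((N - 1) * h_old + g) / N."""
        self.N += 1
        self.h += (g - self.h) / self.N

    def reset(self):
        """Zero out the buffer after an SGD step (or a reassignment)."""
        self.h[:] = 0.0
        self.N = 0

# An SGD step is executed only once every buffer has N > 0; all buffers
# are then reset before the next iteration.
```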
Thus, a more reliable (robust) gradient can be aggregated from the $B$ buffered gradients, provided that a proper aggregation function Aggr(·) is chosen.

Please note that from the perspective of the workers, BASGD is fully asynchronous, since a worker will immediately receive the latest parameter from the server after sending a gradient to the server, without waiting for other workers. Meanwhile, from the perspective of the server, BASGD is semi-asynchronous because the server will not update the model until all buffers are filled. However, it is necessary to limit the updating frequency in ABL when the server has no instances. If the server always updated the model when receiving a gradient, it would be easily foiled when Byzantine workers send gradients much more frequently than others. A similar conclusion has been proved in previous works (Damaskinos et al., 2018).

3.2. Aggregation Function

When a SGD step is ready to be executed, there are $B$ buffers providing candidate gradients. An aggregation function is needed to get the final gradient for updating. A naive way is to take the mean of all candidate gradients. However, the mean value is sensitive to outliers, which are common in BL. For designing proper aggregation functions, we first define the q-Byzantine Robust (q-BR) condition to quantitatively describe the Byzantine resilience ability of an aggregation function.

Definition 1 (q-Byzantine Robust). For an aggregation function Aggr(·) with $\mathrm{Aggr}([h_1, \ldots, h_B]) = G$, where $G = [G_1, \ldots, G_d]^T$ and $h_b = [h_{b1}, \ldots, h_{bd}]^T$, $\forall b \in [B]$, we call Aggr(·) q-Byzantine Robust ($q \in \mathbb{Z}$, $0 < q < B/2$) if it satisfies the following two properties:

(a). $\mathrm{Aggr}([h_1 + h', \ldots, h_B + h']) = \mathrm{Aggr}([h_1, \ldots, h_B]) + h'$, $\forall h_1, \ldots, h_B \in \mathbb{R}^d$, $\forall h' \in \mathbb{R}^d$;

(b). $\min_{s \in S}\{h_{sj}\} \leq G_j \leq \max_{s \in S}\{h_{sj}\}$, $\forall j \in [d]$, $\forall S \subset [B]$ with $|S| = B - q$.

Intuitively, property (a) in Definition 1 says that if the same vector $h'$ is added to all candidate gradients, the aggregated gradient will also be shifted by $h'$. Property (b) says that for each coordinate $j$, the aggregated value $G_j$ will be between the $(q+1)$-th smallest value and the $(q+1)$-th largest value among the $j$-th coordinates of all candidate gradients. Thus, the gradient aggregated by a q-BR function is insensitive to at least $q$ outliers. We can see that the q-BR condition gets stronger as $q$ increases. Namely, if Aggr(·) is q-BR, then for any $0 < q' < q$, Aggr(·) is also q'-BR.

Remark 1. When $B > 1$, the mean function is not q-Byzantine Robust for any $q > 0$. We illustrate this with a one-dimensional example: $h_1, \ldots, h_{B-1} \in [0, 1]$ and $h_B = 10 \times B$. Then $\frac{1}{B}\sum_{b=1}^{B} h_b \geq \frac{h_B}{B} = 10 \notin [0, 1]$. Namely, the mean is larger than any of the first $B-1$ values.

The following two aggregation functions are both q-BR.

Definition 2 (Coordinate-wise median (Yin et al., 2018)). For candidate gradients $h_1, h_2, \ldots, h_B \in \mathbb{R}^d$ with $h_b = [h_{b1}, h_{b2}, \ldots, h_{bd}]^T$, $\forall b = 1, 2, \ldots, B$, coordinate-wise median is defined as:

$$\mathrm{Med}([h_1, \ldots, h_B]) = [\mathrm{Med}(h_{\cdot 1}), \ldots, \mathrm{Med}(h_{\cdot d})]^T,$$

where $\mathrm{Med}(h_{\cdot j})$ is the scalar median of the $j$-th coordinates, $\forall j = 1, 2, \ldots, d$.

Definition 3 (Coordinate-wise q-trimmed-mean (Yin et al., 2018)). For any positive integer $q < B/2$ and candidate gradients $h_1, h_2, \ldots, h_B \in \mathbb{R}^d$ with $h_b = [h_{b1}, h_{b2}, \ldots, h_{bd}]^T$, $\forall b = 1, 2, \ldots, B$, coordinate-wise q-trimmed-mean is defined as:
$$\mathrm{Trm}([h_1, \ldots, h_B]) = [\mathrm{Trm}(h_{\cdot 1}), \ldots, \mathrm{Trm}(h_{\cdot d})]^T,$$

where $\mathrm{Trm}(h_{\cdot j}) = \frac{1}{B-2q}\sum_{b \in M_j} h_{bj}$ is the scalar q-trimmed-mean, and $M_j$ is the subset of $\{h_{bj}\}_{b=1}^{B}$ obtained by removing the $q$ largest elements and the $q$ smallest elements.

In the following content, coordinate-wise median and coordinate-wise q-trimmed-mean are also called median and trmean, respectively. Proposition 1 shows the q-BR property of these two functions.

Proposition 1. Coordinate-wise q-trmean is q-BR, and coordinate-wise median is $\lfloor \frac{B-1}{2} \rfloor$-BR.

Here, $\lfloor x \rfloor$ represents the maximum integer that is not larger than $x$. According to Proposition 1, both median and trmean are proper choices for the aggregation function in BASGD. The proof can be found in Appendix B of the supplementary materials.
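Both aggregation functions are straightforward to implement coordinate-wise. The following NumPy sketch (our own, not the authors' code) implements them and includes a toy check showing that a single outlier buffer shifts the mean but leaves the median and the trimmed-mean close to the honest values.

```python
import numpy as np

def coordinate_median(H):
    """Coordinate-wise median of candidate gradients H, of shape (B, d)."""
    return np.median(H, axis=0)

def coordinate_trimmed_mean(H, q):
    """Coordinate-wise q-trimmed-mean: for each coordinate, remove the q largest
    and q smallest values, then average the remaining B - 2q values."""
    B = H.shape[0]
    assert 0 < q < B / 2
    H_sorted = np.sort(H, axis=0)          # sort each coordinate independently
    return H_sorted[q:B - q].mean(axis=0)

# Toy check: B = 5 buffers, d = 2, the last buffer holds an outlier "gradient".
H = np.array([[0.1, 0.2], [0.2, 0.1], [0.0, 0.3], [0.15, 0.25], [50.0, -50.0]])
print(H.mean(axis=0))                      # pulled far away by the outlier
print(coordinate_median(H))                # stays among the honest values
print(coordinate_trimmed_mean(H, q=1))     # also stays among the honest values
```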
Now we define another class of aggregation functions, which is also important for the analysis in Section 4.

Definition 4 (Stable aggregation function). Aggregation function Aggr(·) is called stable provided that $\forall h_1, \ldots, h_B, \tilde{h}_1, \ldots, \tilde{h}_B \in \mathbb{R}^d$, letting $\delta = (\sum_{b=1}^{B} \|h_b - \tilde{h}_b\|^2)^{\frac{1}{2}}$, we have:

$$\|\mathrm{Aggr}(h_1, \ldots, h_B) - \mathrm{Aggr}(\tilde{h}_1, \ldots, \tilde{h}_B)\| \leq \delta.$$

If Aggr(·) is a stable aggregation function, then when there is a disturbance with L2-norm $\delta$ on the buffers, the disturbance of the aggregated result will not be larger than $\delta$.

Definition 5 (Effective aggregation function). When there are at most $r$ Byzantine workers, a stable aggregation function Aggr(·) is called an $(A_1, A_2)$-effective aggregation function, provided that it satisfies the following two properties for all $w^t \in \mathbb{R}^d$ in cases without delay ($\tau_k^t = 0$, $\forall t = 0, 1, \ldots, T-1$):

(i). $\mathbb{E}[\nabla F(w^t)^T G_{syn}^t \mid w^t] \geq \|\nabla F(w^t)\|^2 - A_1$;

(ii). $\mathbb{E}[\|G_{syn}^t\|^2 \mid w^t] \leq (A_2)^2$;

where $A_1, A_2 \in \mathbb{R}_+$ are two non-negative constants, and $G_{syn}^t$ is the gradient aggregated by Aggr(·) at the $t$-th iteration in cases without delay.

For different aggregation functions, the constants $A_1$ and $A_2$ may differ. $A_1$ and $A_2$ are related to the loss function $F(\cdot)$, the distribution of instances, the buffer number $B$, and the maximum Byzantine worker number $r$. Inequalities (i) and (ii) in Definition 5 are two important properties in the convergence proof of synchronous Byzantine learning methods. As revealed in (Yang et al., 2020), there are many existing aggregation rules for Byzantine learning. We find that most of them satisfy Definition 5. For example, Krum, median, and trimmed-mean have already been proved to satisfy these two properties (Blanchard et al., 2017; Yin et al., 2018). SignSGD (Bernstein et al., 2019) can be seen as a combination of 1-bit quantization and median aggregation, while median satisfies the properties in Definition 5.

Compared to Definition 1, Definition 5 can be used to obtain a tighter bound with respect to $A_1$ and $A_2$. However, it usually requires more effort to check the two properties in Definition 5 than those in Definition 1.

Please note that too large a $B$ could lower the updating frequency and damage the performance, while too small a $B$ may harm the Byzantine resilience. Thus, a moderate $B$ is usually preferred. In some practical applications, we could estimate the maximum number of Byzantine workers $r$, and set $B$ to make the aggregation function resilient to up to $r$ Byzantine workers. In particular, $B$ is suggested to be $(2r+1)$ for median, since median is $\lfloor \frac{B-1}{2} \rfloor$-BR.

3.3. Mapping Table

At each iteration of BASGD, buffer $b$ needs at least one gradient for aggregation. In the worst case, all the workers corresponding to buffer $b$ may be unresponsive. In this case, buffer $b$ will become a straggler and slow down the whole learning process. To deal with this problem, we introduce the mapping table technique for buffer reassignment.

We call a worker an active worker if it has responded at the current iteration. If no SGD step has been executed for $\Delta$ seconds, the server immediately zeroes out the stored gradients in all buffers, equally reassigns the active workers to the buffers, and then continues the learning procedure. The hyper-parameter $\Delta$ is called the reassignment interval. Figure 2 illustrates an example of reassignment. The grey circles represent unresponsive workers. After reassignment, there is at least one active worker corresponding to each buffer.

Figure 2. An example of buffer reassignment. White circles represent active workers, and grey circles represent unresponsive workers. Before reassignment, buffer 0 is a straggler. After reassignment, there is at least one active worker corresponding to each buffer.

Specifically, we introduce a mapping table $\{\beta_s\}_{s=0}^{m-1}$ for buffer reassignment. Initially, $\beta_s = s$ ($\forall s = 0, 1, \ldots, m-1$). When reassigning buffers, the server only needs to modify the mapping table; worker $s$'s gradients are then stored in buffer $\{\beta_s \bmod B\}$ instead of buffer $\{s \bmod B\}$, and there is no need to notify the workers.

Besides, a timer is used on the server to indicate when to reassign buffers. The timer is started at the beginning of BASGD, and is restarted immediately after each SGD step or buffer reassignment. When the timer exceeds $\Delta$ seconds, the buffers will be zeroed out and reassignment executed. The hyper-parameter $\Delta$ should be set properly. If $\Delta$ is too small, the buffers will be zeroed out too frequently, which may slow down the learning process. If $\Delta$ is too large, straggler buffers cannot be eliminated in time.
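A minimal sketch of the reassignment step is given below (our own illustration; the exact way of spreading active workers over buffers is an assumption, since the paper only specifies that the mapping table $\beta$ is modified and workers are not notified). It reuses the active-worker set from Figure 2.

```python
def reassign_buffers(beta, active_workers, B):
    """Equally reassign the active workers across the B buffers by rewriting
    the mapping table beta; worker s's gradients are then stored in buffer
    beta[s] % B.  If there are at least B active workers, every buffer is
    guaranteed at least one active worker."""
    for rank, s in enumerate(sorted(active_workers)):
        beta[s] = rank                  # spread active workers round-robin
    return beta

# Example with m = 15 workers, B = 5 buffers and 8 active workers (as in Figure 2).
m, B = 15, 5
beta = list(range(m))                   # initially beta_s = s
active = [1, 3, 4, 7, 8, 9, 11, 14]
beta = reassign_buffers(beta, active, B)
assert {beta[s] % B for s in active} == set(range(B))   # each buffer gets an active worker
```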
4. Convergence

In this section, we theoretically prove the convergence and resilience of BASGD against failure or attack. There are two main theorems. The first theorem presents a relatively loose but general bound for all q-BR aggregation functions. The second one presents a relatively tight bound for each distinct $(A_1, A_2)$-effective aggregation function. Since the definition of an $(A_1, A_2)$-effective aggregation function is usually more difficult to verify than the q-BR property, the general bound is also useful. Here we only present the results. Proof details are in Appendix B of the supplementary materials.