observing system, which is designed to obtain a consistent and complete observation of the vector y. Now let us present the last assumption.

(A.3) Observability: The observation model formulated by Equation (1) is distributedly observable in the sense of Definition 1.

1) Unbiasedness and consistency of DRC: In this part, we show the unbiasedness and the consistency of the DRC algorithm, and we provide two theorems to establish them respectively.

Theorem 1. Suppose the DRC algorithm runs under Assumptions A.1-A.3 (Section III-A). Then the record sequence $\{u_k(i)\}_{i \ge 0}$ at node $k$ is asymptotically unbiased:
$$\lim_{i \to \infty} \mathbb{E}[u_k(i)] = y^*, \quad \forall\, 1 \le k \le |V|. \qquad (11)$$

We defer the proof to Section V-A. Theorem 1 establishes the unbiasedness of the algorithm: each node's estimate of the global data is correct on average in the long run. The consistency of the DRC algorithm is guaranteed by the following theorem.

Theorem 2. Suppose the DRC algorithm runs under Assumptions A.1-A.3 (Section III-A). Then the record sequence $\{u_k(i)\}_{i \ge 0}$ at node $k$ is consistent:
$$\mathbb{P}\Big[\lim_{i \to \infty} u_k(i) = y^*, \ \forall\, 1 \le k \le |V|\Big] = 1.$$

We provide the proof in Appendix B. Based on Theorem 2, the record sequence $\{u_k(i)\}_{i \ge 1}$ at every node converges, with probability 1, to the true vector $y^*$.

2) Convergence rate analysis: We now analyze the convergence rate of the DRC algorithm via its deviation characteristic. We first present a related definition, which is used to characterize the convergence rate of a sequential process.

Definition 2. A sequence of records $\{u(i)\}_{i \ge 0}$ is asymptotically normal if there exists a positive semidefinite matrix $S(y)$ such that
$$\sqrt{i}\,\big(u_k(i) - y^*\big) \Rightarrow \mathcal{N}\big(0_M, S_{kk}(y(i))\big) \quad \text{as } i \to \infty, \quad \forall\, 1 \le k \le n.$$
The matrix $S(y(i))$ is called the asymptotic variance of the observation sequence $\{y(i)\}_{i \ge 0}$, and $S_{kk}(y) \in \mathbb{R}^{M \times M}$ denotes the $k$-th principal block of $S(y(i))$.

In the following, we analyze the asymptotic normality of the DRC algorithm. Let $\lambda_{\min}(\gamma L \otimes E_M + \tilde{H})$ denote the smallest eigenvalue of $[\gamma L \otimes E_M + \tilde{H}]$. Recalling the noise covariance $S_\epsilon$ in (10), we present the following theorem to establish the asymptotic normality of the DRC algorithm.

Theorem 3. Suppose the DRC algorithm runs under Assumptions A.1-A.3 (Section III-A), with weight sequences $\{\alpha(i)\}_{i \ge 0}$ and $\{\beta(i)\}_{i \ge 0}$ given by
$$\alpha(i) = \frac{a}{i + 1}, \qquad \lim_{i \to \infty} \frac{\alpha(i)}{\beta(i)} = \gamma > 0,$$
for some $a > 0$, and let the record sequence $\{u(i)\}_{i \ge 0}$ be the state sequence generated by (5). Then, for $a > \frac{1}{2\lambda_{\min}(\gamma L \otimes E_M + \tilde{H})}$, we obtain
$$\sqrt{i}\,\big(u(i) - 1_{|V|} \otimes y^*\big) \Rightarrow \mathcal{N}\big(0, S(y(i))\big),$$
where
$$S(y(i)) = a^2 \int_0^{\infty} e^{\Sigma v}\, S_0\, e^{\Sigma v}\, dv, \qquad (12)$$
$$\Sigma = -a\big[\gamma L \otimes E_M + \tilde{H}\big] + \tfrac{1}{2} E_{M|V|}, \qquad (13)$$
and
$$S_0 = \bar{H} S_\epsilon \bar{H}^{T}. \qquad (14)$$
In particular, the record sequence $\{u_k(i)\}_{i \ge 0}$ at any node $k$ is asymptotically normal:
$$\sqrt{i}\,\big(u_k(i) - y^*\big) \Rightarrow \mathcal{N}\big(0, S_{kk}(y(i))\big).$$

We provide the proof in Appendix B. Therefore, the error sequence $\{u_k(i) - y^*\}_{i \ge 0}$ at each node can be regarded as converging to a normal distribution at a rate of $1/\sqrt{i}$.
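The asymptotic covariance in (12)-(14) can also be evaluated numerically: because $\gamma L \otimes E_M + \tilde{H}$ is symmetric, $\Sigma$ in (13) is symmetric and Hurwitz whenever $a > \frac{1}{2\lambda_{\min}(\gamma L \otimes E_M + \tilde{H})}$, so the integral in (12) is the unique solution of the Lyapunov equation $\Sigma S + S \Sigma^{T} = -a^2 S_0$. The following Python sketch illustrates this on a small toy network; the graph, the observation blocks $H_k$, the identity noise covariance, and the dimensions are our own illustrative assumptions, not values from the paper.

```python
import numpy as np
from scipy.linalg import block_diag, solve_continuous_lyapunov

# Toy setup (ours, for illustration): |V| = 3 nodes, M = 2 entries per node.
M = 2
L = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])                      # Laplacian of the path graph 1-2-3
H_bar = block_diag(np.diag([1., 2.]),                # per-node observation blocks H_k
                   np.diag([2., 1.]),
                   np.eye(2))
H_tilde = H_bar @ H_bar.T                            # H_tilde = H_bar H_bar^T, as implied by (15)-(18)
S_eps = np.eye(H_bar.shape[0])                       # placeholder noise covariance S_epsilon

gamma = 1.0
A = gamma * np.kron(L, np.eye(M)) + H_tilde          # gamma * (L (x) E_M) + H_tilde
lam_min = np.linalg.eigvalsh(A).min()
a = 1.0 / (2.0 * lam_min) + 1.0                      # any a > 1 / (2 * lambda_min) is admissible

Sigma = -a * A + 0.5 * np.eye(A.shape[0])            # (13); Hurwitz since a * lam_min > 1/2
S0 = H_bar @ S_eps @ H_bar.T                         # (14)

# (12): S = a^2 * int_0^inf exp(Sigma v) S0 exp(Sigma v) dv  (Sigma is symmetric here),
# i.e. the solution of the Lyapunov equation  Sigma S + S Sigma^T = -a^2 S0.
S = solve_continuous_lyapunov(Sigma, -a**2 * S0)
print("per-node asymptotic covariance S_11:\n", S[:M, :M])
```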
Up to now, we have presented the asymptotic unbiasedness, the consistency, and the asymptotic normality of the DRC algorithm. In the next section, we present the main properties of the DDA algorithm.

B. Main Properties of DDA

In this section, we prove the convergence of the running average $\hat{\theta}_k(T)$ to the optimal parameter $\theta^*$ and derive the convergence rate of the DDA algorithm. We now present the following theorem.

Theorem 4. Let the random families $\{\theta_k(t)\}_{t=0}^{\infty}$ and $\{\mu_k(t)\}_{t=0}^{\infty}$ be generated by iterations (8) and (7), with a positive non-decreasing step-size sequence $\{\omega(t)\}_{t=0}^{\infty}$, where $\phi$ is strongly convex with respect to the norm $\|\cdot\|$ with dual norm $\|\cdot\|_*$. Let the record error $\big\|\mathbb{E}[\hat{y}_k] - 1_{|V|} \otimes y^*\big\|$ be bounded by an arbitrarily small constant $C_{\mathrm{err}}$. For any $\theta^* \in \Theta$ and each node $k \in V$, we have
$$f\big(\hat{\theta}_k(T), \hat{y}_k\big) - f\big(\theta^*, y^*\big) \le \mathrm{OPT} + \mathrm{NET} + \mathrm{SAMP},$$
where
$$\mathrm{OPT} = \frac{\omega(T)}{T}\,\phi(\theta^*) + \frac{L^2}{2T\tau} \sum_{t=1}^{T} \frac{1}{\omega(t)},$$
$$\mathrm{NET} = \frac{L}{T} \sum_{t=1}^{T} \frac{1}{\omega(t)}\, \mathbb{E}\Bigg[\frac{2}{|V|} \sum_{j=1}^{|V|} \big\|\bar{\mu}(t) - \mu_j(t)\big\|_* + \big\|\bar{\mu}(t) - \mu_k(t)\big\|_*\Bigg],$$
$$\mathrm{SAMP} = L\, C_{\mathrm{err}}, \qquad \bar{\mu}(t) = \frac{1}{|V|} \sum_{k=1}^{|V|} \mu_k(t).$$
Recall that $\tau$ is the convexity parameter.

Theorem 4 explicitly bounds the gap between the estimated result and the true optimum. The gap is bounded by the sum of three error terms: (1) the OPT term can be viewed as the optimization error; (2) the NET term is induced by the disagreement among the nodes' estimates; and (3) the SAMP term is incurred by the noise in the inputs. The theorem indicates the relationship between this gap and $T$, which helps us understand the convergence of the DDA algorithm. The detailed proof will be given in Section V.
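As a quick check of the OPT term (this short calculation is ours and only restates the bound above), substituting the step size $\omega(t) = A\sqrt{t}$ that will be used in Theorem 5 below and using $\sum_{t=1}^{T} t^{-1/2} \le 2\sqrt{T}$ gives
$$\mathrm{OPT} = \frac{\omega(T)}{T}\,\phi(\theta^*) + \frac{L^2}{2T\tau} \sum_{t=1}^{T} \frac{1}{\omega(t)} \le \frac{A\,\phi(\theta^*)}{\sqrt{T}} + \frac{L^2}{2T\tau} \cdot \frac{2\sqrt{T}}{A} = \frac{1}{\sqrt{T}}\Big(A\,\phi(\theta^*) + \frac{L^2}{A\tau}\Big),$$
so the optimization error alone already decays at the $O(1/\sqrt{T})$ rate that appears in Theorem 5.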
We next investigate the relationship between the convergence rate and the spectral property of the network. For a given graph $\mathcal{G}$, we assume that communications between nodes are governed by a doubly stochastic matrix $P$. In the following, we show that the spectral gap of the network, i.e., $\gamma(P) = 1 - \sigma_2(P)$, severely influences the convergence rate of DDA, where $\sigma_2(P)$ is the second largest singular value of $P$.

Theorem 5. Following Theorem 4 and recalling that $\phi(\theta^*) \le A^2$, if we choose the step size $\omega(t)$ and the record error $\big\|\mathbb{E}[\hat{y}_k] - 1_{|V|} \otimes y^*\big\|$ as
$$\omega(t) = A\sqrt{t} \qquad \text{and} \qquad C_{\mathrm{err}} = \frac{2L}{A\sqrt{T}} \cdot \frac{\ln(T\sqrt{|V|})}{1 - \sigma_2(P)},$$
then
$$f\big(\hat{\theta}_k(T), \hat{y}_k\big) - f\big(\theta^*, y^*\big) \le \frac{16L^2}{A\sqrt{T}} \cdot \frac{\ln(T\sqrt{|V|})}{1 - \sigma_2(P)}, \quad \text{for all } k \in V.$$

We defer the proof to Section V-C. Theorem 5 shows that the convergence rate of distributed subgradient methods heavily relies on the spectral property of the graph. The dependence on the spectral quantity $1 - \sigma_2(P)$ is quite natural, since many works have observed that the propagation of information depends heavily on the spectral property of the underlying network.

Having presented all the main properties of our algorithms, we now turn to the detailed proof of each theorem.

V. PROOF OF THEOREMS

A. Proof of Theorem 1

Proof: Taking the expectation of both sides of Eq. (5), it follows that
$$\mathbb{E}[u(i+1)] = \mathbb{E}[u(i)] - \alpha(i)(L \otimes E_M)\,\mathbb{E}[u(i)] + \beta(i)\bar{H}\,\mathbb{E}[y(i)] - \beta(i)\bar{H}\bar{H}^{T}\mathbb{E}[u(i)]. \qquad (15)$$
Given that
$$(L \otimes E_M)\big(1_{|V|} \otimes y^*\big) = 0_{|V|M}, \qquad (16)$$
$$\tilde{H}\big(1_{|V|} \otimes y^*\big) = \bar{H}\,\mathbb{E}[y(i)], \qquad (17)$$
subtracting $1_{|V|} \otimes y^*$ from both sides of Eq. (15), we have
$$\mathbb{E}[u(i+1)] - 1_{|V|} \otimes y^* = \big[E_{|V|M} - \alpha(i)\,L \otimes E_M - \beta(i)\tilde{H}\big]\big[\mathbb{E}[u(i)] - 1_{|V|} \otimes y^*\big]. \qquad (18)$$
Continuing the iteration in (18), we have, for each $i \ge i_0 = \max\{i_1, i_2\}$,
$$\big\|\mathbb{E}[u(i)] - 1_{|V|} \otimes y^*\big\| \le \Bigg(\prod_{j=i_0}^{i-1} \big\|E_{M|V|} - \alpha(j)\,L \otimes E_M - \beta(j)\tilde{H}\big\|\Bigg) \times \big\|\mathbb{E}[u(i_0)] - 1_{|V|} \otimes y^*\big\|. \qquad (19)$$
To further bound this expression, we use the following facts.

First, since $\frac{\alpha(i)}{\beta(i)} \to \gamma$, we have
$$\exists\, i_1 : \ \frac{\gamma}{2} \le \frac{\alpha(i)}{\beta(i)} \le 2\gamma, \quad \forall i \ge i_1. \qquad (20)$$
Second, let $\lambda_{\min}(\gamma L \otimes E_M + \tilde{H})$ be the smallest eigenvalue of the positive definite matrix $[\gamma L \otimes E_M + \tilde{H}]$ (due to the page limit, we omit the proof of the positive definiteness of this matrix). Since $\alpha(i) \to 0$, we have
$$\exists\, i_2 : \ \alpha(i) \le \frac{1}{\lambda_{\min}(\gamma L \otimes E_M + \tilde{H})}, \quad \forall i \ge i_2. \qquad (21)$$
Third, the other facts include: 1) $\lambda_{\min}(A + B) \ge \lambda_{\min}(A) + \lambda_{\min}(B)$ (Courant-Fischer minimax theorem [41]); 2) $\lambda_{\min}(L \otimes E_M) = \lambda_{\min}(L) \ge 0$.

Based on the above facts and (21), each factor in (19) satisfies, for each $j \ge i_0$,
$$\big\|E_{M|V|} - \alpha(j)\,L \otimes E_M - \beta(j)\tilde{H}\big\| = \Big\|E_{M|V|} - \beta(j)\Big(\tfrac{\alpha(j)}{\beta(j)}\, L \otimes E_M + \tilde{H}\Big)\Big\|$$
$$= 1 - \beta(j)\,\lambda_{\min}\Big(\tfrac{\alpha(j)}{\beta(j)}\, L \otimes E_M + \tilde{H}\Big)$$
$$= 1 - \beta(j)\,\lambda_{\min}\Big(\Big(\tfrac{\alpha(j)}{\beta(j)} - \tfrac{\gamma}{2}\Big) L \otimes E_M + \tfrac{\gamma}{2}\, L \otimes E_M + \tilde{H}\Big)$$
$$\le 1 - \beta(j)\,\lambda_{\min}\Big(\tfrac{\gamma}{2}\, L \otimes E_M + \tilde{H}\Big). \qquad (22)$$
From (19) and (22), we now have, for each $i > i_0$,
$$\big\|\mathbb{E}[u(i)] - 1_{|V|} \otimes y^*\big\| \le \Bigg(\prod_{j=i_0}^{i-1} \Big(1 - \beta(j)\,\lambda_{\min}\Big(\tfrac{\gamma}{2}\, L \otimes E_M + \tilde{H}\Big)\Big)\Bigg) \times \big\|\mathbb{E}[u(i_0)] - 1_{|V|} \otimes y^*\big\|. \qquad (23)$$
Finally, from the inequality $1 - a \le e^{-a}$, $0 \le a \le 1$, we get
$$\big\|\mathbb{E}[u(i)] - 1_{|V|} \otimes y^*\big\| \le \exp\Bigg[-\lambda_{\min}\Big(\tfrac{\gamma}{2}\, L \otimes E_M + \tilde{H}\Big) \sum_{j=i_0}^{i-1} \beta(j)\Bigg] \times \big\|\mathbb{E}[u(i_0)] - 1_{|V|} \otimes y^*\big\|, \quad i > i_0. \qquad (24)$$
Since $\lambda_{\min}(\tfrac{\gamma}{2}\, L \otimes E_M + \tilde{H}) \ge \tfrac{1}{2}\lambda_{\min}(\gamma L \otimes E_M + \tilde{H}) > 0$ and $\sum_{j} \beta(j)$ diverges, we have
$$\lim_{i \to \infty} \big\|\mathbb{E}[u(i)] - 1_{|V|} \otimes y^*\big\| = 0.$$
Thus we complete the proof.
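To visualize the contraction argument in (18)-(24), the short sketch below iterates the expected-error recursion $e(i+1) = [E_{M|V|} - \alpha(i)\,L \otimes E_M - \beta(i)\tilde{H}]\,e(i)$ on a toy network. The path-graph Laplacian, the deliberately trivial observation blocks $H_k = E_M$, the gain $a$, and the initial error are illustrative assumptions of ours; the printed norm decays toward zero, in line with (24).

```python
import numpy as np

# Expected-error recursion from (18):
#   e(i+1) = [E_{M|V|} - alpha(i) * (L (x) E_M) - beta(i) * H_tilde] e(i),
# with alpha(i) = a / (i + 1) and beta(i) = alpha(i) / gamma.
M = 2
L = np.array([[ 1., -1.,  0.],
              [-1.,  2., -1.],
              [ 0., -1.,  1.]])          # path-graph Laplacian (toy topology)
LxE = np.kron(L, np.eye(M))
H_tilde = np.eye(3 * M)                  # H_k = E_M at every node: simplest observable case

a, gamma = 0.4, 1.0                      # small gain keeps every factor below 1 in magnitude here
rng = np.random.default_rng(0)
e = rng.standard_normal(3 * M)           # initial expected error E[u(0)] - 1 (x) y*
print("initial error norm:", np.linalg.norm(e))

for i in range(2000):
    alpha = a / (i + 1)
    beta = alpha / gamma
    e = e - alpha * (LxE @ e) - beta * (H_tilde @ e)

print("error norm after 2000 steps:", np.linalg.norm(e))   # decays toward 0
```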
B. Proof of Theorem 4

Before proving the convergence theorem, we present some basic assumptions and necessary lemmas.

(A.4) A prox-function $\phi : \Theta \to \mathbb{R}$ exists that is $\tau$-strongly convex with respect to the norm $\|\cdot\|$, i.e.,
$$\phi(\theta_1) \ge \phi(\theta_2) + \langle \nabla_\theta \phi(\theta_2),\, \theta_1 - \theta_2 \rangle + \frac{\tau}{2}\,\|\theta_1 - \theta_2\|^2, \quad \text{for } \theta_1, \theta_2 \in \Theta.$$
The function $\phi$ is non-negative over $\Theta$ and $\phi(0) = 0$. The prox-center of $\Theta$ is given by $\theta_0 = \arg\min_{\theta}\{\phi(\theta) : \theta \in \Theta\}$.
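For concreteness (a standard example of ours, not taken from the paper), the Euclidean prox-function
$$\phi(\theta) = \tfrac{1}{2}\,\|\theta\|_2^2$$
satisfies (A.4) with $\tau = 1$ for the norm $\|\cdot\| = \|\cdot\|_2$ (whose dual norm is again $\|\cdot\|_2$): it is non-negative, $\phi(0) = 0$, and its prox-center is $\theta_0 = 0$ whenever $0 \in \Theta$.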