• Maximum-a-posteriori (MAP) inference: find the maximum-a-posteriori (MAP) probabilities P*_{A,I}(·) for the configurations over A, where for every σ_A ∈ Q^A,

    P*_{A,I}(σ_A) ≜ max_{τ_B ∈ Q^B} µ_{A∪B,I}(σ_A, τ_B).

All these fundamental inference problems can be described abstractly by a function θ : M → R^K. For instance, for marginal inference, M contains all MRF instances where A is a subset of the vertices, K = |Q|^{|A|}, and θ(I) = (µ_{A,I}(σ_A))_{σ_A ∈ Q^A}; and for posterior or MAP inference, M contains all MRF instances where A ⊎ B is a subset of the vertices, K = |Q|^{|A|}, and θ(I) = (µ_{A,I}(σ_A | τ_B))_{σ_A ∈ Q^A} (for posterior inference) or θ(I) = (P*_{A,I}(σ_A))_{σ_A ∈ Q^A} (for MAP inference).

One canonical approach for probabilistic inference is by sampling: sufficiently many independent samples are drawn (approximately) from the Gibbs distribution of the MRF instance, and an estimate of the target probabilities is calculated from these samples. Given a probabilistic inference problem θ(·), we use E_θ(·) to denote an estimating function that approximates θ(I) using independent samples drawn approximately from µ_I. For the aforementioned problems of marginal, posterior and MAP inference, such an estimating function E_θ(·) simply counts the frequency of the samples that satisfy certain properties.

The sampling cost of an estimator is captured in two aspects: the number of samples it uses and the accuracy of each individual sample it requires.

Definition 2.1 ((N, ϵ)-estimator for θ). Let θ : M → R^K be a probabilistic inference problem and E_θ(·) an estimating function for θ(·) that, for each instance I = (V, E, Q, Φ) ∈ M, maps samples in Q^V to an estimate of θ(I). Let N : N⁺ → N⁺ and ϵ : N⁺ → (0, 1). For any instance I = (V, E, Q, Φ) ∈ M where n = |V|, the random variable E_θ(X^(1), . . . , X^(N(n))) is said to be an (N, ϵ)-estimator for θ(I) if X^(1), . . . , X^(N(n)) ∈ Q^V are N(n) independent samples drawn approximately from µ_I such that d_TV(X^(j), µ_I) ≤ ϵ(n) for all 1 ≤ j ≤ N(n).
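As a concrete illustration of Definition 2.1, the frequency-counting estimating function E_θ for marginal inference can be sketched as follows. This is a minimal sketch, not the paper's implementation; the function name `marginal_estimator` and the representation of configurations as dicts from vertex to spin are assumptions for illustration.

```python
from collections import Counter
from itertools import product

def marginal_estimator(samples, A, Q):
    """Frequency-counting estimate of the marginal over the vertex set A.

    `samples` is a list of configurations (dicts mapping vertex -> spin in Q),
    assumed to be drawn approximately from the Gibbs distribution mu_I.
    Returns a dict mapping each sigma_A in Q^A to its empirical frequency,
    an estimate of mu_{A,I}(sigma_A).
    """
    counts = Counter(tuple(x[v] for v in A) for x in samples)
    N = len(samples)
    return {sigma: counts[sigma] / N for sigma in product(Q, repeat=len(A))}

# Toy usage: four samples over two vertices with spins {0, 1}.
samples = [{'u': 0, 'v': 1}, {'u': 0, 'v': 0}, {'u': 1, 'v': 1}, {'u': 0, 'v': 1}]
est = marginal_estimator(samples, A=['u'], Q=[0, 1])
# est[(0,)] == 0.75 and est[(1,)] == 0.25
```

The posterior and MAP estimators are analogous: the former counts frequencies among the samples consistent with τ_B, and the latter reports the most frequent configuration.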
2.3. Dynamic inference problem. We consider the inference problem where the input graphical model is changed dynamically: at each step, the current MRF instance I = (V, E, Q, Φ) is updated to a new instance I′ = (V′, E′, Q, Φ′). We consider general update operations for MRFs that can change both the underlying graph and all edge/vertex potentials simultaneously, where the update request is made by a non-adaptive adversary independently of the randomness used by the inference algorithm. Such updates are general enough to cover many applications, e.g. analyses of time-series network data [CW+07, QS93, QS92, AQ+17], and learning algorithms for graphical models [Hin12, WJ08]. The difference between the original and the updated instances is measured as follows.

Definition 2.2 (difference between MRF instances). The difference between two MRF instances I = (V, E, Q, Φ) and I′ = (V′, E′, Q, Φ′), where Φ = (φ_a)_{a ∈ V∪E} and Φ′ = (φ′_a)_{a ∈ V′∪E′}, is defined as

(2)    d(I, I′) ≜ Σ_{v ∈ V∩V′} ‖φ_v − φ′_v‖₁ + Σ_{e ∈ E∩E′} ‖φ_e − φ′_e‖₁ + |V ⊕ V′| + |E ⊕ E′|,

where A ⊕ B = (A \ B) ∪ (B \ A) stands for the symmetric difference between two sets A and B, ‖φ_v − φ′_v‖₁ ≜ Σ_{c ∈ Q} |φ_v(c) − φ′_v(c)|, and ‖φ_e − φ′_e‖₁ ≜ Σ_{c,c′ ∈ Q} |φ_e(c, c′) − φ′_e(c, c′)|.

Given a probability vector specified by the function θ : M → R^K, the dynamic inference problem asks to maintain an estimator θ̂(I) of θ(I) for the current MRF instance I = (V, E, Q, Φ) ∈ M, with a data structure, such that when I is updated to I′ = (V′, E′, Q, Φ′) ∈ M, the algorithm updates θ̂(I) to an estimator θ̂(I′) of the new vector θ(I′), or equivalently, outputs the difference between the estimators θ̂(I) and θ̂(I′).

It is desirable to have a dynamic inference algorithm which maintains an (N, ϵ)-estimator for θ(I) for the current instance I.
However, the dynamic algorithm cannot be efficient if N(n) and ϵ(n) change drastically with n (so that significantly more samples or substantially more accurate samples may be needed when a new vertex is added), or if recalculating the estimating function E_θ(·) itself is expensive. We introduce a notion of dynamical efficiency for the estimators that are suitable for dynamic inference.
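The distance d(I, I′) of Definition 2.2 can be computed directly from the two instances. Below is a minimal sketch, assuming vertex potentials are stored as dicts c → φ_v(c), edge potentials as dicts (c, c′) → φ_e(c, c′) keyed by frozenset edges; the function name `mrf_difference` is a hypothetical label, not from the paper.

```python
def mrf_difference(V, E, phi, V2, E2, phi2):
    """d(I, I') from Definition 2.2: L1 differences of shared vertex/edge
    potentials plus the sizes of the symmetric differences of V and E."""
    def l1(p, q):
        keys = set(p) | set(q)
        return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

    d = sum(l1(phi[v], phi2[v]) for v in V & V2)      # shared vertices
    d += sum(l1(phi[e], phi2[e]) for e in E & E2)     # shared edges
    d += len(V ^ V2) + len(E ^ E2)                    # symmetric differences
    return d

# Toy usage: perturb one vertex potential by 0.1 and add one isolated vertex.
V, E = {'u', 'v'}, {frozenset({'u', 'v'})}
phi = {'u': {0: 1.0, 1: 0.0}, 'v': {0: 0.5, 1: 0.5},
       frozenset({'u', 'v'}): {(0, 0): 1.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 1.0}}
V2, E2 = {'u', 'v', 'w'}, set(E)
phi2 = dict(phi)
phi2['u'] = {0: 1.1, 1: 0.0}
phi2['w'] = {0: 0.0, 1: 0.0}
d = mrf_difference(V, E, phi, V2, E2, phi2)  # d = 0.1 + 1 ≈ 1.1
```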
Definition 2.3 (dynamical efficiency). Let N : N⁺ → N⁺ and ϵ : N⁺ → (0, 1). Let E(·) be an estimating function for some K-dimensional probability vector of MRF instances. A tuple (N, ϵ, E) is said to be dynamically efficient if it satisfies:
• (bounded difference) there exist constants C₁, C₂ > 0 such that for any n ∈ N⁺,

    |N(n + 1) − N(n)| ≤ C₁ · N(n)/n  and  |ϵ(n + 1) − ϵ(n)| ≤ C₂ · ϵ(n)/n;

• (small incremental cost) there is a deterministic algorithm that maintains E(X^(1), . . . , X^(m)) using (mn + K) · polylog(mn) bits, where X^(1), . . . , X^(m) ∈ Q^V and n = |V|, such that when X^(1), . . . , X^(m) ∈ Q^V are updated to Y^(1), . . . , Y^(m′) ∈ Q^{V′}, where n′ = |V′|, the algorithm updates E(X^(1), . . . , X^(m)) to E(Y^(1), . . . , Y^(m′)) within time cost D · polylog(mm′nn′) + O(m + m′), where D is the size of the difference between the two sample sequences, defined as:

(3)    D ≜ Σ_{i ≤ max{m, m′}} Σ_{v ∈ V∪V′} 1[X^(i)(v) ≠ Y^(i)(v)],

where an unassigned X^(i)(v) or Y^(i)(v) is not equal to any assigned spin.

Dynamical efficiency basically asks N(·), ϵ(·), and E(·) to have some sort of “Lipschitz” property. To satisfy the bounded difference condition, N(n) and 1/ϵ(n) are necessarily polynomially bounded, and can be any constant, polylogarithmic, or polynomial functions, or products of such functions. The small incremental cost condition also holds very commonly. In particular, it is satisfied by the estimating functions for all the aforementioned problems of marginal, posterior and MAP inference as long as the sets of variables have sizes |A|, |B| ∈ O(log n). We remark that the O(log n) upper bound is somehow necessary for the efficiency of inference, because otherwise the dimension of θ(I) itself (which is at least q^{|A|}) becomes super-polynomial in n.

3. Main results

Let I = (V, E, Q, Φ) be an MRF instance, where G = (V, E). Let Γ_G(v) denote the neighborhood of v in G.
For any vertex v ∈ V and any configuration σ ∈ Q^{Γ_G(v)}, we use µ^σ_{v,I}(·) = µ_{v,I}(· | σ) to denote the marginal distribution on v conditional on σ:

    ∀c ∈ Q : µ^σ_{v,I}(c) = µ_{v,I}(c | σ) ≜ exp(φ_v(c) + Σ_{u ∈ Γ_G(v)} φ_{uv}(σ_u, c)) / Σ_{a ∈ Q} exp(φ_v(a) + Σ_{u ∈ Γ_G(v)} φ_{uv}(σ_u, a)).

Due to the assumption in (1), the marginal distribution is always well-defined. The following condition is the Dobrushin–Shlosman condition [DS85a, DS85b, DS87, Hay06, DGJ08].

Condition 3.1 (Dobrushin–Shlosman condition). Let I = (V, E, Q, Φ) be an MRF instance with Gibbs distribution µ = µ_I. Let A_I ∈ R^{V×V}_{≥0} be the influence matrix, defined as

    A_I(u, v) ≜ max_{(σ,τ) ∈ B_{u,v}} d_TV(µ^σ_v, µ^τ_v)  if {u, v} ∈ E,  and  A_I(u, v) ≜ 0  if {u, v} ∉ E,

where the maximum is taken over the set B_{u,v} of all (σ, τ) ∈ Q^{Γ_G(v)} × Q^{Γ_G(v)} that differ only at u, and d_TV(µ^σ_v, µ^τ_v) ≜ (1/2) Σ_{c ∈ Q} |µ^σ_v(c) − µ^τ_v(c)| is the total variation distance between µ^σ_v and µ^τ_v. An MRF instance I is said to satisfy the Dobrushin–Shlosman condition if there is a constant δ > 0 such that

    max_{u ∈ V} Σ_{v ∈ V} A_I(u, v) ≤ 1 − δ.

Our main theorem assumes the following setup: Let θ : M → R^K be a probabilistic inference problem that maps each MRF instance in M to a K-dimensional probability vector, and let E_θ be its estimating function. Let N : N⁺ → N⁺ and ϵ : N⁺ → (0, 1). We use I = (V, E, Q, Φ) ∈ M, where n = |V|, to denote the current instance and I′ = (V′, E′, Q, Φ′) ∈ M, where n′ = |V′|, to denote the updated instance.
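The conditional marginal and the influence matrix entries A_I(u, v) can be evaluated by brute force on small instances (the enumeration is exponential in deg(v), so this is only a sanity check, not part of the paper's algorithm; the function names are hypothetical):

```python
import math
from itertools import product

def conditional_marginal(sigma, Q, phi_v, phi_edges):
    """mu^sigma_v: marginal at v given neighbor spins sigma[u].
    phi_edges[u] is a dict (c_u, c_v) -> phi_{uv}(c_u, c_v)."""
    w = {c: math.exp(phi_v[c] + sum(phi_edges[u][(sigma[u], c)] for u in sigma))
         for c in Q}
    Z = sum(w.values())
    return {c: w[c] / Z for c in Q}

def influence(u, nbrs_v, Q, phi_v, phi_edges):
    """A_I(u, v): worst-case d_TV between the marginals at v under two
    boundary configurations differing only at the neighbor u."""
    best = 0.0
    others = [w for w in nbrs_v if w != u]
    for rest in product(Q, repeat=len(others)):
        base = dict(zip(others, rest))
        for a in Q:
            for b in Q:
                if a == b:
                    continue
                p = conditional_marginal({**base, u: a}, Q, phi_v, phi_edges)
                q = conditional_marginal({**base, u: b}, Q, phi_v, phi_edges)
                best = max(best, 0.5 * sum(abs(p[c] - q[c]) for c in Q))
    return best

# Toy check: single edge {u, v}, Ising-type potential with beta = 0.5;
# the influence of u on v is then tanh(beta / 2).
beta = 0.5
phi_v = {0: 0.0, 1: 0.0}
phi_edges = {'u': {(a, c): beta * (a == c) for a in (0, 1) for c in (0, 1)}}
infl = influence('u', ['u'], (0, 1), phi_v, phi_edges)
```

Checking max_u Σ_v A_I(u, v) ≤ 1 − δ over all vertices then verifies Condition 3.1 directly.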
Theorem 3.2 (dynamic inference algorithm). Assume that (N, ϵ, E_θ) is dynamically efficient, both I and I′ satisfy the Dobrushin–Shlosman condition, and d(I, I′) ≤ L = o(n). There is an algorithm that maintains an (N, ϵ)-estimator θ̂(I) of the probability vector θ(I) for the current MRF instance I, using Õ(nN(n) + K) bits, such that when I is updated to I′, the algorithm updates θ̂(I) to an (N, ϵ)-estimator θ̂(I′) of θ(I′) for the new instance I′, within expected time cost

    Õ(∆²LN(n) + ∆n),

where Õ(·) hides a polylog(n) factor, ∆ = max{∆_G, ∆_{G′}}, and ∆_G and ∆_{G′} denote the maximum degrees of G = (V, E) and G′ = (V′, E′) respectively.

Typically, the difference between two MRF instances I, I′ is small², and the underlying graphs are sparse [DSOR16], that is, L, ∆ ≤ polylog(n). In such cases, our algorithm updates the estimator within time cost Õ(N(n) + n), which significantly outperforms static sampling-based inference algorithms that require time cost Ω(n′N(n′)) = Ω(nN(n)) for redrawing all N(n′) independent samples.

Dynamic sampling. The core of our dynamic inference algorithm is a dynamic algorithm for sampling: assuming the Dobrushin–Shlosman condition, the algorithm can maintain a sequence of N(n) independent samples X^(1), . . . , X^(N(n)) ∈ Q^V that are ϵ(n)-close to µ_I in total variation distance, and when I is updated to I′ with difference d(I, I′) ≤ L = o(n), the algorithm can update the maintained samples to N(n′) independent samples Y^(1), . . . , Y^(N(n′)) ∈ Q^{V′} that are ϵ(n′)-close to µ_{I′} in total variation distance, using a time cost Õ(∆²LN(n) + ∆n) in expectation. This dynamic sampling algorithm is formally described in Theorem 6.1 and is of independent interest [FVY19].

Applications on specific models. On specific models, we have the following results, where δ > 0 is an arbitrary constant.
model      | regime                        | space cost     | time cost for each update
Ising      | e^{−2|β|} ≥ 1 − (2−δ)/(∆+1)   | Õ(nN(n) + K)   | Õ(∆²LN(n) + ∆n)
hardcore   | λ ≤ (2−δ)/(∆−2)               | Õ(nN(n) + K)   | Õ(∆³LN(n) + ∆n)
q-coloring | q ≥ (2+δ)∆                    | Õ(nN(n) + K)   | Õ(∆²LN(n) + ∆n)

Table 1. Dynamic inference for specific models.

The results for the Ising model and q-coloring are corollaries of Theorem 3.2. The regime for the hardcore model is better than the Dobrushin–Shlosman condition (which is λ ≤ (1−δ)/(∆−1)), because we use the coupling introduced by Vigoda [Vig99] to analyze the algorithm.

4. Preliminaries

Total variation distance and coupling. Let µ and ν be two distributions over Ω. The total variation distance between µ and ν is defined as

    d_TV(µ, ν) ≜ (1/2) Σ_{x ∈ Ω} |µ(x) − ν(x)|.

A coupling of µ and ν is a joint distribution (X, Y) ∈ Ω × Ω such that the marginal distribution of X is µ and the marginal distribution of Y is ν. The following coupling lemma is well-known.

² In multivariate time-series data analysis, the MRF instances at two sequential times are similar. In iterative algorithms for learning graphical models, the differences between sequential MRF instances generated by gradient descent are bounded to prevent oscillations. Specifically, the difference is very small when the iterative algorithm is close to convergence [Hin12, WJ08].
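The total variation distance above, together with a coupling that attains equality in the coupling lemma, can be sketched as follows (a minimal sketch over finite distributions given as dicts; the greedy pairing of leftover mass is one standard construction, not the paper's):

```python
def total_variation(mu, nu):
    """d_TV(mu, nu) = (1/2) * sum_x |mu(x) - nu(x)|."""
    keys = set(mu) | set(nu)
    return 0.5 * sum(abs(mu.get(x, 0.0) - nu.get(x, 0.0)) for x in keys)

def optimal_coupling(mu, nu):
    """A coupling of mu and nu with Pr[X != Y] = d_TV(mu, nu):
    put the shared mass min(mu(x), nu(x)) on the diagonal, then pair the
    leftover masses of mu and nu greedily off the diagonal.
    Returns a dict (x, y) -> joint probability."""
    keys = sorted(set(mu) | set(nu))
    joint, excess_mu, excess_nu = {}, {}, {}
    for x in keys:
        shared = min(mu.get(x, 0.0), nu.get(x, 0.0))
        if shared > 0:
            joint[(x, x)] = shared
        excess_mu[x] = mu.get(x, 0.0) - shared
        excess_nu[x] = nu.get(x, 0.0) - shared
    for x in keys:
        for y in keys:
            m = min(excess_mu[x], excess_nu[y])
            if m > 0:
                joint[(x, y)] = joint.get((x, y), 0.0) + m
                excess_mu[x] -= m
                excess_nu[y] -= m
    return joint

mu = {'a': 0.5, 'b': 0.5}
nu = {'a': 0.2, 'b': 0.8}
pi = optimal_coupling(mu, nu)
mismatch = sum(p for (x, y), p in pi.items() if x != y)
# mismatch equals d_TV(mu, nu) = 0.3, matching the coupling lemma with equality
```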
Proposition 4.1 (coupling lemma). For any coupling (X, Y) of µ and ν, it holds that

    Pr[X ≠ Y] ≥ d_TV(µ, ν).

Furthermore, there is an optimal coupling that achieves equality.

Local neighborhood. Let G = (V, E) be a graph. For any vertex v ∈ V, let Γ_G(v) ≜ {u ∈ V | {u, v} ∈ E} denote the neighborhood of v, and Γ⁺_G(v) ≜ Γ_G(v) ∪ {v} the inclusive neighborhood of v. We simply write Γ_v = Γ(v) = Γ_G(v) and Γ⁺_v = Γ⁺(v) = Γ⁺_G(v) for short when G is clear from the context. We use ∆ = ∆_G ≜ max_{v ∈ V} |Γ_v| to denote the maximum degree of the graph G.

A notion of local neighborhood for MRFs is frequently used. Let I = (V, E, Q, Φ) be an MRF instance. For v ∈ V, we denote by I_v ≜ I[Γ⁺_v] the restriction of I to the inclusive neighborhood Γ⁺_v of v, i.e. I_v = (Γ⁺_v, E_v, Q, Φ_v), where E_v = {{u, v} ∈ E} and Φ_v = (φ_a)_{a ∈ Γ⁺_v ∪ E_v}.

Gibbs sampling. Gibbs sampling (a.k.a. heat-bath, Glauber dynamics) is a classic Markov chain for sampling from Gibbs distributions. Let I = (V, E, Q, Φ) be an MRF instance and µ = µ_I its Gibbs distribution. The chain of Gibbs sampling (Algorithm 1) is on the space Ω ≜ Q^V, and has the stationary distribution µ_I [LP17, Chapter 3].

Algorithm 1: Gibbs sampling
  Initialization: an initial state X₀ ∈ Ω (not necessarily feasible);
  1 for t = 1, 2, . . . , T do
  2     pick v_t ∈ V uniformly at random;
  3     draw a random value c ∈ Q from the marginal distribution µ_{v_t}(· | X_{t−1}(Γ_{v_t}));
  4     X_t(v_t) ← c and X_t(u) ← X_{t−1}(u) for all u ∈ V \ {v_t};

Marginal distributions. Here µ_v(· | σ(Γ_v)) = µ_{v,I}(· | σ(Γ_v)) denotes the marginal distribution at v ∈ V conditioned on σ(Γ_v) ∈ Q^{Γ_v}, which is computed as:

(4)    ∀c ∈ Q : µ_v(c | σ(Γ_v)) = φ_v(c) Π_{u ∈ Γ_v} φ_{uv}(σ_u, c) / Σ_{c′ ∈ Q} φ_v(c′) Π_{u ∈ Γ_v} φ_{uv}(σ_u, c′).

Due to the assumption (1), this marginal distribution is always well defined, and its computation uses only the information of I_v.

Coupling for mixing time. Consider a chain (X_t)_{t=0}^∞ on the space Ω with stationary distribution µ_I for an MRF instance I.
The mixing rate is defined as: for ϵ > 0,

    τ_mix(I, ϵ) ≜ max_{X₀} min{t | d_TV(X_t, µ_I) ≤ ϵ},

where d_TV(X_t, µ_I) denotes the total variation distance between µ_I and the distribution of X_t.

A coupling of a Markov chain is a joint process (X_t, Y_t)_{t≥0} such that (X_t)_{t≥0} and (Y_t)_{t≥0} marginally follow the same transition rule as the original chain. Consider the following type of coupling.

Definition 4.2 (one-step optimal coupling for Gibbs sampling). A coupling (X_t, Y_t)_{t≥0} of Gibbs sampling on an MRF instance I = (V, E, Q, Φ) is a one-step optimal coupling if it is constructed as follows: For t = 1, 2, . . .,
(1) pick the same random v_t ∈ V, and let (X_t(u), Y_t(u)) ← (X_{t−1}(u), Y_{t−1}(u)) for all u ≠ v_t;
(2) sample (X_t(v_t), Y_t(v_t)) from an optimal coupling D^{σ,τ}_{opt,I_{v_t}}(·, ·) of the marginal distributions µ_{v_t}(· | σ) and µ_{v_t}(· | τ), where σ = X_{t−1}(Γ_{v_t}) and τ = Y_{t−1}(Γ_{v_t}).

The coupling D^{σ,τ}_{opt,I_{v_t}}(·, ·) is an optimal coupling of µ_{v_t}(· | σ) and µ_{v_t}(· | τ) that attains the maximum Pr[x = y] over all couplings (x, y) of x ∼ µ_{v_t}(· | σ) and y ∼ µ_{v_t}(· | τ). The coupling D^{σ,τ}_{opt,I_{v_t}}(·, ·) is determined by the local information I_v and σ, τ ∈ Q^{deg(v)}.

With such a coupling, we can establish the following relation between the Dobrushin–Shlosman condition and the rapid mixing of Gibbs sampling [DS85a, DS85b, DS87, BD97, Hay06, DGJ08].
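One coupled step of Definition 4.2 can be sketched as follows. This is a minimal illustrative sketch: the helper names, the dict-based representation of potentials, and the particular way of drawing from the optimal coupling (share the overlap mass, otherwise draw the excesses independently) are assumptions, not the paper's implementation.

```python
import math
import random

def cond_marginal(phi_v, phi_uv, sigma, Q):
    """Conditional marginal at a vertex given neighbor spins sigma, as in (4)."""
    w = {c: math.exp(phi_v[c] + sum(phi_uv[u][(sigma[u], c)] for u in sigma))
         for c in Q}
    Z = sum(w.values())
    return {c: w[c] / Z for c in Q}

def sample_optimal_coupling(p, q, rng):
    """Draw (x, y) with x ~ p, y ~ q maximizing Pr[x = y]: with probability
    sum_c min(p(c), q(c)) output a shared value, otherwise draw x and y
    independently from the normalized excess masses."""
    overlap = {c: min(p[c], q[c]) for c in p}
    if rng.random() < sum(overlap.values()):
        x = rng.choices(list(overlap), weights=list(overlap.values()))[0]
        return x, x
    ex = {c: p[c] - overlap[c] for c in p}
    ey = {c: q[c] - overlap[c] for c in q}
    x = rng.choices(list(ex), weights=list(ex.values()))[0]
    y = rng.choices(list(ey), weights=list(ey.values()))[0]
    return x, y

def coupled_gibbs_step(X, Y, V, nbrs, phi_v, phi_uv, Q, rng):
    """One step of the one-step optimal coupling: the same random vertex v_t
    in both chains, new spins drawn from an optimal coupling of the two
    conditional marginals."""
    v = rng.choice(sorted(V))
    p = cond_marginal(phi_v[v], phi_uv[v], {u: X[u] for u in nbrs[v]}, Q)
    q = cond_marginal(phi_v[v], phi_uv[v], {u: Y[u] for u in nbrs[v]}, Q)
    X[v], Y[v] = sample_optimal_coupling(p, q, rng)

# Toy usage: a single edge u - v; chains started from the same state remain
# identical under the coupling, since then p = q and the overlap mass is 1.
rng = random.Random(0)
V, nbrs, Q = {'u', 'v'}, {'u': ['v'], 'v': ['u']}, (0, 1)
phi_v = {'u': {0: 0.0, 1: 0.0}, 'v': {0: 0.0, 1: 0.0}}
edge = {(a, c): 0.3 * (a == c) for a in Q for c in Q}
phi_uv = {'u': {'v': edge}, 'v': {'u': edge}}
X = {'u': 0, 'v': 1}
Y = dict(X)
coupled_gibbs_step(X, Y, V, nbrs, phi_v, phi_uv, Q, rng)
```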
Proposition 4.3 ([Hay06]). Let I = (V, E, Q, Φ) be an MRF instance with n = |V|, and Ω = Q^V the state space. Let H(σ, τ) ≜ |{v ∈ V | σ_v ≠ τ_v}| denote the Hamming distance between σ ∈ Ω and τ ∈ Ω. If I satisfies the Dobrushin–Shlosman condition (Condition 3.1) with constant δ > 0, then the one-step optimal coupling (X_t, Y_t)_{t≥0} for Gibbs sampling (Definition 4.2) satisfies

    ∀σ, τ ∈ Ω : E[H(X_t, Y_t) | X_{t−1} = σ ∧ Y_{t−1} = τ] ≤ (1 − δ/n) · H(σ, τ),

and thus the mixing rate of Gibbs sampling on I is bounded as τ_mix(I, ϵ) ≤ (n/δ) log(n/ϵ).

5. Outlines of algorithm

Let θ : M → R^K be a probabilistic inference problem that maps each MRF instance in M to a K-dimensional probability vector, and let E_θ be its estimating function. Let I = (V, E, Q, Φ) ∈ M be the current instance, where n = |V|. Our dynamic inference algorithm maintains a sequence of N(n) independent samples X^(1), . . . , X^(N(n)) ∈ Q^V, which are ϵ(n)-close to the Gibbs distribution µ_I in total variation distance, and an (N, ϵ)-estimator θ̂(I) of θ(I) such that

    θ̂(I) = E_θ(X^(1), X^(2), . . . , X^(N(n))).

Upon an update request that modifies I to a new instance I′ = (V′, E′, Q, Φ′) ∈ M, where n′ = |V′|, our algorithm does the following:
• Update the sample sequence. Update X^(1), . . . , X^(N(n)) to a new sequence of N(n′) independent samples Y^(1), . . . , Y^(N(n′)) ∈ Q^{V′} that are ϵ(n′)-close to µ_{I′} in total variation distance, and output the difference between the two sample sequences.
• Update the estimator. Given the difference between the two sample sequences, update θ̂(I) to θ̂(I′) = E_θ(Y^(1), . . . , Y^(N(n′))) by accessing the oracle in Definition 2.3.

Obviously, the updated estimator θ̂(I′) is an (N, ϵ)-estimator for θ(I′).

Our main technical contribution is to give an algorithm that dynamically maintains a sequence of N(n) independent samples for µ_I, while I itself is dynamically changing. The dynamic sampling problem was recently introduced in [FVY19].
The dynamic sampling algorithm given there only handles the update of a single vertex or edge and works only for graphical models with soft constraints. In contrast, our dynamic sampling algorithm maintains a sequence of N(n) independent samples for µ_I within total variation distance ϵ(n), while the entire specification of the graphical model I is subject to dynamic update (to a new I′ with difference d(I, I′) ≤ L = o(n)). Specifically, the algorithm updates the sample sequence within expected time O(∆²N(n)L log³ n + ∆n). Note that the extra O(∆n) cost is necessary just for editing the current MRF instance I to I′, because a single update may change all the vertex and edge potentials simultaneously. This incremental time cost dominates the time cost of the dynamic inference algorithm, and is efficient for maintaining N(n) independent samples, especially when N(n) is sufficiently large, e.g. N(n) = Ω(n/L), in which case the average incremental cost for updating each sample is O(∆²L log³ n + ∆n/N(n)) = O(∆²L log³ n).

We illustrate the main idea by explaining how to maintain one sample. The idea is to represent the trace of the Markov chain for generating the sample by a dynamic data structure, and when the MRF instance is changed, this trace is modified to that of the new chain for generating the sample for the updated instance. This is achieved by both a set of efficient dynamic data structures and a coupling between the two Markov chains.

Specifically, let (X_t)_{t=0}^T be the Gibbs sampler chain for the distribution µ_I. When the chain is rapidly mixing, starting from an arbitrary initial configuration X₀ ∈ Q^V, after suitably many steps X = X_T is an accurate enough sample for µ_I. At each step, X_{t−1} and X_t may differ only at a vertex v_t, which is picked from V uniformly and independently at random.
The evolution of the chain is fully captured by the initial state X₀ and the sequence of pairs ⟨v_t, X_t(v_t)⟩, from t = 1 to t = T, which is called the execution log of the chain in this paper.

Now suppose that the current instance I is updated to I′. We construct a coupling between the original chain (X_t)_{t=0}^T and the new chain (Y_t)_{t=0}^T such that (Y_t)_{t=0}^T is a faithful Gibbs sampling chain for the updated instance I′ given that (X_t)_{t=0}^T is a faithful chain for I, and the difference between