Recall that an $(n,d)$-graph is a $d$-regular graph $G$ on $n$ vertices. It is called an $(n,d,\alpha)$-graph if $|\lambda_2(G)|, |\lambda_n(G)| \le \alpha d$, where $d = \lambda_1(G) \ge \cdots \ge \lambda_n(G)$ is the spectrum of $G$. A vector $p \in \mathbb{R}^n$ is called a probability distribution vector if its coordinates are nonnegative and $\sum_{i=1}^{n} p_i = 1$. The probability vector that corresponds to the uniform distribution on $\{1,\ldots,n\}$ is denoted by $u = (1,\ldots,1)/n$. In this section we show that a random walk on the vertices of an expander converges rapidly to the stationary distribution.

Definition 3.1. A random walk on a finite graph $G = (V,E)$ is a discrete-time stochastic process $(X_0, X_1, \ldots)$ taking values in $V$. The vertex $X_0$ is sampled from some initial distribution on $V$, and $X_{i+1}$ is chosen uniformly at random from the neighbors of $X_i$.

If $G$ is a $d$-regular graph with adjacency matrix $A$, then its normalized adjacency matrix is defined as $\hat{A} = \frac{1}{d} A$. Here are some simple comments on this random walk.

• The random walk on $G = (V,E)$ is a Markov chain with state set $V$ and transition matrix $\hat{A}$.
• $\hat{A}$ is real, symmetric, and doubly stochastic; i.e., every column and every row sums to 1.
• If $\hat{\lambda}_1 \ge \cdots \ge \hat{\lambda}_n$ are the eigenvalues of $\hat{A}$, then $\hat{\lambda}_1 = 1$ and $\max\{|\hat{\lambda}_2|, |\hat{\lambda}_n|\} \le \alpha$. The corresponding eigenvectors are the same as the eigenvectors of $A$.
• Consider an experiment where we sample a vertex $x$ from some probability distribution $p$ on $V$ and then move to a random neighbor of $x$. This is equivalent to sampling a vertex from the distribution $\hat{A} p$.
• The matrix $\hat{A}^t$ is the transition matrix of the Markov chain defined by random walks of length $t$. In other words, $(\hat{A}^t)_{ij}$ is the probability that a random walk starting at $i$ is at $j$ after $t$ steps.
• The stationary distribution of the random walk on $G$ is the uniform distribution, namely $u \hat{A} = \hat{A} u = u$. (This uses the symmetry of $\hat{A}$.)

3.1.1. Convergence in the $l_1$ and $l_2$ norms. The inner product of $x, y \in \mathbb{R}^n$ is denoted by $\langle x, y \rangle = \sum_{i=1}^{n} x_i y_i$. The $l_1$, $l_2$ and $l_\infty$ norms are denoted as usual by
• $\|x\|_1 = \sum_{i=1}^{n} |x_i|$,
• $\|x\|_2 = \sqrt{\langle x, x \rangle} = \sqrt{\sum_{i=1}^{n} x_i^2}$,
• $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$.

We now observe that if $G$ is an $(n,d,\alpha)$-graph and $\alpha < 1$, then regardless of the initial distribution $p$, the random walk converges in $l_1$ exponentially fast to its limit (uniform) distribution. This will follow (via Cauchy-Schwarz) from a similar bound in $l_2$, which in turn follows from the fact that in $l_2$ the distance to the uniform distribution shrinks by a factor of $\alpha$ at each step.

Theorem 3.2. Let $G$ be an $(n,d,\alpha)$-graph with normalized adjacency matrix $\hat{A}$. Then for any distribution vector $p$ and any positive integer $t$:
$$\|\hat{A}^t p - u\|_1 \le \sqrt{n} \cdot \alpha^t.$$

Why use the $l_1$ norm to measure the distance between two probability distributions $p, q$? A natural and commonly used metric is the total variation distance $\max_B |\Pr_p[B] - \Pr_q[B]|$, and it is not difficult to check that this equals $\frac{1}{2} \|p - q\|_1$. In other words, if the $l_1$ distance is small, then the two distributions $p$ and $q$ assign nearly equal probabilities to every event in the probability space.
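The bound of Theorem 3.2 is easy to check numerically. The sketch below, which we add as an illustration and which is not part of the original development, builds a $2k$-regular multigraph on $n$ vertices as a union of $k$ random permutations and their inverses (such graphs are typically good expanders), reads $\alpha$ off the spectrum of $\hat{A}$, and tracks $\|\hat{A}^t p - u\|_1$ for a worst-case (point mass) starting distribution. The construction, the parameters, and the names are our own illustrative choices; only numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative d-regular multigraph on n vertices: the union of k random
# permutations and their inverses, so d = 2k.  Such graphs are typically
# good expanders, but we simply compute alpha numerically below.
n, k = 500, 10
A = np.zeros((n, n))
for _ in range(k):
    pi = rng.permutation(n)
    A[np.arange(n), pi] += 1      # edges v -> pi(v)
    A[pi, np.arange(n)] += 1      # edges v -> pi^{-1}(v)
d = 2 * k
A_hat = A / d                     # normalized adjacency matrix

eigs = np.sort(np.linalg.eigvalsh(A_hat))[::-1]
alpha = max(abs(eigs[1]), abs(eigs[-1]))
print(f"alpha = {alpha:.3f}")

u = np.ones(n) / n
p = np.zeros(n)
p[0] = 1.0                        # start from a point mass
for t in range(1, 9):
    p = A_hat @ p                 # one step of the walk: p becomes A_hat^t p
    print(f"t={t}: ||A^t p - u||_1 = {np.abs(p - u).sum():.2e}"
          f"  <=  sqrt(n)*alpha^t = {np.sqrt(n) * alpha ** t:.2e}")
```

For small $t$ the right-hand side exceeds the trivial bound $\|\hat{A}^t p - u\|_1 \le 2$, but once $\sqrt{n}\,\alpha^t < 2$ the theorem gives genuine, geometrically improving information.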
Theorem 3.2 follows immediately from the analogous $l_2$ bound below.

Theorem 3.3. Let $G$ be an $(n,d,\alpha)$-graph with normalized adjacency matrix $\hat{A}$. Then for any distribution vector $p$ and any positive integer $t$:
$$\|\hat{A}^t p - u\|_2 \le \|p - u\|_2 \, \alpha^t \le \alpha^t.$$

Obviously it suffices to prove this bound for $t = 1$ (shrinkage per step) and use induction.

Lemma 3.4. For every probability vector $p$, $\|\hat{A} p - u\|_2 \le \alpha \|p - u\|_2 \le \alpha$.

Proof. The uniform distribution $u$ is invariant under the action of $\hat{A}$. Also, $p - u$ is orthogonal to $u$ and thus shrinks in $l_2$-norm by a factor $\alpha$ under the action of $\hat{A}$. Consequently
$$\|\hat{A} p - u\|_2 = \|\hat{A}(p - u)\|_2 \le \alpha \|p - u\|_2 \le \alpha,$$
where the last inequality follows easily from the fact that $p$ is a probability distribution. □

3.1.2. Convergence in entropy. Another important perspective on a random walk is offered by the entropy of the associated probability distributions. The entropy of a probability distribution is a fundamental concept in the theory of communication, capturing the amount of "information", or alternatively "uncertainty", that it carries. When we take a random step, we "inject" more randomness into our distribution, indeed precisely the $\log d$ random bits that are needed to specify which of the $d$ neighbors of the current vertex we move to next. One expects this injection to increase the amount of "randomness" in the distribution, namely its entropy. This is indeed always true in every regular graph, and expanders are those graphs for which the increase is significant.

This entropy viewpoint will be extremely important when we explain the zig-zag product and its use in combinatorial constructions of various expanders in Sections 9 and 10. In the same way that different norms capture different aspects of probability distributions, there are several variations on the theme of entropy that do this. Let $[n]$ denote the set of integers $\{1,\ldots,n\}$. Then for a probability distribution $p$ on $[n]$ we define:
• Shannon entropy: $H(p) = -\sum_{i=1}^{n} p_i \log(p_i)$.
• Rényi 2-entropy: $H_2(p) = -2 \log(\|p\|_2)$.
• Min-entropy: $H_\infty(p) = -\log(\|p\|_\infty)$.

To see the connection between the last two quantities, note that if $p$ is a probability distribution on $[n]$, then $\max_i p_i \ge \sum_i p_i^2 \ge \max_i p_i^2$. It follows that:

Proposition 3.5. $H_\infty(p) \le H_2(p) \le 2 H_\infty(p)$.

Here are some simple and useful properties that are common to all three, which the reader is invited to verify. As above, $p$ is a probability distribution on an $n$-element set, and we denote a "generic" entropy by $\tilde{H}$.
• $\tilde{H}(p) \ge 0$, with equality iff the distribution is concentrated on a single element.
• $\tilde{H}(p) \le \log n$, with equality iff the distribution is uniform.
• For any doubly stochastic matrix $X$ (a nonnegative matrix whose row and column sums are one), $\tilde{H}(X p) \ge \tilde{H}(p)$. Equality holds iff $p$ is uniform.
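The following small sketch, added here as an illustration and not part of the original text, computes all three entropies along a lazy random walk on a cycle, whose transition matrix is doubly stochastic. It checks Proposition 3.5 at every step and displays the entropies increasing toward $\log n$, as the properties above assert. Logarithms are taken to base 2, and the graph, its size, and the helper names are our own arbitrary choices.

```python
import numpy as np

def shannon_entropy(p):
    q = p[p > 0]
    return -(q * np.log2(q)).sum()

def renyi2_entropy(p):
    return -2 * np.log2(np.linalg.norm(p, 2))

def min_entropy(p):
    return -np.log2(p.max())

# A toy doubly stochastic matrix: one step of the lazy random walk on an n-cycle
# (stay put with probability 1/2, otherwise move to a uniformly random neighbor).
n = 8
X = np.zeros((n, n))
for i in range(n):
    X[i, i] = 0.5
    X[i, (i + 1) % n] = 0.25
    X[i, (i - 1) % n] = 0.25

p = np.zeros(n)
p[0] = 1.0                       # point mass: all three entropies start at 0
for step in range(6):
    H, H2, Hmin = shannon_entropy(p), renyi2_entropy(p), min_entropy(p)
    assert Hmin <= H2 + 1e-9 and H2 <= 2 * Hmin + 1e-9   # Proposition 3.5
    print(f"step {step}: H = {H:.3f}, H2 = {H2:.3f}, Hmin = {Hmin:.3f}")
    p = X @ p                    # a doubly stochastic step never decreases entropy
```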
The last item of the list above shows that entropy increases with every step of the random walk on a regular graph. Making this quantitative depends on the choice of entropy measure. Below we do so for the Rényi 2-entropy in terms of the spectral bound $\alpha$, which (not surprisingly) is just a restatement of the $l_2$ bound from the previous section. However, as noted above, $H_2$ and $H_\infty$ are very close to each other, and it is the latter we use in Section 10, so this interpretation will be important for us.

Before doing so, we remark that for the Shannon entropy $H$, the precise relation between the increase in $H$ and the spectral constant $\alpha$ is still unknown. However, one can define an analogous "entropy constant" which governs the increase "per step" in entropy. It is called the Log-Sobolev constant, and there are known quantitative relations between it and the spectral constant (much like the relations between edge expansion and the spectral constant of the previous section). Using the Log-Sobolev constant to analyze the mixing time of random walks is a powerful method, but it is beyond the scope of this survey. For more on this, see e.g. [MT].

Let us write the distribution as $p = u + f$, where $f \perp u$. We let $\mu$ capture how close $p$ is to the uniform distribution, via $\mu = \|f\|_2 / \|p\|_2 \le 1$ (e.g. $\mu = 0$ iff $p$ is uniform). Then
$$\|\hat{A} p\|_2^2 = \|u + \hat{A} f\|_2^2 = \|u\|_2^2 + \|\hat{A} f\|_2^2 \le \bigl((1 - \mu^2) + \alpha^2 \mu^2\bigr) \, \|p\|_2^2.$$
Hence
$$H_2(\hat{A} p) \ge H_2(p) - \log\bigl((1 - \mu^2) + \alpha^2 \mu^2\bigr) = H_2(p) - \log\bigl(1 - (1 - \alpha^2) \mu^2\bigr).$$
It follows that the 2-entropy never decreases and is, in fact, strictly increasing as long as the distribution $p$ is not uniform. It is also clear that for better expanders (i.e., for smaller $\alpha$) the 2-entropy grows faster.

3.2. Random walks resemble independent sampling. In the sequel, we imagine an abstract sampling problem in which an unknown set $B$ in a universe of size $n$ is "bad" in some sense, and we try to sample the universe so as to avoid the bad set as much as possible. Our task will be to do so while minimizing the number of random bits used. In a motivating example we saw already that the set $B$ includes all the bad random choices for a probabilistic algorithm, namely, those choices for which it gives the wrong answer. We now describe the advantages of imposing, out of the blue, an expander graph structure on the universe. Using it, we can choose a small sample using a random walk on the graph. Remarkably, the statistics of hitting $B$ with such a (highly dependent) sample will be very close to those of a completely independent sample (provided we pick the degree and expansion of the graph appropriately).

Suppose that we are given an $(n,d,\alpha)$-graph $G = (V,E)$ where the vertices in some subset $B \subseteq V$ are "bad". All we know about the set $B$ is its cardinality $|B| = \beta n$. We wish to sample at least one vertex outside of $B$. We can certainly sample, uniformly at random, $t+1$ vertices $x_0, \ldots, x_t$ from $V$, and fail with probability $\Pr[\forall i \; x_i \in B] \le \beta^{t+1}$. This approach uses $(t+1) \log n$ random bits, and we will show that a similar performance can be achieved with substantially fewer random bits: namely, if we choose a random starting vertex and carry out a random walk of length $t$ from it, then our chance of failure, i.e., the probability that the whole random walk is confined to $B$, is exponentially small in $t$ as well.
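Before making this precise, here is a small Monte Carlo sketch that we add purely as an illustration; the random multigraph, the density of $B$, the trial counts, and the helper names are all our own arbitrary choices. It estimates the probability that an entire length-$t$ walk stays inside a bad set of density $1/2$: the estimate decays geometrically in $t$, though with a worse base than the $\beta^{t+1}$ of independent sampling, while the walk spends only about $\log n + t \log d$ random bits instead of $(t+1) \log n$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative 2k-regular multigraph on n vertices from k random permutations,
# stored as an (n, d) table of neighbors (with multiplicity).
n, k = 400, 8
d = 2 * k
nbr = np.empty((n, d), dtype=int)
for j in range(k):
    pi = rng.permutation(n)
    nbr[:, 2 * j] = pi                  # the edge v -> pi(v)
    nbr[:, 2 * j + 1] = np.argsort(pi)  # the edge v -> pi^{-1}(v)

bad = np.zeros(n, dtype=bool)
bad[: n // 2] = True                    # the "bad" set B, of density beta = 1/2
beta = bad.mean()

def confinement_probability(t, trials=200_000):
    """Monte Carlo estimate of the probability that X_0, ..., X_t all lie in B."""
    pos = rng.integers(n, size=trials)              # X_0: log2(n) random bits per walk
    confined = bad[pos]
    for _ in range(t):                              # each step: only log2(d) random bits
        pos = nbr[pos, rng.integers(d, size=trials)]
        confined &= bad[pos]
    return confined.mean()

for t in (2, 4, 6, 8, 10):
    print(f"t={t:2d}: walk confined to B with prob ~{confinement_probability(t):.4f}"
          f"   vs independent sampling: beta^(t+1) = {beta ** (t + 1):.1e}")

t = 10
print(f"random bits for t={t}: walk ~{np.log2(n) + t * np.log2(d):.0f},"
      f" independent ~{(t + 1) * np.log2(n):.0f}")
```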
To get started, let us reinterpret the Expander Mixing Lemma as the case $t = 1$ of this approach. Recall that the lemma says that
$$\left| \frac{d |S| |T|}{n} - |E(S,T)| \right| \le \alpha d \sqrt{|S| |T|} \le \alpha d n$$
for every two subsets $S, T \subseteq V(G)$ of an $(n,d,\alpha)$-graph $G$. It is useful to rewrite this as
$$\left| \frac{|S| |T|}{n^2} - \frac{|E(S,T)|}{d n} \right| \le \alpha.$$
Now consider two experiments in which we sample an ordered pair of vertices $(i,j)$ in $G$ and consider it a success if $i \in S$ and $j \in T$. In the first version of this experiment both vertices are selected uniformly at random from $V = V(G)$, so our probability of success is $|S| |T| / n^2$. In the second version we randomly pick $i \in V$, and then $j$ is selected uniformly among the neighbors of $i$. In this version of the experiment, the probability of success is $|E(S,T)| / dn$. The Expander Mixing Lemma says that despite the different nature of the sample spaces in the two experiments, their success probabilities differ only by a small constant $\alpha$. We now turn to the complete argument that pertains to longer random walks.

Let $G = (V,E)$ be an $(n,d,\alpha)$-graph, and $B \subset V$ of cardinality $|B| = \beta n$. We carry out the following experiment: We pick $X_0 \in V$ uniformly at random and start from it a random walk $X_0, \ldots, X_t$ on $G$. Denote by $(B,t)$ the event that this random walk is confined to $B$, i.e. that $\forall i \; X_i \in B$.

Theorem 3.6 (Ajtai-Komlós-Szemerédi [AKS87], Alon-Feige-Wigderson-Zuckerman [AFWZ95]). Let $G$ be an $(n,d,\alpha)$-graph and $B \subset V$ with $|B| = \beta n$. Then the probability of the event $(B,t)$ is bounded by
$$\Pr[(B,t)] \le (\beta + \alpha)^t.$$

Let $P = P_B$ be the orthogonal projection onto the subspace of coordinates belonging to $B$. In matrix notation, $P_{ij} = 1$ if $i = j \in B$ and $0$ otherwise. We need the following simple observation:

Lemma 3.7. The probability of the event $(B,t)$ is given by $\Pr[(B,t)] = \|(P \hat{A})^t P u\|_1$.

Proof. To calculate the $(x,y)$ entry of the matrix $\hat{A}^t$, we should sum the probabilities of walks of length $t$ that start at vertex $x$ and end at $y$. In the same calculation for the matrix $(P \hat{A})^t P$, only walks confined to $B$ are to be considered, since all other contributions are eliminated by the matrices $P$ in the product. But $\|(P \hat{A})^t P u\|_1$ is just $\frac{1}{n}$ of the total sum of these entries, and the conclusion follows. □
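Lemma 3.7 also makes $\Pr[(B,t)]$ easy to compute exactly, so Theorem 3.6 can be tested directly. The sketch below is an illustration we add; the multigraph construction (the same illustrative one used above), the density $1/4$ for $B$, and all names are our assumptions. It evaluates $\|(P \hat{A})^t P u\|_1$ and compares it with $(\beta + \alpha)^t$, where $\alpha$ is read off the spectrum of $\hat{A}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# The same illustrative 2k-regular multigraph, now as a dense matrix.
n, k = 200, 10
A = np.zeros((n, n))
for _ in range(k):
    pi = rng.permutation(n)
    A[np.arange(n), pi] += 1
    A[pi, np.arange(n)] += 1
A_hat = A / (2 * k)

eigs = np.sort(np.linalg.eigvalsh(A_hat))[::-1]
alpha = max(abs(eigs[1]), abs(eigs[-1]))

B = np.arange(n // 4)              # a bad set of density beta = 1/4
beta = len(B) / n
P = np.zeros((n, n))
P[B, B] = 1                        # orthogonal projection onto the coordinates in B

u = np.ones(n) / n
v = P @ u                          # v = P u
print(f"alpha = {alpha:.3f}, beta = {beta}")
for t in range(1, 9):
    v = P @ (A_hat @ v)            # now v = (P A_hat)^t P u
    exact = v.sum()                # Lemma 3.7: this l1 norm equals Pr[(B,t)]
    print(f"t={t}: Pr[(B,t)] = {exact:.5f}  <=  (beta+alpha)^t = {(beta + alpha) ** t:.5f}")
```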
We also need the following lemma.

Lemma 3.8. For any vector $v$,
$$\|P \hat{A} P v\|_2 \le (\beta + \alpha) \cdot \|v\|_2.$$

Proof. First note that there is no loss in assuming that $v$ is supported on $B$, i.e. $P v = v$. Otherwise we may replace $v$ by $P v$; this leaves the left-hand side unchanged and does not increase the right-hand side, since $P$ is a contraction in $l_2$. Similarly, we may assume that $v$ is nonnegative. Also, by linearity of both sides we may assume that $\sum_i v_i = 1$, and so $v$ can be expressed as $P v = v = u + z$, where $z$ is orthogonal to $u$. It follows that
$$P \hat{A} P v = P \hat{A} u + P \hat{A} z = P u + P \hat{A} z,$$
and hence
$$\|P \hat{A} P v\|_2 \le \|P u\|_2 + \|P \hat{A} z\|_2.$$
We now prove that $\|P u\|_2 \le \beta \cdot \|v\|_2$ and $\|P \hat{A} z\|_2 \le \alpha \cdot \|v\|_2$, which together imply the claim.

Since $\sum_i v_i = 1$ and the support of $v$ has at most $\beta n$ coordinates, Cauchy-Schwarz yields $1 = \sum_i v_i \le \sqrt{\beta n} \cdot \|v\|_2$. But since $\|P u\|_2 = \sqrt{\beta / n}$, we obtain $\|P u\|_2 \le \beta \cdot \|v\|_2$.

As for the other term, $\|\hat{A} z\|_2 \le \alpha \|z\|_2$, since $z$ is orthogonal to $u$ and is therefore a combination of eigenvectors of $\hat{A}$ with eigenvalues of absolute value at most $\alpha$. But $\|P \hat{A} z\|_2 \le \|\hat{A} z\|_2$, since $P$ is a contraction in $l_2$. Also, $v$ is the sum of $z$ and the orthogonal vector $u$, whence $\|z\|_2 \le \|v\|_2$. It follows that $\|P \hat{A} z\|_2 \le \alpha \cdot \|v\|_2$, as needed. □

Now we use the two lemmas to prove Theorem 3.6.

Proof (of Theorem 3.6). Since $P^2 = P$, we have $(P \hat{A})^t P = (P \hat{A} P)^t$, and therefore
$$\|(P \hat{A})^t P u\|_1 \le \sqrt{n} \cdot \|(P \hat{A})^t P u\|_2 = \sqrt{n} \cdot \|(P \hat{A} P)^t u\|_2 \le \sqrt{n} \cdot (\beta + \alpha)^t \|u\|_2 = (\beta + \alpha)^t. \qquad \Box$$

The probability that $t+1$ uniformly and independently sampled vertices all land in a set $B$ of density $\beta$ is $\beta^{t+1}$. Is it true that $\Pr[(B,t)]$ is very close to this bound for a sufficiently good expander? Theorem 3.6 falls a factor of $\beta$ short of this bound (as we have $t+1$ sample points). It also does not provide a comparable lower bound. The following result, which we do not prove here, fills this gap.

Theorem 3.9 ([AFWZ95]). If $\beta > 6\alpha$, then
$$\beta \cdot (\beta + 2\alpha)^t \ge \Pr[(B,t)] \ge \beta \cdot (\beta - 2\alpha)^t.$$

It is also possible to derive "time dependent" versions of the upper bound through simple adaptations of the proof of Theorem 3.6. This will be useful for us a little later.

Theorem 3.10. For every subset $K \subset \{0,\ldots,t\}$ and every vertex subset $B$ of density $\beta$,
$$\Pr[X_i \in B \text{ for all } i \in K] \le (\beta + \alpha)^{|K| - 1}.$$

Occasionally we have to deal with a situation where the excluded set varies between time steps. This is addressed in the following theorem:

Theorem 3.11. Let $B_0, \ldots, B_t$ be vertex sets of densities $\beta_0, \ldots, \beta_t$ in an $(n,d,\alpha)$-graph $G$. Let $X_0, \ldots, X_t$ be a random walk on $G$. Then
$$\Pr[X_i \in B_i \text{ for all } i] \le \prod_{i=0}^{t-1} \bigl( \sqrt{\beta_i \beta_{i+1}} + \alpha \bigr).$$

This follows by a small adaptation of the previous arguments: let $P_i$ be the projection onto $B_i$, and note that $\Pr[X_i \in B_i \text{ for all } i] = \|\prod_{i=1}^{t} (P_i \hat{A}) P_0 u\|_1$ and that
$$\|P_{i+1} \hat{A} P_i v\|_2 \le \bigl( \sqrt{\beta_i \beta_{i+1}} + \alpha \bigr) \cdot \|v\|_2.$$
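Theorem 3.11 can be checked in the same exact manner via the matrix identity just stated. The following sketch is, again, an illustration we add, with the graph construction and all parameters chosen arbitrarily: it draws excluded sets $B_0, \ldots, B_t$ of varying densities, computes $\|\prod_{i=1}^{t} (P_i \hat{A}) P_0 u\|_1$, and compares it with $\prod_{i=0}^{t-1} (\sqrt{\beta_i \beta_{i+1}} + \alpha)$.

```python
import numpy as np

rng = np.random.default_rng(3)

# The same illustrative 2k-regular multigraph as in the earlier sketches.
n, k, t = 200, 10, 6
A = np.zeros((n, n))
for _ in range(k):
    pi = rng.permutation(n)
    A[np.arange(n), pi] += 1
    A[pi, np.arange(n)] += 1
A_hat = A / (2 * k)
eigs = np.sort(np.linalg.eigvalsh(A_hat))[::-1]
alpha = max(abs(eigs[1]), abs(eigs[-1]))

# Excluded sets B_0, ..., B_t of varying densities, and their projections P_i.
sizes = rng.integers(n // 5, 2 * n // 5, size=t + 1)
Bs = [rng.choice(n, size=int(s), replace=False) for s in sizes]
betas = [len(b) / n for b in Bs]
Ps = []
for b in Bs:
    Pi = np.zeros((n, n))
    Pi[b, b] = 1                  # projection onto the coordinates in B_i
    Ps.append(Pi)

u = np.ones(n) / n
v = Ps[0] @ u
bound = 1.0
for i in range(t):
    v = Ps[i + 1] @ (A_hat @ v)                        # multiply by P_{i+1} A_hat
    bound *= np.sqrt(betas[i] * betas[i + 1]) + alpha  # the factor from Theorem 3.11
exact = v.sum()                                        # = || prod_i (P_i A_hat) P_0 u ||_1
print(f"alpha = {alpha:.3f}, densities = {[round(b, 2) for b in betas]}")
print(f"Pr[X_i in B_i for all i] = {exact:.6f}  <=  bound {bound:.6f}")
```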