Figure 1.11 An illustration of a distribution over two variables, X, which takes 9 possible values, and Y, which takes two possible values. The top left figure shows a sample of 60 points drawn from a joint probability distribution over these variables. The remaining figures show histogram estimates of the marginal distributions p(X) and p(Y), as well as the conditional distribution p(X|Y = 1) corresponding to the bottom row in the top left figure.

Again, note that these probabilities are normalized so that

    p(F = a|B = r) + p(F = o|B = r) = 1    (1.20)

and similarly

    p(F = a|B = b) + p(F = o|B = b) = 1.    (1.21)

We can now use the sum and product rules of probability to evaluate the overall probability of choosing an apple

    p(F = a) = p(F = a|B = r) p(B = r) + p(F = a|B = b) p(B = b)
             = \frac{1}{4} \times \frac{4}{10} + \frac{3}{4} \times \frac{6}{10} = \frac{11}{20}    (1.22)

from which it follows, using the sum rule, that p(F = o) = 1 − 11/20 = 9/20.
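The calculation in (1.22) is easy to reproduce programmatically. The following is a minimal sketch in Python (the dictionary names and the use of exact fractions are my own choices, not from the text), combining the prior p(B) with the conditionals p(F|B) via the sum and product rules:

```python
# Sketch: marginal probability of the fruit via the sum and product rules,
# using the box-and-fruit numbers quoted in the text (p(B=r)=4/10, p(B=b)=6/10).
from fractions import Fraction

p_box = {"r": Fraction(4, 10), "b": Fraction(6, 10)}          # prior p(B)
p_fruit_given_box = {                                         # conditionals p(F|B)
    "r": {"a": Fraction(1, 4), "o": Fraction(3, 4)},
    "b": {"a": Fraction(3, 4), "o": Fraction(1, 4)},
}

# Sum rule over B of the product rule p(F, B) = p(F|B) p(B), cf. (1.22)
p_apple = sum(p_fruit_given_box[b]["a"] * p_box[b] for b in p_box)
p_orange = 1 - p_apple                                        # sum rule over F

print(p_apple, p_orange)   # 11/20 9/20
```

Exact rational arithmetic makes it straightforward to confirm the values 11/20 and 9/20.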
Suppose instead we are told that a piece of fruit has been selected and it is an orange, and we would like to know which box it came from. This requires that we evaluate the probability distribution over boxes conditioned on the identity of the fruit, whereas the probabilities in (1.16)–(1.19) give the probability distribution over the fruit conditioned on the identity of the box. We can solve the problem of reversing the conditional probability by using Bayes' theorem to give

    p(B = r|F = o) = \frac{p(F = o|B = r) p(B = r)}{p(F = o)} = \frac{3}{4} \times \frac{4}{10} \times \frac{20}{9} = \frac{2}{3}.    (1.23)

From the sum rule, it then follows that p(B = b|F = o) = 1 − 2/3 = 1/3.

We can provide an important interpretation of Bayes' theorem as follows. If we had been asked which box had been chosen before being told the identity of the selected item of fruit, then the most complete information we have available is provided by the probability p(B). We call this the prior probability because it is the probability available before we observe the identity of the fruit. Once we are told that the fruit is an orange, we can then use Bayes' theorem to compute the probability p(B|F), which we shall call the posterior probability because it is the probability obtained after we have observed F. Note that in this example, the prior probability of selecting the red box was 4/10, so that we were more likely to select the blue box than the red one. However, once we have observed that the piece of selected fruit is an orange, we find that the posterior probability of the red box is now 2/3, so that it is now more likely that the box we selected was in fact the red one. This result accords with our intuition, as the proportion of oranges is much higher in the red box than it is in the blue box, and so the observation that the fruit was an orange provides significant evidence favouring the red box. In fact, the evidence is sufficiently strong that it outweighs the prior and makes it more likely that the red box was chosen rather than the blue one.

Finally, we note that if the joint distribution of two variables factorizes into the product of the marginals, so that p(X, Y) = p(X)p(Y), then X and Y are said to be independent. From the product rule, we see that p(Y|X) = p(Y), and so the conditional distribution of Y given X is indeed independent of the value of X. For instance, in our boxes of fruit example, if each box contained the same fraction of apples and oranges, then p(F|B) = p(F), so that the probability of selecting, say, an apple is independent of which box is chosen.
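The reversal in (1.23) amounts to multiplying each conditional by the prior and normalizing. A minimal sketch, reusing the hypothetical p_box and p_fruit_given_box dictionaries from the previous snippet:

```python
# Sketch: reversing the conditional with Bayes' theorem, cf. (1.23).
def posterior_over_boxes(fruit):
    """Return p(B | F = fruit) by normalizing p(F|B) p(B) over the boxes."""
    joint = {b: p_fruit_given_box[b][fruit] * p_box[b] for b in p_box}
    evidence = sum(joint.values())            # p(F = fruit), the denominator in Bayes' theorem
    return {b: joint[b] / evidence for b in joint}

print(posterior_over_boxes("o"))   # {'r': Fraction(2, 3), 'b': Fraction(1, 3)}
```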
1.2.1 Probability densities

As well as considering probabilities defined over discrete sets of events, we also wish to consider probabilities with respect to continuous variables. We shall limit ourselves to a relatively informal discussion. If the probability of a real-valued variable x falling in the interval (x, x + δx) is given by p(x)δx for δx → 0, then p(x) is called the probability density over x. This is illustrated in Figure 1.12. The probability that x will lie in an interval (a, b) is then given by

    p(x \in (a, b)) = \int_a^b p(x) \, dx.    (1.24)

Figure 1.12 The concept of probability for discrete variables can be extended to that of a probability density p(x) over a continuous variable x and is such that the probability of x lying in the interval (x, x + δx) is given by p(x)δx for δx → 0. The probability density can be expressed as the derivative of a cumulative distribution function P(x).

Because probabilities are nonnegative, and because the value of x must lie somewhere on the real axis, the probability density p(x) must satisfy the two conditions

    p(x) \geqslant 0    (1.25)

    \int_{-\infty}^{\infty} p(x) \, dx = 1.    (1.26)

Under a nonlinear change of variable, a probability density transforms differently from a simple function, due to the Jacobian factor. For instance, if we consider a change of variables x = g(y), then a function f(x) becomes \tilde{f}(y) = f(g(y)). Now consider a probability density p_x(x) that corresponds to a density p_y(y) with respect to the new variable y, where the suffices denote the fact that p_x(x) and p_y(y) are different densities. Observations falling in the range (x, x + δx) will, for small values of δx, be transformed into the range (y, y + δy) where p_x(x)δx ≃ p_y(y)δy, and hence

    p_y(y) = p_x(x) \left| \frac{dx}{dy} \right| = p_x(g(y)) |g'(y)|.    (1.27)

One consequence of this property is that the concept of the maximum of a probability density is dependent on the choice of variable (Exercise 1.4).

The probability that x lies in the interval (−∞, z) is given by the cumulative distribution function defined by

    P(z) = \int_{-\infty}^{z} p(x) \, dx    (1.28)

which satisfies P'(x) = p(x), as shown in Figure 1.12.
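As a way of making the change-of-variables formula (1.27) concrete, here is a minimal numerical sketch (my own illustration, not an example from the text): taking p_x(x) to be a standard normal and the change of variables x = g(y) = y^3, samples of y = g^{-1}(x) should be distributed according to p_x(g(y)) |g'(y)|.

```python
# Sketch: numerically checking (1.27) for x = g(y) = y**3 with p_x a standard normal.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)          # samples from p_x(x) = N(x | 0, 1)
y = np.cbrt(x)                              # y = g^{-1}(x), real cube root

def p_x(x):
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def p_y(y):
    # p_y(y) = p_x(g(y)) |g'(y)| with g(y) = y**3, so g'(y) = 3 y**2
    return p_x(y**3) * np.abs(3 * y**2)

hist, edges = np.histogram(y, bins=50, range=(-2, 2), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
# Discrepancy shrinks with more samples and finer bins
print(np.max(np.abs(hist - p_y(centres))))
```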
If we have several continuous variables x_1, ..., x_D, denoted collectively by the vector x, then we can define a joint probability density p(x) = p(x_1, ..., x_D) such that the probability of x falling in an infinitesimal volume δx containing the point x is given by p(x)δx. This multivariate probability density must satisfy

    p(x) \geqslant 0    (1.29)

    \int p(x) \, dx = 1    (1.30)

in which the integral is taken over the whole of x space. We can also consider joint probability distributions over a combination of discrete and continuous variables. Note that if x is a discrete variable, then p(x) is sometimes called a probability mass function because it can be regarded as a set of 'probability masses' concentrated at the allowed values of x.

The sum and product rules of probability, as well as Bayes' theorem, apply equally to the case of probability densities, or to combinations of discrete and continuous variables. For instance, if x and y are two real variables, then the sum and product rules take the form

    p(x) = \int p(x, y) \, dy    (1.31)

    p(x, y) = p(y|x) p(x).    (1.32)

A formal justification of the sum and product rules for continuous variables (Feller, 1966) requires a branch of mathematics called measure theory and lies outside the scope of this book. Its validity can be seen informally, however, by dividing each real variable into intervals of width ∆ and considering the discrete probability distribution over these intervals. Taking the limit ∆ → 0 then turns sums into integrals and gives the desired result.

1.2.2 Expectations and covariances

One of the most important operations involving probabilities is that of finding weighted averages of functions. The average value of some function f(x) under a probability distribution p(x) is called the expectation of f(x) and will be denoted by E[f]. For a discrete distribution, it is given by

    E[f] = \sum_x p(x) f(x)    (1.33)

so that the average is weighted by the relative probabilities of the different values of x. In the case of continuous variables, expectations are expressed in terms of an integration with respect to the corresponding probability density

    E[f] = \int p(x) f(x) \, dx.    (1.34)
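As a minimal sketch of (1.34) (my own illustration: NumPy, a standard normal density, and f(x) = x^2, none of which come from the text), the integral can be approximated on a fine grid over a truncated range:

```python
# Sketch: evaluating the expectation (1.34) by simple numerical quadrature,
# here E[f] with f(x) = x**2 under a standard normal density (the exact answer is 1).
import numpy as np

def p(x):                                   # density p(x) = N(x | 0, 1)
    return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)

def f(x):
    return x**2

grid = np.linspace(-10, 10, 100_001)        # truncate the infinite integration range
dx = grid[1] - grid[0]
expectation = np.sum(p(grid) * f(grid)) * dx
print(expectation)                          # approximately 1.0
```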
In either case, if we are given a finite number N of points drawn from the probability distribution or probability density, then the expectation can be approximated as a finite sum over these points

    E[f] ≃ \frac{1}{N} \sum_{n=1}^{N} f(x_n).    (1.35)

We shall make extensive use of this result when we discuss sampling methods in Chapter 11. The approximation in (1.35) becomes exact in the limit N → ∞.

Sometimes we will be considering expectations of functions of several variables, in which case we can use a subscript to indicate which variable is being averaged over, so that for instance

    E_x[f(x, y)]    (1.36)

denotes the average of the function f(x, y) with respect to the distribution of x. Note that E_x[f(x, y)] will be a function of y.

We can also consider a conditional expectation with respect to a conditional distribution, so that

    E_x[f|y] = \sum_x p(x|y) f(x)    (1.37)

with an analogous definition for continuous variables.

The variance of f(x) is defined by

    var[f] = E[(f(x) − E[f(x)])^2]    (1.38)

and provides a measure of how much variability there is in f(x) around its mean value E[f(x)]. Expanding out the square, we see that the variance can also be written in terms of the expectations of f(x) and f(x)^2 (Exercise 1.5)

    var[f] = E[f(x)^2] − E[f(x)]^2.    (1.39)

In particular, we can consider the variance of the variable x itself, which is given by

    var[x] = E[x^2] − E[x]^2.    (1.40)

For two random variables x and y, the covariance is defined by

    cov[x, y] = E_{x,y}[{x − E[x]}{y − E[y]}] = E_{x,y}[xy] − E[x]E[y]    (1.41)

which expresses the extent to which x and y vary together. If x and y are independent, then their covariance vanishes (Exercise 1.6).

In the case of two vectors of random variables x and y, the covariance is a matrix

    cov[x, y] = E_{x,y}[{x − E[x]}{y^T − E[y^T]}] = E_{x,y}[x y^T] − E[x]E[y^T].    (1.42)

If we consider the covariance of the components of a vector x with each other, then we use a slightly simpler notation cov[x] ≡ cov[x, x].
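A minimal sketch tying (1.35), (1.39), and (1.41) together (the sampling distribution and the construction of y below are my own illustrative choices, not from the text):

```python
# Sketch: the Monte Carlo approximation (1.35) and the identities (1.39) and (1.41),
# checked on samples from a standard normal.
import numpy as np

rng = np.random.default_rng(1)
N = 1_000_000
x = rng.standard_normal(N)

f = x**2
mc_expectation = f.mean()                       # (1.35): (1/N) sum_n f(x_n), close to E[x^2] = 1

var_f = np.mean((f - f.mean())**2)              # (1.38), direct definition
var_f_alt = np.mean(f**2) - f.mean()**2         # (1.39), same value up to rounding

y = 2 * x + rng.standard_normal(N)              # y depends on x, so cov[x, y] is nonzero
cov_xy = np.mean(x * y) - x.mean() * y.mean()   # (1.41), close to 2

print(mc_expectation, var_f, var_f_alt, cov_xy)
```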