1.2.3 Bayesian probabilities

So far in this chapter, we have viewed probabilities in terms of the frequencies of random, repeatable events. We shall refer to this as the classical or frequentist interpretation of probability. Now we turn to the more general Bayesian view, in which probabilities provide a quantification of uncertainty.

Consider an uncertain event, for example whether the moon was once in its own orbit around the sun, or whether the Arctic ice cap will have disappeared by the end of the century. These are not events that can be repeated numerous times in order to define a notion of probability as we did earlier in the context of boxes of fruit. Nevertheless, we will generally have some idea, for example, of how quickly we think the polar ice is melting. If we now obtain fresh evidence, for instance from a new Earth observation satellite gathering novel forms of diagnostic information, we may revise our opinion on the rate of ice loss. Our assessment of such matters will affect the actions we take, for instance the extent to which we endeavour to reduce the emission of greenhouse gases. In such circumstances, we would like to be able to quantify our expression of uncertainty and make precise revisions of uncertainty in the light of new evidence, as well as subsequently to be able to take optimal actions or decisions as a consequence. This can all be achieved through the elegant, and very general, Bayesian interpretation of probability.

The use of probability to represent uncertainty, however, is not an ad hoc choice, but is inevitable if we are to respect common sense while making rational, coherent inferences. For instance, Cox (1946) showed that if numerical values are used to represent degrees of belief, then a simple set of axioms encoding common-sense properties of such beliefs leads uniquely to a set of rules for manipulating degrees of belief that are equivalent to the sum and product rules of probability. This provided the first rigorous proof that probability theory could be regarded as an extension of Boolean logic to situations involving uncertainty (Jaynes, 2003). Numerous other authors have proposed different sets of properties or axioms that such measures of uncertainty should satisfy (Ramsey, 1931; Good, 1950; Savage, 1961; de Finetti, 1970; Lindley, 1982). In each case, the resulting numerical quantities behave precisely according to the rules of probability. It is therefore natural to refer to these quantities as (Bayesian) probabilities.

Thomas Bayes (1701–1761) was born in Tunbridge Wells and was a clergyman as well as an amateur scientist and a mathematician. He studied logic and theology at Edinburgh University and was elected Fellow of the Royal Society in 1742. During the 18th century, issues regarding probability arose in connection with gambling and with the new concept of insurance. One particularly important problem concerned so-called inverse probability. A solution was proposed by Thomas Bayes in his paper 'Essay towards solving a problem in the doctrine of chances', which was published in 1764, some three years after his death, in the Philosophical Transactions of the Royal Society. In fact, Bayes only formulated his theory for the case of a uniform prior, and it was Pierre-Simon Laplace who independently rediscovered the theory in general form and who demonstrated its broad applicability.
In the field of pattern recognition, too, it is helpful to have a more general notion of probability. Consider the example of polynomial curve fitting discussed in Section 1.1. It seems reasonable to apply the frequentist notion of probability to the random values of the observed variables t_n. However, we would like to address and quantify the uncertainty that surrounds the appropriate choice for the model parameters w. We shall see that, from a Bayesian perspective, we can use the machinery of probability theory to describe the uncertainty in model parameters such as w, or indeed in the choice of model itself.

Bayes' theorem now acquires a new significance. Recall that in the boxes of fruit example, the observation of the identity of the fruit provided relevant information that altered the probability that the chosen box was the red one. In that example, Bayes' theorem was used to convert a prior probability into a posterior probability by incorporating the evidence provided by the observed data. As we shall see in detail later, we can adopt a similar approach when making inferences about quantities such as the parameters w in the polynomial curve fitting example. We capture our assumptions about w, before observing the data, in the form of a prior probability distribution p(w). The effect of the observed data D = {t_1, ..., t_N} is expressed through the conditional probability p(D|w), and we shall see later, in Section 1.2.5, how this can be represented explicitly. Bayes' theorem, which takes the form

$$ p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})} \tag{1.43} $$

then allows us to evaluate the uncertainty in w after we have observed D in the form of the posterior probability p(w|D).

The quantity p(D|w) on the right-hand side of Bayes' theorem is evaluated for the observed data set D and can be viewed as a function of the parameter vector w, in which case it is called the likelihood function. It expresses how probable the observed data set is for different settings of the parameter vector w. Note that the likelihood is not a probability distribution over w, and its integral with respect to w does not (necessarily) equal one.

Given this definition of likelihood, we can state Bayes' theorem in words

$$ \text{posterior} \propto \text{likelihood} \times \text{prior} \tag{1.44} $$

where all of these quantities are viewed as functions of w. The denominator in (1.43) is the normalization constant, which ensures that the posterior distribution on the left-hand side is a valid probability density and integrates to one. Indeed, integrating both sides of (1.43) with respect to w, we can express the denominator in Bayes' theorem in terms of the prior distribution and the likelihood function

$$ p(\mathcal{D}) = \int p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})\, \mathrm{d}\mathbf{w}. \tag{1.45} $$
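To make (1.43)–(1.45) concrete, the following sketch evaluates a posterior numerically on a grid for a one-dimensional parameter w. The specific model — data assumed Gaussian around an unknown mean w with known noise variance, together with a Gaussian prior on w — is an illustrative assumption (the likelihood actually used for curve fitting is given in Section 1.2.5), and the data values are made up; the point is only that the posterior is proportional to likelihood times prior, with the normalizer approximated by a discrete version of (1.45).

```python
import numpy as np

def grid_posterior(data, w_grid, prior, noise_var=1.0):
    """Return the normalized posterior p(w|D) on w_grid, cf. (1.43)-(1.45).

    Illustrative model: each data point t_n is assumed Gaussian with mean w and
    variance noise_var; `prior` is p(w) evaluated on w_grid.
    """
    # Log-likelihood of the whole data set for every w on the grid (log space for stability).
    log_lik = np.zeros_like(w_grid)
    for t in data:
        log_lik += -0.5 * np.log(2 * np.pi * noise_var) - 0.5 * (t - w_grid) ** 2 / noise_var
    unnorm = np.exp(log_lik - log_lik.max()) * prior   # posterior up to a constant, cf. (1.44)
    dw = w_grid[1] - w_grid[0]
    evidence = np.sum(unnorm) * dw                     # discrete approximation of (1.45)
    return unnorm / evidence                           # normalized posterior, cf. (1.43)

if __name__ == "__main__":
    w_grid = np.linspace(-5.0, 5.0, 2001)
    prior = np.exp(-0.5 * w_grid ** 2) / np.sqrt(2 * np.pi)   # an arbitrary Gaussian prior on w
    data = [0.9, 1.4, 1.1]                                    # a tiny, made-up data set
    posterior = grid_posterior(data, w_grid, prior)
    dw = w_grid[1] - w_grid[0]
    print("posterior integrates to", np.sum(posterior) * dw)  # ~1, as required
    print("posterior mean of w    :", np.sum(w_grid * posterior) * dw)
```

The same structure carries over when w is a vector of parameters; only the grid becomes impractical in higher dimensions, which is one motivation for the sampling and deterministic approximation methods discussed later in this section.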
In both the Bayesian and frequentist paradigms, the likelihood function p(D|w) plays a central role. However, the manner in which it is used is fundamentally different in the two approaches. In a frequentist setting, w is considered to be a fixed parameter, whose value is determined by some form of 'estimator', and error bars on this estimate are obtained by considering the distribution of possible data sets D. By contrast, from the Bayesian viewpoint there is only a single data set D (namely the one that is actually observed), and the uncertainty in the parameters is expressed through a probability distribution over w.

A widely used frequentist estimator is maximum likelihood, in which w is set to the value that maximizes the likelihood function p(D|w). This corresponds to choosing the value of w for which the probability of the observed data set is maximized. In the machine learning literature, the negative log of the likelihood function is called an error function. Because the negative logarithm is a monotonically decreasing function, maximizing the likelihood is equivalent to minimizing the error.

One approach to determining frequentist error bars is the bootstrap (Efron, 1979; Hastie et al., 2001), in which multiple data sets are created as follows. Suppose our original data set consists of N data points X = {x_1, ..., x_N}. We can create a new data set X_B by drawing N points at random from X, with replacement, so that some points in X may be replicated in X_B, whereas other points in X may be absent from X_B. This process can be repeated L times to generate L data sets, each of size N and each obtained by sampling from the original data set X. The statistical accuracy of parameter estimates can then be evaluated by looking at the variability of predictions between the different bootstrap data sets.

One advantage of the Bayesian viewpoint is that the inclusion of prior knowledge arises naturally. Suppose, for instance, that a fair-looking coin is tossed three times and lands heads each time. A classical maximum likelihood estimate of the probability of landing heads would give 1 (see Section 2.1), implying that all future tosses will land heads! By contrast, a Bayesian approach with any reasonable prior will lead to a much less extreme conclusion.

There has been much controversy and debate associated with the relative merits of the frequentist and Bayesian paradigms, which have not been helped by the fact that there is no unique frequentist, or even Bayesian, viewpoint. For instance, one common criticism of the Bayesian approach is that the prior distribution is often selected on the basis of mathematical convenience rather than as a reflection of any prior beliefs. Even the subjective nature of the conclusions through their dependence on the choice of prior is seen by some as a source of difficulty. Reducing the dependence on the prior is one motivation for so-called noninformative priors (Section 2.4.3). However, these lead to difficulties when comparing different models, and indeed Bayesian methods based on poor choices of prior can give poor results with high confidence. Frequentist evaluation methods offer some protection from such problems, and techniques such as cross-validation (Section 1.3) remain useful in areas such as model comparison.

This book places a strong emphasis on the Bayesian viewpoint, reflecting the huge growth in the practical importance of Bayesian methods in the past few years, while also discussing useful frequentist concepts as required.
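Returning to the bootstrap procedure described above, the following sketch resamples a data set with replacement L times and uses the spread of the resulting estimates as an error bar. The choice of estimator (here simply the sample mean) and the synthetic data set are assumptions made purely for illustration; the text leaves the estimator unspecified.

```python
import numpy as np

def bootstrap_estimates(x, estimator, L=1000, rng=None):
    """Evaluate `estimator` on L bootstrap data sets, each drawn from x with replacement."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(x)
    estimates = np.empty(L)
    for b in range(L):
        indices = rng.integers(0, n, size=n)   # draw N indices with replacement:
        estimates[b] = estimator(x[indices])   # some points repeated, others absent
    return estimates

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(loc=1.0, scale=2.0, size=50)          # a made-up data set of N = 50 points
    est = bootstrap_estimates(x, np.mean, L=2000, rng=rng)
    print(f"estimate = {x.mean():.3f} +/- {est.std():.3f} (bootstrap error bar)")
    # For the sample mean, this spread should be close to the familiar sigma / sqrt(N).
```

The variability of `est` across the bootstrap data sets plays the role of the frequentist error bar: it approximates how the estimator would fluctuate over the "distribution of possible data sets D" without requiring any further data to be collected.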
Although the Bayesian framework has its origins in the 18th century, the practical application of Bayesian methods was for a long time severely limited by the difficulties in carrying through the full Bayesian procedure, particularly the need to marginalize (sum or integrate) over the whole of parameter space, which, as we shall see, is required in order to make predictions or to compare different models. The development of sampling methods, such as Markov chain Monte Carlo (discussed in Chapter 11), along with dramatic improvements in the speed and memory capacity of computers, opened the door to the practical use of Bayesian techniques in an impressive range of problem domains. Monte Carlo methods are very flexible and can be applied to a wide range of models. However, they are computationally intensive and have mainly been used for small-scale problems.

More recently, highly efficient deterministic approximation schemes such as variational Bayes and expectation propagation (discussed in Chapter 10) have been developed. These offer a complementary alternative to sampling methods and have allowed Bayesian techniques to be used in large-scale applications (Blei et al., 2003).

Pierre-Simon Laplace (1749–1827): It is said that Laplace was seriously lacking in modesty and at one point declared himself to be the best mathematician in France at the time, a claim that was arguably true. As well as being prolific in mathematics, he also made numerous contributions to astronomy, including the nebular hypothesis by which the earth is thought to have formed from the condensation and cooling of a large rotating disk of gas and dust. In 1812 he published the first edition of Théorie Analytique des Probabilités, in which Laplace states that "probability theory is nothing but common sense reduced to calculation". This work included a discussion of the inverse probability calculation (later termed Bayes' theorem by Poincaré), which he used to solve problems in life expectancy, jurisprudence, planetary masses, triangulation, and error estimation.

1.2.4 The Gaussian distribution

We shall devote the whole of Chapter 2 to a study of various probability distributions and their key properties. It is convenient, however, to introduce here one of the most important probability distributions for continuous variables, called the normal or Gaussian distribution. We shall make extensive use of this distribution in the remainder of this chapter and indeed throughout much of the book.

For the case of a single real-valued variable x, the Gaussian distribution is defined by

$$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x - \mu)^2 \right\} \tag{1.46} $$

which is governed by two parameters: µ, called the mean, and σ², called the variance. The square root of the variance, given by σ, is called the standard deviation, and the reciprocal of the variance, written as β = 1/σ², is called the precision. We shall see the motivation for these terms shortly. Figure 1.13 shows a plot of the Gaussian distribution.

From the form of (1.46) we see that the Gaussian distribution satisfies

$$ \mathcal{N}(x \mid \mu, \sigma^2) > 0. \tag{1.47} $$
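As a minimal transcription of (1.46), the sketch below evaluates the univariate Gaussian density, once via the variance σ² and once via the precision β = 1/σ² as defined above, and confirms on a grid that the two agree and that the density is strictly positive, as stated in (1.47). The function names and parameter values are illustrative assumptions, not anything prescribed by the text.

```python
import numpy as np

def gaussian(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2), a direct transcription of (1.46)."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def gaussian_precision(x, mu, beta):
    """The same density parameterized by the precision beta = 1 / sigma^2."""
    return np.sqrt(beta / (2 * np.pi)) * np.exp(-0.5 * beta * (x - mu) ** 2)

if __name__ == "__main__":
    mu, sigma2 = 0.5, 2.0                   # arbitrary illustrative values
    x = np.linspace(-10, 10, 1001)
    p = gaussian(x, mu, sigma2)
    assert np.allclose(p, gaussian_precision(x, mu, 1.0 / sigma2))   # two parameterizations agree
    assert np.all(p > 0)                    # positivity, cf. (1.47)
    print("density peaks at x =", x[np.argmax(p)])                   # at the mean
```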
It is also straightforward to show that the Gaussian is normalized (Exercise 1.7), so that

$$ \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, \mathrm{d}x = 1. \tag{1.48} $$

Thus (1.46) satisfies the two requirements for a valid probability density.

Figure 1.13: Plot of the univariate Gaussian showing the mean µ and the standard deviation σ.

We can readily find expectations of functions of x under the Gaussian distribution (Exercise 1.8). In particular, the average value of x is given by

$$ \mathbb{E}[x] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x\, \mathrm{d}x = \mu. \tag{1.49} $$

Because the parameter µ represents the average value of x under the distribution, it is referred to as the mean. Similarly, for the second-order moment

$$ \mathbb{E}[x^2] = \int_{-\infty}^{\infty} \mathcal{N}(x \mid \mu, \sigma^2)\, x^2\, \mathrm{d}x = \mu^2 + \sigma^2. \tag{1.50} $$

From (1.49) and (1.50), it follows that the variance of x is given by

$$ \operatorname{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2 \tag{1.51} $$

and hence σ² is referred to as the variance parameter. The maximum of a distribution is known as its mode. For a Gaussian, the mode coincides with the mean (Exercise 1.9).

We are also interested in the Gaussian distribution defined over a D-dimensional vector x of continuous variables, which is given by

$$ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}} \frac{1}{|\boldsymbol{\Sigma}|^{1/2}} \exp\left\{ -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\} \tag{1.52} $$

where the D-dimensional vector µ is called the mean, the D × D matrix Σ is called the covariance, and |Σ| denotes the determinant of Σ. We shall make use of the multivariate Gaussian distribution briefly in this chapter, although its properties will be studied in detail in Section 2.3.
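The normalization and moment results (1.48)–(1.51), together with the multivariate density (1.52), can be checked numerically. The sketch below does this with simple Riemann sums on a wide truncated grid standing in for the infinite integration range; the grid limits, parameter values, and the two-dimensional example are arbitrary illustrative choices, not anything specified in the text.

```python
import numpy as np

def gaussian(x, mu, sigma2):
    """Univariate Gaussian density N(x | mu, sigma^2), as in (1.46)."""
    return np.exp(-0.5 * (x - mu) ** 2 / sigma2) / np.sqrt(2 * np.pi * sigma2)

def multivariate_gaussian(x, mu, Sigma):
    """Multivariate Gaussian density N(x | mu, Sigma), a direct transcription of (1.52)."""
    D = mu.shape[0]
    diff = x - mu
    quad = diff @ np.linalg.solve(Sigma, diff)              # (x - mu)^T Sigma^{-1} (x - mu)
    norm = (2 * np.pi) ** (D / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

if __name__ == "__main__":
    mu, sigma2 = 1.0, 0.5                                   # arbitrary illustrative values
    x = np.linspace(mu - 20, mu + 20, 200001)               # wide grid standing in for (-inf, inf)
    dx = x[1] - x[0]
    p = gaussian(x, mu, sigma2)
    first_moment = np.sum(p * x) * dx
    second_moment = np.sum(p * x ** 2) * dx
    print("integral of the density:", np.sum(p) * dx)       # ~1,               cf. (1.48)
    print("E[x]                   :", first_moment)          # ~mu,              cf. (1.49)
    print("E[x^2]                 :", second_moment)         # ~mu^2 + sigma^2,  cf. (1.50)
    print("var[x]                 :", second_moment - first_moment ** 2)  # ~sigma^2, cf. (1.51)

    mu_vec = np.array([0.0, 1.0])                           # a made-up 2-dimensional example
    Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
    print("N(mu | mu, Sigma)      :", multivariate_gaussian(mu_vec, mu_vec, Sigma))
```

Using `np.linalg.solve` rather than forming Σ⁻¹ explicitly is a standard numerical choice for evaluating the quadratic form in (1.52); it computes the same quantity while avoiding an explicit matrix inverse.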