Table 1.2  Table of the coefficients w⋆ for M = 9 polynomials with various values for the regularization parameter λ. Note that ln λ = −∞ corresponds to a model with no regularization, i.e., to the graph at the bottom right in Figure 1.4. We see that, as the value of λ increases, the typical magnitude of the coefficients gets smaller.

                ln λ = −∞    ln λ = −18    ln λ = 0
    w⋆_0             0.35          0.35        0.13
    w⋆_1           232.37          4.74       -0.05
    w⋆_2         -5321.83         -0.77       -0.06
    w⋆_3         48568.31        -31.97       -0.05
    w⋆_4       -231639.30         -3.89       -0.03
    w⋆_5        640042.26         55.28       -0.02
    w⋆_6      -1061800.52         41.32       -0.01
    w⋆_7       1042400.18        -45.95       -0.00
    w⋆_8       -557682.99        -91.53        0.00
    w⋆_9        125201.43         72.68        0.01

the magnitude of the coefficients.

The impact of the regularization term on the generalization error can be seen by plotting the value of the RMS error (1.3) for both training and test sets against ln λ, as shown in Figure 1.8. We see that in effect λ now controls the effective complexity of the model and hence determines the degree of over-fitting.

Figure 1.8  Graph of the root-mean-square error E_RMS (1.3) versus ln λ for the M = 9 polynomial, showing curves for both the training set and the test set.

The issue of model complexity is an important one and will be discussed at length in Section 1.3. Here we simply note that, if we were trying to solve a practical application using this approach of minimizing an error function, we would have to find a way to determine a suitable value for the model complexity. The results above suggest a simple way of achieving this, namely by taking the available data and partitioning it into a training set, used to determine the coefficients w, and a separate validation set, also called a hold-out set, used to optimize the model complexity (either M or λ). In many cases, however, this will prove to be too wasteful of valuable training data, and we have to seek more sophisticated approaches (Section 1.3).
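As a rough sketch of this hold-out procedure (an illustration, not taken from the text), the following snippet fits an M = 9 polynomial with a quadratic regularizer to synthetic data drawn from sin(2πx) plus noise, and then chooses ln λ by the RMS error on a separate validation set. The noise level, the data set sizes, and the grid of ln λ values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Noisy samples of sin(2*pi*x), mirroring the running curve-fitting example
    # (the noise scale 0.3 is an assumption for this sketch).
    x = rng.uniform(0.0, 1.0, n)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)
    return x, t

def design_matrix(x, M):
    # Columns are the powers x^0, x^1, ..., x^M.
    return np.vander(x, M + 1, increasing=True)

def fit_ridge(x, t, M, lam):
    # Minimise ||Phi w - t||^2 + lam * ||w||^2 via the normal equations.
    Phi = design_matrix(x, M)
    A = Phi.T @ Phi + lam * np.eye(M + 1)
    return np.linalg.solve(A, Phi.T @ t)

def rms_error(w, x, t, M):
    # Root-mean-square error E_RMS, i.e. the square root of the mean squared residual.
    y = design_matrix(x, M) @ w
    return np.sqrt(np.mean((y - t) ** 2))

M = 9
x_train, t_train = make_data(10)   # small training set, prone to over-fitting
x_val, t_val = make_data(100)      # hold-out set used only to choose lambda

best = None
for ln_lam in [-np.inf, -18, -15, -10, -5, 0]:
    lam = 0.0 if ln_lam == -np.inf else np.exp(ln_lam)
    w = fit_ridge(x_train, t_train, M, lam)
    err = rms_error(w, x_val, t_val, M)
    print(f"ln lambda = {ln_lam:>6}: validation E_RMS = {err:.3f}")
    if best is None or err < best[1]:
        best = (ln_lam, err)

print("chosen ln lambda:", best[0])
```

The particular value selected will depend on the noise and on the size of the training set; the point of the sweep is simply that the validation error, unlike the training error, is not driven steadily downwards as the model is made more flexible.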
So far our discussion of polynomial curve fitting has appealed largely to intuition. We now seek a more principled approach to solving problems in pattern recognition by turning to a discussion of probability theory. As well as providing the foundation for nearly all of the subsequent developments in this book, it will also give us some important insights into the concepts we have introduced in the context of polynomial curve fitting and will allow us to extend these to more complex situations.

1.2. Probability Theory

A key concept in the field of pattern recognition is that of uncertainty. It arises both through noise on measurements, as well as through the finite size of data sets. Probability theory provides a consistent framework for the quantification and manipulation of uncertainty and forms one of the central foundations for pattern recognition. When combined with decision theory, discussed in Section 1.5, it allows us to make optimal predictions given all the information available to us, even though that information may be incomplete or ambiguous.

We will introduce the basic concepts of probability theory by considering a simple example. Imagine we have two boxes, one red and one blue, and in the red box we have 2 apples and 6 oranges, and in the blue box we have 3 apples and 1 orange. This is illustrated in Figure 1.9. Now suppose we randomly pick one of the boxes and from that box we randomly select an item of fruit, and having observed which sort of fruit it is we replace it in the box from which it came. We could imagine repeating this process many times. Let us suppose that in so doing we pick the red box 40% of the time and we pick the blue box 60% of the time, and that when we remove an item of fruit from a box we are equally likely to select any of the pieces of fruit in the box.

Figure 1.9  We use a simple example of two coloured boxes each containing fruit (apples shown in green and oranges shown in orange) to introduce the basic ideas of probability.

In this example, the identity of the box that will be chosen is a random variable, which we shall denote by B. This random variable can take one of two possible values, namely r (corresponding to the red box) or b (corresponding to the blue box). Similarly, the identity of the fruit is also a random variable and will be denoted by F. It can take either of the values a (for apple) or o (for orange).

To begin with, we shall define the probability of an event to be the fraction of times that event occurs out of the total number of trials, in the limit that the total number of trials goes to infinity. Thus the probability of selecting the red box is 4/10 and the probability of selecting the blue box is 6/10. We write these probabilities as p(B = r) = 4/10 and p(B = b) = 6/10.
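This frequency definition of probability lends itself to a quick check by simulation. The sketch below (my own illustration, not part of the text) repeatedly picks a box with probabilities 0.4 and 0.6 and then a fruit uniformly at random from that box, and reports the observed fractions, which should approach the corresponding probabilities as the number of trials grows.

```python
import random

random.seed(0)

BOXES = {"r": ["a"] * 2 + ["o"] * 6,   # red box: 2 apples, 6 oranges
         "b": ["a"] * 3 + ["o"] * 1}   # blue box: 3 apples, 1 orange

def trial():
    # Pick a box (red 40% of the time, blue 60%), then a fruit uniformly from that box.
    box = "r" if random.random() < 0.4 else "b"
    fruit = random.choice(BOXES[box])
    return box, fruit

N = 1_000_000
n_red = n_apple = 0
for _ in range(N):
    box, fruit = trial()
    n_red += box == "r"
    n_apple += fruit == "a"

print("observed fraction with B = r:", n_red / N)     # should approach p(B = r) = 4/10
print("observed fraction with F = a:", n_apple / N)   # overall probability of picking an apple
```

The second printed fraction estimates the answer to the first of the questions posed below; the rules developed in the remainder of this section give the same answer exactly.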
Note that, by definition, probabilities must lie in the interval [0, 1]. Also, if the events are mutually exclusive and if they include all possible outcomes (for instance, in this example the box must be either red or blue), then we see that the probabilities for those events must sum to one.

We can now ask questions such as: "what is the overall probability that the selection procedure will pick an apple?", or "given that we have chosen an orange, what is the probability that the box we chose was the blue one?". We can answer questions such as these, and indeed much more complex questions associated with problems in pattern recognition, once we have equipped ourselves with the two elementary rules of probability, known as the sum rule and the product rule. Having obtained these rules, we shall then return to our boxes of fruit example.

In order to derive the rules of probability, consider the slightly more general example shown in Figure 1.10 involving two random variables X and Y (which could for instance be the Box and Fruit variables considered above). We shall suppose that X can take any of the values x_i where i = 1, ..., M, and Y can take the values y_j where j = 1, ..., L. Consider a total of N trials in which we sample both of the variables X and Y, and let the number of such trials in which X = x_i and Y = y_j be n_ij. Also, let the number of trials in which X takes the value x_i (irrespective of the value that Y takes) be denoted by c_i, and similarly let the number of trials in which Y takes the value y_j be denoted by r_j.

Figure 1.10  We can derive the sum and product rules of probability by considering two random variables, X, which takes the values {x_i} where i = 1, ..., M, and Y, which takes the values {y_j} where j = 1, ..., L. In this illustration we have M = 5 and L = 3. If we consider a total number N of instances of these variables, then we denote the number of instances where X = x_i and Y = y_j by n_ij, which is the number of points in the corresponding cell of the array. The number of points in column i, corresponding to X = x_i, is denoted by c_i, and the number of points in row j, corresponding to Y = y_j, is denoted by r_j.

The probability that X will take the value x_i and Y will take the value y_j is written p(X = x_i, Y = y_j) and is called the joint probability of X = x_i and Y = y_j. It is given by the number of points falling in the cell i, j as a fraction of the total number of points, and hence

    p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}.    (1.5)

Here we are implicitly considering the limit N → ∞. Similarly, the probability that X takes the value x_i irrespective of the value of Y is written as p(X = x_i) and is given by the fraction of the total number of points that fall in column i, so that

    p(X = x_i) = \frac{c_i}{N}.    (1.6)
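Equations (1.5) and (1.6) can be checked mechanically on an array of counts. The sketch below (with arbitrary invented counts, purely for illustration) uses an M = 5 by L = 3 array n_ij, as in Figure 1.10, and forms the joint and marginal probability estimates by dividing by N.

```python
import numpy as np

rng = np.random.default_rng(1)

M, L = 5, 3                            # as in Figure 1.10
n = rng.integers(1, 10, size=(M, L))   # n[i, j] = number of trials with X = x_i and Y = y_j
N = n.sum()                            # total number of trials

p_joint = n / N                        # p(X = x_i, Y = y_j) = n_ij / N, equation (1.5)
c = n.sum(axis=1)                      # column totals c_i, the counts for X = x_i
p_X = c / N                            # p(X = x_i) = c_i / N, equation (1.6)

# Summing the joint over j recovers the marginal, because c_i = sum_j n_ij.
assert np.allclose(p_joint.sum(axis=1), p_X)
print("p(X = x_i) for i = 1,...,M:", p_X)
```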
Because the number of instances in column i in Figure 1.10 is just the sum of the number of instances in each cell of that column, we have c_i = \sum_j n_{ij} and therefore, from (1.5) and (1.6), we have

    p(X = x_i) = \sum_{j=1}^{L} p(X = x_i, Y = y_j)    (1.7)

which is the sum rule of probability. Note that p(X = x_i) is sometimes called the marginal probability, because it is obtained by marginalizing, or summing out, the other variables (in this case Y).

If we consider only those instances for which X = x_i, then the fraction of such instances for which Y = y_j is written p(Y = y_j | X = x_i) and is called the conditional probability of Y = y_j given X = x_i. It is obtained by finding the fraction of those points in column i that fall in cell i, j and hence is given by

    p(Y = y_j | X = x_i) = \frac{n_{ij}}{c_i}.    (1.8)

From (1.5), (1.6), and (1.8), we can then derive the following relationship

    p(X = x_i, Y = y_j) = \frac{n_{ij}}{N} = \frac{n_{ij}}{c_i} \cdot \frac{c_i}{N} = p(Y = y_j | X = x_i) p(X = x_i)    (1.9)

which is the product rule of probability.

So far we have been quite careful to make a distinction between a random variable, such as the box B in the fruit example, and the values that the random variable can take, for example r if the box were the red one. Thus the probability that B takes the value r is denoted p(B = r). Although this helps to avoid ambiguity, it leads to a rather cumbersome notation, and in many cases there will be no need for such pedantry. Instead, we may simply write p(B) to denote a distribution over the random variable B, or p(r) to denote the distribution evaluated for the particular value r, provided that the interpretation is clear from the context.

With this more compact notation, we can write the two fundamental rules of probability theory in the following form.

The Rules of Probability

    sum rule       p(X) = \sum_Y p(X, Y)    (1.10)

    product rule   p(X, Y) = p(Y | X) p(X).    (1.11)

Here p(X, Y) is a joint probability and is verbalized as "the probability of X and Y". Similarly, the quantity p(Y | X) is a conditional probability and is verbalized as "the probability of Y given X", whereas the quantity p(X) is a marginal probability and is simply "the probability of X".
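Continuing the same illustration (again with made-up counts, not taken from the text), the conditional probability (1.8) can be formed from the same count array, and the sum rule (1.10) and product rule (1.11) then hold exactly for these frequency estimates, as the following sketch confirms.

```python
import numpy as np

rng = np.random.default_rng(2)

n = rng.integers(1, 10, size=(5, 3))   # counts n_ij: axis 0 indexes x_i, axis 1 indexes y_j
N = n.sum()
c = n.sum(axis=1)                      # c_i = sum_j n_ij

p_joint = n / N                        # p(X = x_i, Y = y_j), equation (1.5)
p_X = c / N                            # p(X = x_i), equation (1.6)
p_Y_given_X = n / c[:, None]           # p(Y = y_j | X = x_i) = n_ij / c_i, equation (1.8)

# Sum rule (1.10): marginalizing the joint over Y gives p(X).
assert np.allclose(p_joint.sum(axis=1), p_X)

# Product rule (1.11): the joint factorizes as p(Y|X) p(X).
assert np.allclose(p_joint, p_Y_given_X * p_X[:, None])

# Each conditional distribution over Y sums to one for every value of X.
assert np.allclose(p_Y_given_X.sum(axis=1), 1.0)
print("sum and product rules verified on the frequency estimates")
```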
These two simple rules form the basis for all of the probabilistic machinery that we use throughout this book.

From the product rule, together with the symmetry property p(X, Y) = p(Y, X), we immediately obtain the following relationship between conditional probabilities

    p(Y | X) = \frac{p(X | Y) p(Y)}{p(X)}    (1.12)

which is called Bayes' theorem and which plays a central role in pattern recognition and machine learning. Using the sum rule, the denominator in Bayes' theorem can be expressed in terms of the quantities appearing in the numerator

    p(X) = \sum_Y p(X | Y) p(Y).    (1.13)

We can view the denominator in Bayes' theorem as being the normalization constant required to ensure that the sum of the conditional probability on the left-hand side of (1.12) over all values of Y equals one.

In Figure 1.11, we show a simple example involving a joint distribution over two variables to illustrate the concept of marginal and conditional distributions. Here a finite sample of N = 60 data points has been drawn from the joint distribution and is shown in the top left. In the top right is a histogram of the fractions of data points having each of the two values of Y. From the definition of probability, these fractions would equal the corresponding probabilities p(Y) in the limit N → ∞. We can view the histogram as a simple way to model a probability distribution given only a finite number of points drawn from that distribution. Modelling distributions from data lies at the heart of statistical pattern recognition and will be explored in great detail in this book. The remaining two plots in Figure 1.11 show the corresponding histogram estimates of p(X) and p(X | Y = 1).

Let us now return to our example involving boxes of fruit. For the moment, we shall once again be explicit about distinguishing between the random variables and their instantiations. We have seen that the probabilities of selecting either the red or the blue boxes are given by

    p(B = r) = 4/10    (1.14)
    p(B = b) = 6/10    (1.15)

respectively. Note that these satisfy p(B = r) + p(B = b) = 1.

Now suppose that we pick a box at random, and it turns out to be the blue box. Then the probability of selecting an apple is just the fraction of apples in the blue box, which is 3/4, and so p(F = a | B = b) = 3/4. In fact, we can write out all four conditional probabilities for the type of fruit, given the selected box:

    p(F = a | B = r) = 1/4    (1.16)
    p(F = o | B = r) = 3/4    (1.17)
    p(F = a | B = b) = 3/4    (1.18)
    p(F = o | B = b) = 1/4.    (1.19)
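These probabilities are enough to answer the two questions posed earlier: the overall probability that the selection procedure picks an apple, and the probability that the box was the blue one given that an orange was selected. The sketch below (an illustration of how the rules combine, not a reproduction of the text's worked solution) applies the sum rule, the product rule, and Bayes' theorem to the probabilities (1.14)-(1.19), using exact fractions.

```python
from fractions import Fraction as F

# Prior probabilities of choosing each box, equations (1.14) and (1.15).
p_B = {"r": F(4, 10), "b": F(6, 10)}

# Conditional probabilities of the fruit given the box, equations (1.16)-(1.19).
p_F_given_B = {("a", "r"): F(1, 4), ("o", "r"): F(3, 4),
               ("a", "b"): F(3, 4), ("o", "b"): F(1, 4)}

def p_F(fruit):
    # Sum and product rules combined, as in (1.13): p(F) = sum_B p(F|B) p(B).
    return sum(p_F_given_B[(fruit, box)] * p_B[box] for box in p_B)

def p_B_given_F(box, fruit):
    # Bayes' theorem (1.12): p(B|F) = p(F|B) p(B) / p(F).
    return p_F_given_B[(fruit, box)] * p_B[box] / p_F(fruit)

print("p(F = a) =", p_F("a"))                       # overall probability of picking an apple
print("p(B = b | F = o) =", p_B_given_F("b", "o"))  # probability the box was blue, given an orange
print("p(B = r | F = o) =", p_B_given_F("r", "o"))  # ...and that it was red
```

The two posterior probabilities printed at the end necessarily sum to one, which is just the normalization property of the denominator in (1.12) noted above.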