20 STATISTICAL LEARNING METHODS

In which we view learning as a form of uncertain reasoning from observations.

Part V pointed out the prevalence of uncertainty in real environments. Agents can handle uncertainty by using the methods of probability and decision theory, but first they must learn their probabilistic theories of the world from experience. This chapter explains how they can do that. We will see how to formulate the learning task itself as a process of probabilistic inference (Section 20.1). We will see that a Bayesian view of learning is extremely powerful, providing general solutions to the problems of noise, overfitting, and optimal prediction. It also takes into account the fact that a less-than-omniscient agent can never be certain about which theory of the world is correct, yet must still make decisions by using some theory of the world.

We describe methods for learning probability models—primarily Bayesian networks—in Sections 20.2 and 20.3. Section 20.4 looks at learning methods that store and recall specific instances. Section 20.5 covers neural network learning and Section 20.6 introduces kernel machines. Some of the material in this chapter is fairly mathematical (requiring a basic understanding of multivariate calculus), although the general lessons can be understood without plunging into the details. It may benefit the reader at this point to review the material in Chapters 13 and 14 and to peek at the mathematical background in Appendix A.

20.1 STATISTICAL LEARNING

The key concepts in this chapter, just as in Chapter 18, are data and hypotheses. Here, the data are evidence—that is, instantiations of some or all of the random variables describing the domain. The hypotheses are probabilistic theories of how the domain works, including logical theories as a special case.

Let us consider a very simple example. Our favorite Surprise candy comes in two flavors: cherry (yum) and lime (ugh). The candy manufacturer has a peculiar sense of humor and wraps each piece of candy in the same opaque wrapper, regardless of flavor. The candy is sold in very large bags, of which there are known to be five kinds—again, indistinguishable from the outside:
h1: 100% cherry
h2: 75% cherry + 25% lime
h3: 50% cherry + 50% lime
h4: 25% cherry + 75% lime
h5: 100% lime

Given a new bag of candy, the random variable H (for hypothesis) denotes the type of the bag, with possible values h1 through h5. H is not directly observable, of course. As the pieces of candy are opened and inspected, data are revealed—D1, D2, ..., DN, where each Di is a random variable with possible values cherry and lime. The basic task faced by the agent is to predict the flavor of the next piece of candy.¹ Despite its apparent triviality, this scenario serves to introduce many of the major issues. The agent really does need to infer a theory of its world, albeit a very simple one.

Bayesian learning simply calculates the probability of each hypothesis, given the data, and makes predictions on that basis. That is, the predictions are made by using all the hypotheses, weighted by their probabilities, rather than by using just a single "best" hypothesis. In this way, learning is reduced to probabilistic inference. Let D represent all the data, with observed value d; then the probability of each hypothesis is obtained by Bayes' rule:

    P(h_i \mid d) = \alpha P(d \mid h_i) P(h_i)                                   (20.1)

Now, suppose we want to make a prediction about an unknown quantity X. Then we have

    P(X \mid d) = \sum_i P(X \mid d, h_i) P(h_i \mid d) = \sum_i P(X \mid h_i) P(h_i \mid d)   (20.2)

where we have assumed that each hypothesis determines a probability distribution over X. This equation shows that predictions are weighted averages over the predictions of the individual hypotheses. The hypotheses themselves are essentially "intermediaries" between the raw data and the predictions. The key quantities in the Bayesian approach are the hypothesis prior, P(hi), and the likelihood of the data under each hypothesis, P(d|hi).

For our candy example, we will assume for the time being that the prior distribution over h1, ..., h5 is given by ⟨0.1, 0.2, 0.4, 0.2, 0.1⟩, as advertised by the manufacturer. The likelihood of the data is calculated under the assumption that the observations are i.i.d.—that is, independently and identically distributed—so that

    P(d \mid h_i) = \prod_j P(d_j \mid h_i)                                       (20.3)

For example, suppose the bag is really an all-lime bag (h5) and the first 10 candies are all lime; then P(d|h3) is 0.5^{10}, because half the candies in an h3 bag are lime.²

¹ Statistically sophisticated readers will recognize this scenario as a variant of the urn-and-ball setup. We find urns and balls less compelling than candy; furthermore, candy lends itself to other tasks, such as deciding whether to trade the bag with a friend—see Exercise 20.3.
² We stated earlier that the bags of candy are very large; otherwise, the i.i.d. assumption fails to hold. Technically, it is more correct (but less hygienic) to rewrap each candy after inspection and return it to the bag.
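Before turning to Figure 20.1, it may help to see Equations (20.1)-(20.3) run on the candy example. The following sketch is illustrative only and is not part of the text: the function names and the lime-probability table are our own, chosen simply to encode the five bag types and the prior ⟨0.1, 0.2, 0.4, 0.2, 0.1⟩ given above.

```python
# Bayesian learning for the candy example (Equations 20.1-20.3).
# Each hypothesis h_i is summarized by its probability of producing a lime candy.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]      # P(h_1), ..., P(h_5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]    # P(lime | h_i) for the five bag types

def posterior(observations, priors=priors, p_lime=p_lime):
    """P(h_i | d) by Bayes' rule with i.i.d. observations (Eqs. 20.1 and 20.3)."""
    unnorm = []
    for prior, pl in zip(priors, p_lime):
        likelihood = 1.0
        for obs in observations:                       # product over data points
            likelihood *= pl if obs == "lime" else (1.0 - pl)
        unnorm.append(prior * likelihood)
    alpha = 1.0 / sum(unnorm)                          # normalization constant
    return [alpha * u for u in unnorm]

def predict_lime(observations):
    """P(next = lime | d): average the hypotheses' predictions (Eq. 20.2)."""
    post = posterior(observations)
    return sum(p * pl for p, pl in zip(post, p_lime))

# Observe 10 lime candies in a row, updating after each one.
data = []
for n in range(1, 11):
    data.append("lime")
    print(n, [round(p, 3) for p in posterior(data)], round(predict_lime(data), 3))
```

Running this prints, for N = 1, ..., 10 lime candies, the five posterior probabilities and the predicted probability that the next candy is lime, which is exactly the information plotted in Figure 20.1 below.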
Figure 20.1 (a) Posterior probabilities P(hi | d1, ..., dN) from Equation (20.1). The number of observations N ranges from 1 to 10, and each observation is of a lime candy. (b) Bayesian prediction P(dN+1 = lime | d1, ..., dN) from Equation (20.2). [Panel (a) plots the posterior probability of each hypothesis, and panel (b) the probability that the next candy is lime, each against the number of samples in d.]

Figure 20.1(a) shows how the posterior probabilities of the five hypotheses change as the sequence of 10 lime candies is observed. Notice that the probabilities start out at their prior values, so h3 is initially the most likely choice and remains so after 1 lime candy is unwrapped. After 2 lime candies are unwrapped, h4 is most likely; after 3 or more, h5 (the dreaded all-lime bag) is the most likely. After 10 in a row, we are fairly certain of our fate. Figure 20.1(b) shows the predicted probability that the next candy is lime, based on Equation (20.2). As we would expect, it increases monotonically toward 1.

The example shows that the true hypothesis eventually dominates the Bayesian prediction. This is characteristic of Bayesian learning. For any fixed prior that does not rule out the true hypothesis, the posterior probability of any false hypothesis will eventually vanish, simply because the probability of generating "uncharacteristic" data indefinitely is vanishingly small. (This point is analogous to one made in the discussion of PAC learning in Chapter 18.) More importantly, the Bayesian prediction is optimal, whether the data set be small or large. Given the hypothesis prior, any other prediction will be correct less often.

The optimality of Bayesian learning comes at a price, of course. For real learning problems, the hypothesis space is usually very large or infinite, as we saw in Chapter 18. In some cases, the summation in Equation (20.2) (or integration, in the continuous case) can be carried out tractably, but in most cases we must resort to approximate or simplified methods.

A very common approximation—one that is usually adopted in science—is to make predictions based on a single most probable hypothesis—that is, an hi that maximizes P(hi | d). This is often called a maximum a posteriori or MAP (pronounced "em-ay-pee") hypothesis. Predictions made according to an MAP hypothesis hMAP are approximately Bayesian to the extent that P(X | d) ≈ P(X | hMAP). In our candy example, hMAP = h5 after three lime candies in a row, so the MAP learner then predicts that the fourth candy is lime with probability 1.0—a much more dangerous prediction than the Bayesian prediction of 0.8 shown in Figure 20.1. As more data arrive, the MAP and Bayesian predictions become closer, because the competitors to the MAP hypothesis become less and less probable. Although our example doesn't show it, finding MAP hypotheses is often much easier than Bayesian learning, because it requires solving an optimization problem instead of a large summation (or integration) problem. We will see examples of this later in the chapter.
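To illustrate the gap just described between the MAP approximation and the full Bayesian prediction, the short sketch below (again hypothetical, and assuming the `posterior` and `p_lime` definitions from the previous sketch) compares the two after three lime candies: the MAP hypothesis is h5, which predicts lime with probability 1.0, while the Bayesian mixture predicts roughly 0.8.

```python
# MAP versus Bayesian prediction after three lime candies in a row.
# Assumes the `posterior` and `p_lime` definitions from the previous sketch.

data = ["lime", "lime", "lime"]
post = posterior(data)

map_index = max(range(len(post)), key=lambda i: post[i])       # argmax_i P(h_i | d)
map_prediction = p_lime[map_index]                             # P(lime | h_MAP)
bayes_prediction = sum(p * pl for p, pl in zip(post, p_lime))  # Equation (20.2)

print("h_MAP = h%d" % (map_index + 1))                     # h5
print("MAP prediction:", map_prediction)                   # 1.0
print("Bayesian prediction:", round(bayes_prediction, 3))  # about 0.796
```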
In both Bayesian learning and MAP learning, the hypothesis prior P(hi) plays an important role. We saw in Chapter 18 that overfitting can occur when the hypothesis space is too expressive, so that it contains many hypotheses that fit the data set well. Rather than placing an arbitrary limit on the hypotheses to be considered, Bayesian and MAP learning methods use the prior to penalize complexity. Typically, more complex hypotheses have a lower prior probability—in part because there are usually many more complex hypotheses than simple ones. On the other hand, more complex hypotheses have a greater capacity to fit the data. (In the extreme case, a lookup table can reproduce the data exactly with probability 1.) Hence, the hypothesis prior embodies a trade-off between the complexity of a hypothesis and its degree of fit to the data.

We can see the effect of this trade-off most clearly in the logical case, where H contains only deterministic hypotheses. In that case, P(d|hi) is 1 if hi is consistent and 0 otherwise. Looking at Equation (20.1), we see that hMAP will then be the simplest logical theory that is consistent with the data. Therefore, maximum a posteriori learning provides a natural embodiment of Ockham's razor.

Another insight into the trade-off between complexity and degree of fit is obtained by taking the logarithm of Equation (20.1). Choosing hMAP to maximize P(d|hi)P(hi) is equivalent to minimizing

    -\log_2 P(d \mid h_i) - \log_2 P(h_i)

Using the connection between information encoding and probability that we introduced in Chapter 18, we see that the -log2 P(hi) term equals the number of bits required to specify the hypothesis hi. Furthermore, -log2 P(d|hi) is the additional number of bits required to specify the data, given the hypothesis. (To see this, consider that no bits are required if the hypothesis predicts the data exactly—as with h5 and the string of lime candies—and log2 1 = 0.) Hence, MAP learning is choosing the hypothesis that provides maximum compression of the data. The same task is addressed more directly by the minimum description length, or MDL, learning method, which attempts to minimize the size of hypothesis and data encodings rather than work with probabilities.

A final simplification is provided by assuming a uniform prior over the space of hypotheses. In that case, MAP learning reduces to choosing an hi that maximizes P(d|hi). This is called a maximum-likelihood (ML) hypothesis, hML. Maximum-likelihood learning is very common in statistics, a discipline in which many researchers distrust the subjective nature of hypothesis priors. It is a reasonable approach when there is no reason to prefer one hypothesis over another a priori—for example, when all hypotheses are equally complex. It provides a good approximation to Bayesian and MAP learning when the data set is large, because the data swamps the prior distribution over hypotheses, but it has problems (as we shall see) with small data sets.
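As a numerical illustration of this information-theoretic reading (a sketch of our own, reusing the prior and the ten-lime data from the running example), the snippet below computes -log2 P(hi) - log2 P(d|hi) for each hypothesis; the hypothesis minimizing the total number of bits is exactly the one maximizing P(d|hi)P(hi), and dropping the prior term corresponds to the maximum-likelihood choice under a uniform prior.

```python
import math

# Description-length view of MAP learning for the candy example.
# Total bits = bits to encode the hypothesis + bits to encode the data given it.

priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
n_limes = 10                                   # ten lime candies observed

for i, (prior, pl) in enumerate(zip(priors, p_lime), start=1):
    likelihood = pl ** n_limes                 # P(d | h_i) for an all-lime sequence
    if likelihood == 0.0:
        print("h%d: rules out the data (infinite description length)" % i)
        continue
    hyp_bits = -math.log2(prior)               # -log2 P(h_i)
    data_bits = -math.log2(likelihood)         # -log2 P(d | h_i)
    print("h%d: %.2f + %.2f = %.2f bits" % (i, hyp_bits, data_bits, hyp_bits + data_bits))
```

Here h5 needs about 3.3 bits in total (zero bits for the data it predicts exactly), which is the minimum, matching the fact that h5 is the MAP hypothesis after ten lime candies.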
20.2 LEARNING WITH COMPLETE DATA

Our development of statistical learning methods begins with the simplest task: parameter learning with complete data. A parameter learning task involves finding the numerical parameters for a probability model whose structure is fixed. For example, we might be interested in learning the conditional probabilities in a Bayesian network with a given structure. Data are complete when each data point contains values for every variable in the probability model being learned. Complete data greatly simplify the problem of learning the parameters of a complex model. We will also look briefly at the problem of learning structure.

Maximum-likelihood parameter learning: Discrete models

Suppose we buy a bag of lime and cherry candy from a new manufacturer whose lime–cherry proportions are completely unknown—that is, the fraction could be anywhere between 0 and 1. In that case, we have a continuum of hypotheses. The parameter in this case, which we call θ, is the proportion of cherry candies, and the hypothesis is hθ. (The proportion of limes is just 1 − θ.) If we assume that all proportions are equally likely a priori, then a maximum-likelihood approach is reasonable. If we model the situation with a Bayesian network, we need just one random variable, Flavor (the flavor of a randomly chosen candy from the bag). It has values cherry and lime, where the probability of cherry is θ (see Figure 20.2(a)). Now suppose we unwrap N candies, of which c are cherries and ℓ = N − c are limes. According to Equation (20.3), the likelihood of this particular data set is

    P(d \mid h_\theta) = \prod_{j=1}^{N} P(d_j \mid h_\theta) = \theta^{c} (1 - \theta)^{\ell}

The maximum-likelihood hypothesis is given by the value of θ that maximizes this expression. The same value is obtained by maximizing the log likelihood,

    L(d \mid h_\theta) = \log P(d \mid h_\theta) = \sum_{j=1}^{N} \log P(d_j \mid h_\theta) = c \log\theta + \ell \log(1 - \theta)

(By taking logarithms, we reduce the product to a sum over the data, which is usually easier to maximize.) To find the maximum-likelihood value of θ, we differentiate L with respect to θ and set the resulting expression to zero:

    \frac{dL(d \mid h_\theta)}{d\theta} = \frac{c}{\theta} - \frac{\ell}{1 - \theta} = 0 \quad\Rightarrow\quad \theta = \frac{c}{c + \ell} = \frac{c}{N}

In English, then, the maximum-likelihood hypothesis hML asserts that the actual proportion of cherries in the bag is equal to the observed proportion in the candies unwrapped so far! It appears that we have done a lot of work to discover the obvious. In fact, though, we have laid out one standard method for maximum-likelihood parameter learning (a quick numerical check follows the list below):

1. Write down an expression for the likelihood of the data as a function of the parameter(s).
2. Write down the derivative of the log likelihood with respect to each parameter.
3. Find the parameter values such that the derivatives are zero.
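The closed-form result θ = c/N can be checked numerically. The sketch below is illustrative only, and the counts c and ℓ are hypothetical; it evaluates the log likelihood c log θ + ℓ log(1 − θ) on a fine grid over (0, 1) and confirms that the grid maximum coincides with the observed proportion of cherries.

```python
import math

# Numerical check that the log likelihood c*log(theta) + ell*log(1 - theta)
# is maximized at theta = c / (c + ell).

c, ell = 7, 3                                  # hypothetical counts: 7 cherries, 3 limes

def log_likelihood(theta):
    return c * math.log(theta) + ell * math.log(1.0 - theta)

grid = [k / 10000 for k in range(1, 10000)]    # theta strictly inside (0, 1)
best = max(grid, key=log_likelihood)

print("grid maximum:", best)                   # 0.7
print("closed form c/N:", c / (c + ell))       # 0.7
```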