Figure 20.2  (a) Bayesian network model for the case of candies with an unknown proportion of cherries and limes. (b) Model for the case where the wrapper color depends (probabilistically) on the candy flavor.

The trickiest step is usually the last. In our example, it was trivial, but we will see that in many cases we need to resort to iterative solution algorithms or other numerical optimization techniques, as described in Chapter 4. The example also illustrates a significant problem with maximum-likelihood learning in general: when the data set is small enough that some events have not yet been observed (for instance, no cherry candies), the maximum-likelihood hypothesis assigns zero probability to those events. Various tricks are used to avoid this problem, such as initializing the counts for each event to 1 instead of zero.

Let us look at another example. Suppose this new candy manufacturer wants to give a little hint to the consumer and uses candy wrappers colored red and green. The Wrapper for each candy is selected probabilistically, according to some unknown conditional distribution, depending on the flavor. The corresponding probability model is shown in Figure 20.2(b). Notice that it has three parameters: θ, θ₁, and θ₂. With these parameters, the likelihood of seeing, say, a cherry candy in a green wrapper can be obtained from the standard semantics for Bayesian networks (page 495):

    P(Flavor = cherry, Wrapper = green | h_{θ,θ₁,θ₂})
        = P(Flavor = cherry | h_{θ,θ₁,θ₂}) P(Wrapper = green | Flavor = cherry, h_{θ,θ₁,θ₂})
        = θ · (1 − θ₁) .

Now, we unwrap N candies, of which c are cherries and ℓ are limes. The wrapper counts are as follows: r_c of the cherries have red wrappers and g_c have green, while r_ℓ of the limes have red and g_ℓ have green. The likelihood of the data is given by

    P(d | h_{θ,θ₁,θ₂}) = θ^c (1 − θ)^ℓ · θ₁^{r_c} (1 − θ₁)^{g_c} · θ₂^{r_ℓ} (1 − θ₂)^{g_ℓ} .

This looks pretty horrible, but taking logarithms helps:

    L = [c log θ + ℓ log(1 − θ)] + [r_c log θ₁ + g_c log(1 − θ₁)] + [r_ℓ log θ₂ + g_ℓ log(1 − θ₂)] .

The benefit of taking logs is clear: the log likelihood is the sum of three terms, each of which contains a single parameter.
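To make the decomposition concrete, here is a minimal Python sketch that evaluates L as the sum of the three bracketed terms; the counts and parameter values are invented for illustration and are not taken from the text.

    import math

    def log_likelihood(theta, theta1, theta2, c, l, rc, gc, rl, gl):
        """Log likelihood of the wrapped-candy data under h_{theta, theta1, theta2}.

        Each bracketed term depends on exactly one parameter, which is what
        lets the maximization decompose into three separate problems.
        """
        flavor_term = c * math.log(theta) + l * math.log(1 - theta)
        cherry_term = rc * math.log(theta1) + gc * math.log(1 - theta1)
        lime_term = rl * math.log(theta2) + gl * math.log(1 - theta2)
        return flavor_term + cherry_term + lime_term

    # Hypothetical counts from unwrapping N = 100 candies.
    counts = dict(c=60, l=40, rc=45, gc=15, rl=10, gl=30)
    print(log_likelihood(0.6, 0.75, 0.25, **counts))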
When we take derivatives with respect to each parameter and set them to zero, we get three independent equations, each containing just one parameter:

    ∂L/∂θ  = c/θ − ℓ/(1 − θ)       = 0   ⇒   θ  = c/(c + ℓ)
    ∂L/∂θ₁ = r_c/θ₁ − g_c/(1 − θ₁) = 0   ⇒   θ₁ = r_c/(r_c + g_c)
    ∂L/∂θ₂ = r_ℓ/θ₂ − g_ℓ/(1 − θ₂) = 0   ⇒   θ₂ = r_ℓ/(r_ℓ + g_ℓ) .

The solution for θ is the same as before. The solution for θ₁, the probability that a cherry candy has a red wrapper, is the observed fraction of cherry candies with red wrappers, and similarly for θ₂. These results are very comforting, and it is easy to see that they can be extended to any Bayesian network whose conditional probabilities are represented as tables. The most important point is that, with complete data, the maximum-likelihood parameter learning problem for a Bayesian network decomposes into separate learning problems, one for each parameter.³ The second point is that the parameter values for a variable, given its parents, are just the observed frequencies of the variable values for each setting of the parent values. As before, we must be careful to avoid zeroes when the data set is small.

Naive Bayes models

Probably the most common Bayesian network model used in machine learning is the naive Bayes model. In this model, the "class" variable C (which is to be predicted) is the root and the "attribute" variables X_i are the leaves. The model is "naive" because it assumes that the attributes are conditionally independent of each other, given the class. (The model in Figure 20.2(b) is a naive Bayes model with just one attribute.) Assuming Boolean variables, the parameters are

    θ = P(C = true),   θ_{i1} = P(X_i = true | C = true),   θ_{i2} = P(X_i = true | C = false) .

The maximum-likelihood parameter values are found in exactly the same way as for Figure 20.2(b). Once the model has been trained in this way, it can be used to classify new examples for which the class variable C is unobserved. With observed attribute values x₁, ..., x_n, the probability of each class is given by

    P(C | x₁, ..., x_n) = α P(C) ∏_i P(x_i | C) .

A deterministic prediction can be obtained by choosing the most likely class. Figure 20.3 shows the learning curve for this method when it is applied to the restaurant problem from Chapter 18. The method learns fairly well but not as well as decision-tree learning; this is presumably because the true hypothesis (which is a decision tree) is not representable exactly using a naive Bayes model. Naive Bayes learning turns out to do surprisingly well in a wide range of applications; the boosted version (Exercise 20.5) is one of the most effective general-purpose learning algorithms. Naive Bayes learning scales well to very large problems: with n Boolean attributes, there are just 2n + 1 parameters, and no search is required to find h_ML, the maximum-likelihood naive Bayes hypothesis. Finally, naive Bayes learning has no difficulty with noisy data and can give probabilistic predictions when appropriate.

³ See Exercise 20.7 for the nontabulated case, where each parameter affects several conditional probabilities.
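As a concrete illustration (a sketch, not code from the book), the following trains a Boolean naive Bayes model from observed frequencies, starting the attribute counts at 1 to avoid the zero-probability problem mentioned earlier, and then classifies a new example with the product formula above. The toy data set and attribute names are invented.

    from collections import Counter

    def train_naive_bayes(examples):
        """Estimate Boolean naive Bayes parameters from (class, {attr: bool}) pairs.

        Attribute counts start at 1 (the "initialize counts to 1" trick), so no
        conditional probability is exactly 0 or 1.
        Returns (P(C=True), {attr: (P(attr=True|C=True), P(attr=True|C=False))}).
        """
        theta = sum(1 for c, _ in examples if c) / len(examples)
        params = {}
        for attr in examples[0][1]:
            counts = {True: Counter(), False: Counter()}
            for c, x in examples:
                counts[c][x[attr]] += 1
            params[attr] = tuple(
                (counts[c][True] + 1) / (counts[c][True] + counts[c][False] + 2)
                for c in (True, False))
        return theta, params

    def classify(theta, params, x):
        """Return P(C=True | x) by normalizing the two unnormalized products."""
        scores = {}
        for c, prior in ((True, theta), (False, 1 - theta)):
            p = prior
            for attr, value in x.items():
                p_true = params[attr][0] if c else params[attr][1]
                p *= p_true if value else (1 - p_true)
            scores[c] = p
        return scores[True] / (scores[True] + scores[False])

    # Invented toy data: class = "cherry bag", attributes = wrapper colour, shape.
    data = [(True, {"red": True, "round": True}),
            (True, {"red": True, "round": False}),
            (False, {"red": False, "round": True}),
            (False, {"red": False, "round": False})]
    theta, params = train_naive_bayes(data)
    print(classify(theta, params, {"red": True, "round": True}))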
Figure 20.3  The learning curve for naive Bayes learning applied to the restaurant problem from Chapter 18; the learning curve for decision-tree learning is shown for comparison.

Maximum-likelihood parameter learning: Continuous models

Continuous probability models such as the linear-Gaussian model were introduced in Section 14.3. Because continuous variables are ubiquitous in real-world applications, it is important to know how to learn continuous models from data. The principles for maximum-likelihood learning are identical to those of the discrete case. Let us begin with a very simple case: learning the parameters of a Gaussian density function on a single variable. That is, the data are generated as follows:

    P(x) = (1/(√(2π) σ)) e^{−(x−µ)²/(2σ²)} .

The parameters of this model are the mean µ and the standard deviation σ. (Notice that the normalizing "constant" depends on σ, so we cannot ignore it.) Let the observed values be x₁, ..., x_N. Then the log likelihood is

    L = Σ_{j=1}^{N} log [ (1/(√(2π) σ)) e^{−(x_j−µ)²/(2σ²)} ] = N(−log √(2π) − log σ) − Σ_{j=1}^{N} (x_j − µ)²/(2σ²) .

Setting the derivatives to zero as usual, we obtain

    ∂L/∂µ = −(1/σ²) Σ_{j=1}^{N} (x_j − µ) = 0          ⇒   µ = (Σ_j x_j)/N
    ∂L/∂σ = −N/σ + (1/σ³) Σ_{j=1}^{N} (x_j − µ)² = 0   ⇒   σ = √( (Σ_j (x_j − µ)²)/N ) .        (20.4)

That is, the maximum-likelihood value of the mean is the sample average and the maximum-likelihood value of the standard deviation is the square root of the sample variance. Again, these are comforting results that confirm "commonsense" practice.
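As a quick numerical check of Equation (20.4), assuming NumPy is available (the data are simulated, not from the book):

    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(loc=2.0, scale=0.5, size=1000)   # simulated observations

    # Maximum-likelihood estimates from Equation (20.4):
    mu_ml = data.sum() / len(data)                                 # sample average
    sigma_ml = np.sqrt(((data - mu_ml) ** 2).sum() / len(data))    # sqrt of sample variance

    print(mu_ml, sigma_ml)   # close to the true values 2.0 and 0.5
    # Note: this is the ML estimate (divide by N), not the unbiased one (divide by N - 1);
    # np.std(data) with its default ddof=0 computes the same ML quantity.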
Figure 20.4  (a) A linear Gaussian model described as y = θ₁x + θ₂ plus Gaussian noise with fixed variance. (b) A set of 50 data points generated from this model.

Now consider a linear Gaussian model with one continuous parent X and a continuous child Y. As explained on page 502, Y has a Gaussian distribution whose mean depends linearly on the value of X and whose standard deviation is fixed. To learn the conditional distribution P(Y | X), we can maximize the conditional likelihood

    P(y | x) = (1/(√(2π) σ)) e^{−(y − (θ₁x + θ₂))²/(2σ²)} .        (20.5)

Here, the parameters are θ₁, θ₂, and σ. The data are a collection of (x_j, y_j) pairs, as illustrated in Figure 20.4. Using the usual methods (Exercise 20.6), we can find the maximum-likelihood values of the parameters. Here, we want to make a different point. If we consider just the parameters θ₁ and θ₂ that define the linear relationship between x and y, it becomes clear that maximizing the log likelihood with respect to these parameters is the same as minimizing the numerator in the exponent of Equation (20.5):

    E = Σ_{j=1}^{N} (y_j − (θ₁x_j + θ₂))² .

The quantity (y_j − (θ₁x_j + θ₂)) is the error for (x_j, y_j), that is, the difference between the actual value y_j and the predicted value (θ₁x_j + θ₂), so E is the well-known sum of squared errors. This is the quantity that is minimized by the standard linear regression procedure. Now we can understand why: minimizing the sum of squared errors gives the maximum-likelihood straight-line model, provided that the data are generated with Gaussian noise of fixed variance.
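To see this equivalence in practice, the following sketch fits θ₁ and θ₂ by ordinary least squares and then recovers the maximum-likelihood σ from the residuals; the simulated data and parameter values are invented for illustration, and NumPy is assumed to be available.

    import numpy as np

    rng = np.random.default_rng(1)
    theta1_true, theta2_true, sigma = 0.8, 0.1, 0.05
    x = rng.uniform(0.0, 1.0, size=50)
    y = theta1_true * x + theta2_true + rng.normal(0.0, sigma, size=50)

    # Least-squares solution of y ~ theta1 * x + theta2, i.e. the minimizer of
    # E = sum_j (y_j - (theta1 x_j + theta2))^2, which is also the ML estimate
    # when the noise is Gaussian with fixed variance.
    A = np.column_stack([x, np.ones_like(x)])
    (theta1_hat, theta2_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

    # The ML estimate of sigma comes from the residuals (divide by N, as in Eq. 20.4).
    residuals = y - (theta1_hat * x + theta2_hat)
    sigma_hat = np.sqrt((residuals ** 2).mean())

    print(theta1_hat, theta2_hat, sigma_hat)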
Bayesian parameter learning

Maximum-likelihood learning gives rise to some very simple procedures, but it has some serious deficiencies with small data sets. For example, after seeing one cherry candy, the maximum-likelihood hypothesis is that the bag is 100% cherry (i.e., θ = 1.0). Unless one's hypothesis prior is that bags must be either all cherry or all lime, this is not a reasonable conclusion. The Bayesian approach to parameter learning places a hypothesis prior over the possible values of the parameters and updates this distribution as data arrive.

Figure 20.5  Examples of the beta[a, b] distribution for different values of [a, b].

The candy example in Figure 20.2(a) has one parameter, θ: the probability that a randomly selected piece of candy is cherry flavored. In the Bayesian view, θ is the (unknown) value of a random variable Θ; the hypothesis prior is just the prior distribution P(Θ). Thus, P(Θ = θ) is the prior probability that the bag has a fraction θ of cherry candies. If the parameter θ can be any value between 0 and 1, then P(Θ) must be a continuous distribution that is nonzero only between 0 and 1 and that integrates to 1. The uniform density P(θ) = U[0, 1](θ) is one candidate. (See Chapter 13.) It turns out that the uniform density is a member of the family of beta distributions. Each beta distribution is defined by two hyperparameters⁴ a and b such that

    beta[a, b](θ) = α θ^{a−1} (1 − θ)^{b−1} ,        (20.6)

for θ in the range [0, 1]. The normalization constant α depends on a and b. (See Exercise 20.8.) Figure 20.5 shows what the distribution looks like for various values of a and b. The mean value of the distribution is a/(a + b), so larger values of a suggest a belief that Θ is closer to 1 than to 0. Larger values of a + b make the distribution more peaked, suggesting greater certainty about the value of Θ. Thus, the beta family provides a useful range of possibilities for the hypothesis prior.

Besides its flexibility, the beta family has another wonderful property: if Θ has a prior beta[a, b], then, after a data point is observed, the posterior distribution for Θ is also a beta distribution. The beta family is called the conjugate prior for the family of distributions for a Boolean variable.⁵ Let's see how this works. Suppose we observe a cherry candy; then

    P(θ | D₁ = cherry) = α P(D₁ = cherry | θ) P(θ)
        = α′ θ · beta[a, b](θ) = α′ θ · θ^{a−1} (1 − θ)^{b−1}
        = α′ θ^{a} (1 − θ)^{b−1} = beta[a + 1, b](θ) .

⁴ They are called hyperparameters because they parameterize a distribution over θ, which is itself a parameter.
⁵ Other conjugate priors include the Dirichlet family for the parameters of a discrete multivalued distribution and the Normal–Wishart family for the parameters of a Gaussian distribution. See Bernardo and Smith (1994).
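In code, the Bayesian update derived above is therefore just hyperparameter bookkeeping. A minimal sketch, with an invented observation sequence:

    def update_beta(a, b, flavor):
        """Posterior beta hyperparameters after observing one candy.

        A cherry observation multiplies the density by theta (a -> a + 1);
        a lime observation multiplies it by (1 - theta) (b -> b + 1).
        """
        return (a + 1, b) if flavor == "cherry" else (a, b + 1)

    a, b = 1, 1                      # beta[1, 1] is the uniform prior
    for flavor in ["cherry", "cherry", "lime", "cherry"]:
        a, b = update_beta(a, b, flavor)

    print(a, b)                      # beta[4, 2]
    print(a / (a + b))               # posterior mean estimate of theta = 2/3

Starting from the uniform prior beta[1, 1], the posterior mean a/(a + b) never collapses to 0 or 1 the way the maximum-likelihood estimate can after a single observation.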