1. E-step: Compute the probabilities p_ij = P(C = i | x_j), the probability that datum x_j was generated by component i. By Bayes' rule, we have p_ij = α P(x_j | C = i) P(C = i). The term P(x_j | C = i) is just the probability at x_j of the ith Gaussian, and the term P(C = i) is just the weight parameter for the ith Gaussian. Define p_i = ∑_j p_ij.

2. M-step: Compute the new mean, covariance, and component weights as follows:

   μ_i ← ∑_j p_ij x_j / p_i
   Σ_i ← ∑_j p_ij (x_j − μ_i)(x_j − μ_i)^⊤ / p_i
   w_i ← p_i / N

   where N is the total number of data points.

The E-step, or expectation step, can be viewed as computing the expected values p_ij of the hidden indicator variables Z_ij, where Z_ij is 1 if datum x_j was generated by the ith component and 0 otherwise. The M-step, or maximization step, finds the new values of the parameters that maximize the log likelihood of the data, given the expected values of the hidden indicator variables. (A short code sketch of this update loop appears below.)

The final model that EM learns when it is applied to the data in Figure 20.8(a) is shown in Figure 20.8(c); it is virtually indistinguishable from the original model from which the data were generated. Figure 20.9(a) plots the log likelihood of the data according to the current model as EM progresses. There are two points to notice. First, the log likelihood for the final learned model slightly exceeds that of the original model, from which the data were generated. This might seem surprising, but it simply reflects the fact that the data were generated randomly and might not provide an exact reflection of the underlying model. The second point is that EM increases the log likelihood of the data at every iteration. This fact can be proved in general. Furthermore, under certain conditions, EM can be proven to reach a local maximum in likelihood. (In rare cases, it could reach a saddle point or even a local minimum.) In this sense, EM resembles a gradient-based hill-climbing algorithm, but notice that it has no “step size” parameter!

Things do not always go as well as Figure 20.9(a) might suggest. It can happen, for example, that one Gaussian component shrinks so that it covers just a single data point. Then its variance will go to zero and its likelihood will go to infinity! Another problem is that two components can “merge,” acquiring identical means and variances and sharing their data points. These kinds of degenerate local maxima are serious problems, especially in high dimensions. One solution is to place priors on the model parameters and to apply the MAP version of EM. Another is to restart a component with new random parameters if it gets too small or too close to another component. It also helps to initialize the parameters with reasonable values.

Learning Bayesian networks with hidden variables

To learn a Bayesian network with hidden variables, we apply the same insights that worked for mixtures of Gaussians. Figure 20.10 represents a situation in which there are two bags of candies that have been mixed together. Candies are described by three features: in addition to the Flavor and the Wrapper, some candies have a Hole in the middle and some do not.
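Before working through the candy example, here is a minimal sketch, in Python with NumPy and SciPy, of the Gaussian-mixture EM loop described above. The function name em_gaussian_mixture, the random initialization, the fixed iteration count, and the small ridge added to each covariance are illustrative assumptions rather than details from the text; a real implementation would iterate until the log likelihood stops improving and would add the safeguards against degenerate components discussed above.

import numpy as np
from scipy.stats import multivariate_normal

def em_gaussian_mixture(X, k, n_iters=100, seed=0):
    # X: (N, d) array of data points; k: number of Gaussian components.
    # Returns component weights w, means mu, and covariances Sigma.
    rng = np.random.default_rng(seed)
    N, d = X.shape
    # Illustrative initialization: k distinct data points as means,
    # identity covariances, uniform component weights.
    mu = X[rng.choice(N, size=k, replace=False)].astype(float)
    Sigma = np.array([np.eye(d) for _ in range(k)])
    w = np.full(k, 1.0 / k)
    for _ in range(n_iters):
        # E-step: p[i, j] = P(C = i | x_j) by Bayes' rule (normalizing over i plays the role of alpha).
        p = np.array([w[i] * multivariate_normal.pdf(X, mu[i], Sigma[i])
                      for i in range(k)])        # shape (k, N), unnormalized
        p /= p.sum(axis=0, keepdims=True)
        p_i = p.sum(axis=1)                      # p_i = sum over j of p_ij
        # M-step: re-estimate means, covariances, and component weights.
        for i in range(k):
            mu[i] = p[i] @ X / p_i[i]
            diff = X - mu[i]
            Sigma[i] = (p[i, :, None] * diff).T @ diff / p_i[i]
            Sigma[i] += 1e-6 * np.eye(d)         # tiny ridge guards against collapse
        w = p_i / N
    return w, mu, Sigma

Run on data like that in Figure 20.8(a) with k = 3, a loop of this kind should converge to a model close to the one in Figure 20.8(c), with the log likelihood rising at every iteration as in Figure 20.9(a).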
Figure 20.9 Graphs showing the log-likelihood of the data, L, as a function of the EM iteration. The horizontal line shows the log-likelihood according to the true model. (a) Graph for the Gaussian mixture model in Figure 20.8. (b) Graph for the Bayesian network in Figure 20.10(a).

Figure 20.10 (a) A mixture model for candy. The proportions of different flavors, wrappers, and numbers of holes depend on the bag, which is not observed. (b) Bayesian network for a Gaussian mixture. The mean and covariance of the observable variables X depend on the component C.

The distribution of candies in each bag is described by a naive Bayes model: the features are independent, given the bag, but the conditional probability distribution for each feature depends on the bag. The parameters are as follows: θ is the prior probability that a candy comes from Bag 1; θ_F1 and θ_F2 are the probabilities that the flavor is cherry, given that the candy comes from Bag 1 and Bag 2 respectively; θ_W1 and θ_W2 give the probabilities that the wrapper is red; and θ_H1 and θ_H2 give the probabilities that the candy has a hole. Notice that