Thus, after seeing a cherry candy, we simply increment the a parameter to get the posterior; similarly, after seeing a lime candy, we increment the b parameter. Thus, we can view the a and b hyperparameters as virtual counts, in the sense that a prior beta[a, b] behaves exactly as if we had started out with a uniform prior beta[1, 1] and seen a − 1 actual cherry candies and b − 1 actual lime candies. By examining a sequence of beta distributions for increasing values of a and b, keeping the proportions fixed, we can see vividly how the posterior distribution over the parameter Θ changes as data arrive. For example, suppose the actual bag of candy is 75% cherry. Figure 20.5(b) shows the sequence beta[3, 1], beta[6, 2], beta[30, 10]. Clearly, the distribution is converging to a narrow peak around the true value of Θ. For large data sets, then, Bayesian learning (at least in this case) converges to give the same results as maximum-likelihood learning.

The network in Figure 20.2(b) has three parameters, θ, θ_1, and θ_2, where θ_1 is the probability of a red wrapper on a cherry candy and θ_2 is the probability of a red wrapper on a lime candy. The Bayesian hypothesis prior must cover all three parameters; that is, we need to specify P(Θ, Θ_1, Θ_2). Usually, we assume parameter independence:

P(Θ, Θ_1, Θ_2) = P(Θ) P(Θ_1) P(Θ_2) .

With this assumption, each parameter can have its own beta distribution that is updated separately as data arrive.

Once we have the idea that unknown parameters can be represented by random variables such as Θ, it is natural to incorporate them into the Bayesian network itself. To do this, we also need to make copies of the variables describing each instance. For example, if we have observed three candies then we need Flavor_1, Flavor_2, Flavor_3 and Wrapper_1, Wrapper_2, Wrapper_3. The parameter variable Θ determines the probability of each Flavor_i variable:

P(Flavor_i = cherry | Θ = θ) = θ .

Similarly, the wrapper probabilities depend on Θ_1 and Θ_2. For example,

P(Wrapper_i = red | Flavor_i = cherry, Θ_1 = θ_1) = θ_1 .

Now, the entire Bayesian learning process can be formulated as an inference problem in a suitably constructed Bayes net, as shown in Figure 20.6. Prediction for a new instance is done simply by adding new instance variables to the network, some of which are queried. This formulation of learning and prediction makes it clear that Bayesian learning requires no extra "principles of learning." Furthermore, there is, in essence, just one learning algorithm: the inference algorithm for Bayesian networks.

Figure 20.6 A Bayesian network that corresponds to a Bayesian learning process. Posterior distributions for the parameter variables Θ, Θ_1, and Θ_2 can be inferred from their prior distributions and the evidence in the Flavor_i and Wrapper_i variables.
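The virtual-count view is easy to state in code. The following Python sketch maintains an independent beta posterior for each of Θ, Θ_1, and Θ_2, starting from uniform beta[1, 1] priors, and updates each one separately from a small set of (flavor, wrapper) observations; the candy data here are made up purely for illustration.

```python
# A minimal sketch of Bayesian parameter learning with beta priors,
# assuming parameter independence: Theta, Theta1, and Theta2 each get
# their own beta[a, b] posterior, updated separately as data arrive.
# The observations below are invented for illustration.

def update(prior, successes, failures):
    """beta[a, b] prior plus counts gives beta[a + successes, b + failures]."""
    a, b = prior
    return (a + successes, b + failures)

def mean(beta):
    """Posterior mean of a beta[a, b] distribution, used for prediction."""
    a, b = beta
    return a / (a + b)

# Uniform beta[1, 1] priors: no virtual counts yet.
theta  = (1, 1)   # P(Flavor = cherry)
theta1 = (1, 1)   # P(Wrapper = red | Flavor = cherry)
theta2 = (1, 1)   # P(Wrapper = red | Flavor = lime)

# Hypothetical observed candies: (flavor, wrapper) pairs.
data = [("cherry", "red"), ("cherry", "red"), ("lime", "green"),
        ("cherry", "green"), ("lime", "red")]

cherry_wrappers = [w for f, w in data if f == "cherry"]
lime_wrappers   = [w for f, w in data if f == "lime"]

theta  = update(theta,  len(cherry_wrappers), len(lime_wrappers))
theta1 = update(theta1, cherry_wrappers.count("red"), cherry_wrappers.count("green"))
theta2 = update(theta2, lime_wrappers.count("red"),   lime_wrappers.count("green"))

# Predictive probability that the next candy is a cherry in a red wrapper,
# obtained by integrating out the independent parameters (posterior means).
p_cherry_red = mean(theta) * mean(theta1)
print(theta, theta1, theta2, round(p_cherry_red, 3))
```

Because the parameters remain independent given complete data, the prediction for a new candy is just a product of posterior means, which is exactly the query one would pose to the network of Figure 20.6.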
Learning Bayes net structures

So far, we have assumed that the structure of the Bayes net is given and we are just trying to learn the parameters. The structure of the network represents basic causal knowledge about the domain that is often easy for an expert, or even a naive user, to supply. In some cases, however, the causal model may be unavailable or subject to dispute (certain corporations, for example, have long claimed that smoking does not cause cancer), so it is important to understand how the structure of a Bayes net can be learned from data. At present, structural learning algorithms are in their infancy, so we will give only a brief sketch of the main ideas.

The most obvious approach is to search for a good model. We can start with a model containing no links and begin adding parents for each node, fitting the parameters with the methods we have just covered and measuring the accuracy of the resulting model. Alternatively, we can start with an initial guess at the structure and use hill-climbing or simulated annealing search to make modifications, retuning the parameters after each change in the structure. Modifications can include reversing, adding, or deleting arcs. We must not introduce cycles in the process, so many algorithms assume that an ordering is given for the variables and that a node can have parents only among those nodes that come earlier in the ordering (just as in the construction process of Chapter 14). For full generality, we also need to search over possible orderings.

There are two alternative methods for deciding when a good structure has been found. The first is to test whether the conditional independence assertions implicit in the structure are actually satisfied in the data. For example, the use of a naive Bayes model for the restaurant problem assumes that

P(Fri/Sat, Bar | WillWait) = P(Fri/Sat | WillWait) P(Bar | WillWait)

and we can check in the data that the same equation holds between the corresponding conditional frequencies. Now, even if the structure describes the true causal nature of the domain, statistical fluctuations in the data set mean that the equation will never be satisfied exactly, so we need to perform a suitable statistical test to see if there is sufficient evidence that the independence hypothesis is violated.
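The following sketch illustrates this kind of check on a small, made-up restaurant data set. For each value of WillWait it compares the observed joint counts of Fri/Sat and Bar with the counts expected under conditional independence, accumulating a chi-square-style statistic (one degree of freedom per stratum) that a real system would compare against a significance threshold.

```python
# A minimal sketch of testing a conditional independence assertion,
# assuming an invented data set of (FriSat, Bar, WillWait) examples.
# For each value of WillWait, compare the observed joint counts of
# (FriSat, Bar) with the counts predicted by independence, and sum
# the resulting chi-square contributions across the two strata.

from collections import Counter
from itertools import product

# Hypothetical restaurant examples: (FriSat, Bar, WillWait).
examples = [
    (True, True, True), (True, False, True), (False, True, False),
    (False, False, True), (True, True, False), (False, False, False),
    (True, False, True), (False, True, True), (False, False, True),
    (True, True, True),
]

def conditional_chi_square(examples):
    stat = 0.0
    for w in (True, False):                       # condition on WillWait = w
        stratum = [(f, b) for f, b, will in examples if will == w]
        n = len(stratum)
        if n == 0:
            continue
        joint  = Counter(stratum)                 # observed joint counts
        f_marg = Counter(f for f, _ in stratum)   # marginal counts of FriSat
        b_marg = Counter(b for _, b in stratum)   # marginal counts of Bar
        for f, b in product((True, False), repeat=2):
            expected = f_marg[f] * b_marg[b] / n  # independence prediction
            observed = joint[(f, b)]
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

print(conditional_chi_square(examples))
```

A large value of the statistic relative to the chi-square threshold would count as evidence that the independence assumed by the naive Bayes structure is violated.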
The complexity of the resulting network will depend on the threshold used for this test: the stricter the independence test, the more links will be added and the greater the danger of overfitting.

An approach more consistent with the ideas in this chapter is to assess the degree to which the proposed model explains the data (in a probabilistic sense). We must be careful how we measure this, however. If we just try to find the maximum-likelihood hypothesis, we will end up with a fully connected network, because adding more parents to a node cannot decrease the likelihood (Exercise 20.9). We are forced to penalize model complexity in some way. The MAP (or MDL) approach simply subtracts a penalty from the likelihood of each structure (after parameter tuning) before comparing different structures. The Bayesian approach places a joint prior over structures and parameters. There are usually far too many structures to sum over (superexponential in the number of variables), so most practitioners use MCMC to sample over structures.

Penalizing complexity (whether by MAP or Bayesian methods) introduces an important connection between the optimal structure and the nature of the representation for the conditional distributions in the network. With tabular distributions, the complexity penalty for a node's distribution grows exponentially with the number of parents, but with, say, noisy-OR distributions, it grows only linearly. This means that learning with noisy-OR (or other compactly parameterized) models tends to produce learned structures with more parents than does learning with tabular distributions.

20.3 LEARNING WITH HIDDEN VARIABLES: THE EM ALGORITHM

The preceding section dealt with the fully observable case. Many real-world problems have hidden variables (sometimes called latent variables), which are not observable in the data that are available for learning. For example, medical records often include the observed symptoms, the treatment applied, and perhaps the outcome of the treatment, but they seldom contain a direct observation of the disease itself!⁶ One might ask, "If the disease is not observed, why not construct a model without it?" The answer appears in Figure 20.7, which shows a small, fictitious diagnostic model for heart disease. There are three observable predisposing factors and three observable symptoms (which are too depressing to name). Assume that each variable has three possible values (e.g., none, moderate, and severe). Removing the hidden variable from the network in (a) yields the network in (b); the total number of parameters increases from 78 to 708. Thus, latent variables can dramatically reduce the number of parameters required to specify a Bayesian network. This, in turn, can dramatically reduce the amount of data needed to learn the parameters.

Figure 20.7 (a) A simple diagnostic network for heart disease, which is assumed to be a hidden variable. Each variable has three possible values and is labeled with the number of independent parameters in its conditional distribution; the total number is 78. (b) The equivalent network with HeartDisease removed. Note that the symptom variables are no longer conditionally independent given their parents. This network requires 708 parameters.

Hidden variables are important, but they do complicate the learning problem. In Figure 20.7(a), for example, it is not obvious how to learn the conditional distribution for HeartDisease, given its parents, because we do not know the value of HeartDisease in each case; the same problem arises in learning the distributions for the symptoms.

⁶ Some records contain the diagnosis suggested by the physician, but this is a causal consequence of the symptoms, which are in turn caused by the disease.
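The parameter counts in Figure 20.7 can be reproduced by a simple calculation: a node with r values and parents of arities r_1, ..., r_m contributes (r − 1) · r_1 · ... · r_m independent parameters. The sketch below assumes the arcs among the symptoms in network (b) follow the per-node counts (54, 162, 486) shown in the figure; other arc directions consistent with those counts would give the same total.

```python
# A small sketch that reproduces the parameter counts quoted for the
# two structures in Figure 20.7: a node with r values and parents with
# arities r_1..r_m needs (r - 1) * r_1 * ... * r_m independent parameters.

from math import prod

def num_parameters(parents, arity):
    total = 0
    for node, pa in parents.items():
        total += (arity[node] - 1) * prod(arity[p] for p in pa)
    return total

# Every variable has three possible values (none, moderate, severe).
arity = {v: 3 for v in ["Smoking", "Diet", "Exercise", "HeartDisease",
                        "Symptom1", "Symptom2", "Symptom3"]}

# (a) Network with the hidden HeartDisease variable.
with_hidden = {
    "Smoking": [], "Diet": [], "Exercise": [],
    "HeartDisease": ["Smoking", "Diet", "Exercise"],
    "Symptom1": ["HeartDisease"],
    "Symptom2": ["HeartDisease"],
    "Symptom3": ["HeartDisease"],
}

# (b) Equivalent network with HeartDisease removed; each symptom now
# depends on the predisposing factors and on the earlier symptoms
# (arcs inferred from the per-node counts 54, 162, 486 in the figure).
without_hidden = {
    "Smoking": [], "Diet": [], "Exercise": [],
    "Symptom1": ["Smoking", "Diet", "Exercise"],
    "Symptom2": ["Smoking", "Diet", "Exercise", "Symptom1"],
    "Symptom3": ["Smoking", "Diet", "Exercise", "Symptom1", "Symptom2"],
}

print(num_parameters(with_hidden, arity))     # 78
print(num_parameters(without_hidden, arity))  # 708
```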
This section describes an algorithm called expectation–maximization, or EM, that solves this problem in a very general way. We will show three examples and then provide a general description. The algorithm seems like magic at first, but once the intuition has been developed, one can find applications for EM in a huge range of learning problems.

Unsupervised clustering: Learning mixtures of Gaussians

Unsupervised clustering is the problem of discerning multiple categories in a collection of objects. The problem is unsupervised because the category labels are not given. For example, suppose we record the spectra of a hundred thousand stars; are there different types of stars revealed by the spectra, and, if so, how many and what are their characteristics? We are all familiar with terms such as "red giant" and "white dwarf," but the stars do not carry these labels on their hats; astronomers had to perform unsupervised clustering to identify these categories. Other examples include the identification of species, genera, orders, and so on in the Linnæan taxonomy of organisms and the creation of natural kinds to categorize ordinary objects (see Chapter 10).

Unsupervised clustering begins with data. Figure 20.8(a) shows 500 data points, each of which specifies the values of two continuous attributes. The data points might correspond to stars, and the attributes might correspond to spectral intensities at two particular frequencies.

Figure 20.8 (a) 500 data points in two dimensions, suggesting the presence of three clusters. (b) A Gaussian mixture model with three components; the weights (left-to-right) are 0.2, 0.3, and 0.5. The data in (a) were generated from this model. (c) The model reconstructed by EM from the data in (a).

Next, we need to understand what kind of probability distribution might have generated the data. Clustering presumes that the data are generated from a mixture distribution, P. Such a distribution has k components, each of which is a distribution in its own right. A data point is generated by first choosing a component and then generating a sample from that component. Let the random variable C denote the component, with values 1, . . . , k; then the mixture distribution is given by

P(x) = Σ_{i=1}^{k} P(C = i) P(x | C = i) ,

where x refers to the values of the attributes for a data point.
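The generative process just described (choose a component, then sample from it) can be written down directly. The sketch below draws 500 two-dimensional points from a three-component Gaussian mixture with the weights 0.2, 0.3, and 0.5 quoted in the Figure 20.8 caption; the means and covariances are placeholders, since the figure does not specify them.

```python
# A minimal sketch of sampling from a mixture distribution: first choose
# a component C according to the weights P(C = i), then sample the data
# point from that component's distribution. The weights 0.2, 0.3, 0.5
# match the Figure 20.8 caption; the Gaussian means and covariances are
# invented for illustration.

import numpy as np

rng = np.random.default_rng(0)

weights = [0.2, 0.3, 0.5]                       # P(C = i)
means = [np.array([0.3, 0.7]),                  # placeholder component means
         np.array([0.6, 0.3]),
         np.array([0.8, 0.8])]
covs = [0.01 * np.eye(2) for _ in weights]      # placeholder covariances

def sample_mixture(n):
    points = []
    for _ in range(n):
        c = rng.choice(len(weights), p=weights)             # choose component
        points.append(rng.multivariate_normal(means[c], covs[c]))
    return np.array(points)

data = sample_mixture(500)    # 500 points, as in Figure 20.8(a)
print(data.shape)             # (500, 2)
```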
For continuous data, a natural choice for the component distributions is the multivariate Gaussian, which gives the so-called mixture of Gaussians family of distributions. The parameters of a mixture of Gaussians are w_i = P(C = i) (the weight of each component), µ_i (the mean of each component), and Σ_i (the covariance of each component). Figure 20.8(b) shows a mixture of three Gaussians; this mixture is in fact the source of the data in (a).

The unsupervised clustering problem, then, is to recover a mixture model like the one in Figure 20.8(b) from raw data like that in Figure 20.8(a). Clearly, if we knew which component generated each data point, then it would be easy to recover the component Gaussians: we could just select all the data points from a given component and then apply (a multivariate version of) Equation (20.4) for fitting the parameters of a Gaussian to a set of data. On the other hand, if we knew the parameters of each component, then we could, at least in a probabilistic sense, assign each data point to a component. The problem is that we know neither the assignments nor the parameters.

The basic idea of EM in this context is to pretend that we know the parameters of the model and then to infer the probability that each data point belongs to each component. After that, we refit the components to the data, where each component is fitted to the entire data set with each point weighted by the probability that it belongs to that component. The process iterates until convergence. Essentially, we are "completing" the data by inferring probability distributions over the hidden variables (which component each data point belongs to) based on the current model. For the mixture of Gaussians, we initialize the mixture model parameters arbitrarily and then iterate the following two steps: