Figure 1.3  The error function (1.2) corresponds to (one half of) the sum of the squares of the displacements (shown by the vertical green bars) of each data point from the function y(x, w).

The error function is a nonnegative quantity that would be zero if, and only if, the function y(x, w) were to pass exactly through each training data point. The geometrical interpretation of the sum-of-squares error function is illustrated in Figure 1.3. We can solve the curve fitting problem by choosing the value of w for which E(w) is as small as possible. Because the error function is a quadratic function of the coefficients w, its derivatives with respect to the coefficients will be linear in the elements of w, and so the minimization of the error function has a unique solution, denoted by w⋆, which can be found in closed form (Exercise 1.1). The resulting polynomial is given by the function y(x, w⋆).

There remains the problem of choosing the order M of the polynomial, and as we shall see this will turn out to be an example of an important concept called model comparison or model selection. In Figure 1.4, we show four examples of the results of fitting polynomials having orders M = 0, 1, 3, and 9 to the data set shown in Figure 1.2.

We notice that the constant (M = 0) and first order (M = 1) polynomials give rather poor fits to the data and consequently rather poor representations of the function sin(2πx). The third order (M = 3) polynomial seems to give the best fit to the function sin(2πx) of the examples shown in Figure 1.4. When we go to a much higher order polynomial (M = 9), we obtain an excellent fit to the training data. In fact, the polynomial passes exactly through each data point and E(w⋆) = 0. However, the fitted curve oscillates wildly and gives a very poor representation of the function sin(2πx). This latter behaviour is known as over-fitting.

As we have noted earlier, the goal is to achieve good generalization by making accurate predictions for new data. We can obtain some quantitative insight into the dependence of the generalization performance on M by considering a separate test set comprising 100 data points generated using exactly the same procedure used to generate the training set points but with new choices for the random noise values included in the target values. For each choice of M, we can then evaluate the residual value of E(w⋆) given by (1.2) for the training data, and we can also evaluate E(w⋆) for the test data set.
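As a concrete illustration of this closed-form solution, the following sketch (not from the book) fits a polynomial by solving the normal equations with NumPy. The synthetic data, the noise level of 0.3, and the random seed are illustrative assumptions standing in for the data set of Figure 1.2.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
# Assumed data-generating process: sin(2*pi*x) plus Gaussian noise.
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Design matrix whose columns are the powers x^0, x^1, ..., x^M.
Phi = np.vander(x, M + 1, increasing=True)

# Because E(w) is quadratic, setting its gradient to zero gives the linear
# normal equations (Phi^T Phi) w* = Phi^T t, solved here directly.
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

y = Phi @ w_star                      # fitted values y(x_n, w*)
E = 0.5 * np.sum((y - t) ** 2)        # sum-of-squares error (1.2) at its minimum
print(w_star, E)
```

The point of the sketch is simply that a single call to a linear solver suffices, precisely because the derivatives of E(w) are linear in the elements of w.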
It is sometimes more convenient to use the root-mean-square (RMS) error defined by

    ERMS = √(2E(w⋆)/N)    (1.3)

in which the division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that ERMS is measured on the same scale (and in the same units) as the target variable t.

Figure 1.4  Plots of polynomials having various orders M, shown as red curves, fitted to the data set shown in Figure 1.2.

Graphs of the training and test set RMS errors are shown, for various values of M, in Figure 1.5. The test set error is a measure of how well we are doing in predicting the values of t for new data observations of x.

We note from Figure 1.5 that small values of M give relatively large values of the test set error, and this can be attributed to the fact that the corresponding polynomials are rather inflexible and are incapable of capturing the oscillations in the function sin(2πx). Values of M in the range 3 ≤ M ≤ 8 give small values for the test set error, and these also give reasonable representations of the generating function sin(2πx), as can be seen, for the case of M = 3, from Figure 1.4.
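A minimal sketch of the train/test comparison behind Figure 1.5, under the same assumed synthetic-data setup as before. The helper names, the seeds, and the use of numpy.linalg.lstsq (for numerical stability at high M) are choices of this sketch, not specifications from the text.

```python
import numpy as np

def fit(x, t, M):
    # Least-squares polynomial fit of order M to the points (x_n, t_n).
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.lstsq(Phi, t, rcond=None)[0]

def e_rms(x, t, w):
    # Root-mean-square error (1.3): sqrt(2 E(w) / N).
    y = np.vander(x, len(w), increasing=True) @ w
    E = 0.5 * np.sum((y - t) ** 2)
    return np.sqrt(2 * E / len(x))

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=10)
x_test = rng.uniform(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

for M in range(10):
    w = fit(x_train, t_train, M)
    print(f"M={M}: train={e_rms(x_train, t_train, w):.3f}  "
          f"test={e_rms(x_test, t_test, w):.3f}")
```

Run on data of this kind, the training error decreases steadily with M while the test error typically rises sharply at M = 9, which is the qualitative behaviour plotted in Figure 1.5.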
Figure 1.5  Graphs of the root-mean-square error, defined by (1.3), evaluated on the training set and on an independent test set for various values of M.

For M = 9, the training set error goes to zero, as we might expect because this polynomial contains 10 degrees of freedom corresponding to the 10 coefficients w0, . . . , w9, and so can be tuned exactly to the 10 data points in the training set. However, the test set error has become very large and, as we saw in Figure 1.4, the corresponding function y(x, w⋆) exhibits wild oscillations.

This may seem paradoxical because a polynomial of given order contains all lower order polynomials as special cases. The M = 9 polynomial is therefore capable of generating results at least as good as the M = 3 polynomial. Furthermore, we might suppose that the best predictor of new data would be the function sin(2πx) from which the data was generated (and we shall see later that this is indeed the case). We know that a power series expansion of the function sin(2πx) contains terms of all orders, so we might expect that results should improve monotonically as we increase M.

We can gain some insight into the problem by examining the values of the coefficients w⋆ obtained from polynomials of various order, as shown in Table 1.1. We see that, as M increases, the magnitude of the coefficients typically gets larger. In particular for the M = 9 polynomial, the coefficients have become finely tuned to the data by developing large positive and negative values so that the corresponding polynomial function matches each of the data points exactly, but between data points (particularly near the ends of the range) the function exhibits the large oscillations observed in Figure 1.4. Intuitively, what is happening is that the more flexible polynomials with larger values of M are becoming increasingly tuned to the random noise on the target values.

Table 1.1  Table of the coefficients w⋆ for polynomials of various order. Observe how the typical magnitude of the coefficients increases dramatically as the order of the polynomial increases.

          M = 0    M = 1      M = 6          M = 9
w0⋆        0.19     0.82       0.31           0.35
w1⋆                -1.27       7.99         232.37
w2⋆                          -25.43       -5321.83
w3⋆                           17.37       48568.31
w4⋆                                      -231639.30
w5⋆                                       640042.26
w6⋆                                     -1061800.52
w7⋆                                      1042400.18
w8⋆                                      -557682.99
w9⋆                                       125201.43
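The growth of the coefficient magnitudes can be reproduced qualitatively with the same assumed synthetic setup; the exact numbers will differ from Table 1.1 because the noise realization and the fitting data are not the book's.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

for M in (0, 1, 6, 9):
    Phi = np.vander(x, M + 1, increasing=True)
    w = np.linalg.lstsq(Phi, t, rcond=None)[0]   # least-squares fit of order M
    print(f"M = {M}: largest |w| = {np.abs(w).max():.2f}")
```

For M = 9 the fit interpolates the 10 points exactly, and the largest coefficient magnitude typically jumps by several orders of magnitude, mirroring the pattern in the table.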
Figure 1.6  Plots of the solutions obtained by minimizing the sum-of-squares error function using the M = 9 polynomial for N = 15 data points (left plot) and N = 100 data points (right plot). We see that increasing the size of the data set reduces the over-fitting problem.

It is also interesting to examine the behaviour of a given model as the size of the data set is varied, as shown in Figure 1.6. We see that, for a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases. Another way to say this is that the larger the data set, the more complex (in other words more flexible) the model that we can afford to fit to the data. One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model. However, as we shall see in Chapter 3, the number of parameters is not necessarily the most appropriate measure of model complexity.

Also, there is something rather unsatisfying about having to limit the number of parameters in a model according to the size of the available training set. It would seem more reasonable to choose the complexity of the model according to the complexity of the problem being solved. We shall see that the least squares approach to finding the model parameters represents a specific case of maximum likelihood (discussed in Section 1.2.5), and that the over-fitting problem can be understood as a general property of maximum likelihood (Section 3.4). By adopting a Bayesian approach, the over-fitting problem can be avoided. We shall see that there is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set.
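A rough sketch of the effect shown in Figure 1.6: fitting the same M = 9 model to N = 15 and N = 100 points and measuring how far the fitted curve strays from the noise-free generating function sin(2πx) on a dense grid. The data generation, the helper name test_rms_for, and the grid size are assumptions of this sketch; the book reports no such numbers.

```python
import numpy as np

def test_rms_for(N, M=9, seed=3):
    # Fit an order-M polynomial to N noisy points, then compare the fitted
    # curve with the noise-free function sin(2*pi*x) on a dense grid.
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
    w = np.linalg.lstsq(np.vander(x, M + 1, increasing=True), t, rcond=None)[0]
    x_grid = np.linspace(0, 1, 200)
    y_grid = np.vander(x_grid, M + 1, increasing=True) @ w
    return np.sqrt(np.mean((y_grid - np.sin(2 * np.pi * x_grid)) ** 2))

# The discrepancy typically drops sharply as N grows, for the same M.
print(test_rms_for(15), test_rms_for(100))
```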
For the moment, however, it is instructive to continue with the current approach and to consider how in practice we can apply it to data sets of limited size where we may wish to use relatively complex and flexible models. One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form

    Ẽ(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (λ/2) ∥w∥²    (1.4)

where ∥w∥² ≡ wᵀw = w0² + w1² + . . . + wM², and the coefficient λ governs the relative importance of the regularization term compared with the sum-of-squares error term. Note that often the coefficient w0 is omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable (Hastie et al., 2001), or it may be included but with its own regularization coefficient (we shall discuss this topic in more detail in Section 5.5.1). Again, the error function in (1.4) can be minimized exactly in closed form (Exercise 1.2). Techniques such as this are known in the statistics literature as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regularizer is called ridge regression (Hoerl and Kennard, 1970). In the context of neural networks, this approach is known as weight decay.

Figure 1.7  Plots of M = 9 polynomials fitted to the data set shown in Figure 1.2 using the regularized error function (1.4) for two values of the regularization parameter λ corresponding to ln λ = −18 and ln λ = 0. The case of no regularizer, i.e., λ = 0, corresponding to ln λ = −∞, is shown at the bottom right of Figure 1.4.

Figure 1.7 shows the results of fitting the polynomial of order M = 9 to the same data set as before but now using the regularized error function given by (1.4). We see that, for a value of ln λ = −18, the over-fitting has been suppressed and we now obtain a much closer representation of the underlying function sin(2πx). If, however, we use too large a value for λ then we again obtain a poor fit, as shown in Figure 1.7 for ln λ = 0. The corresponding coefficients from the fitted polynomials are given in Table 1.2, showing that regularization has the desired effect of reducing the magnitude of the coefficients.
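Minimizing (1.4) again leads to a linear system, now (ΦᵀΦ + λI)w = Φᵀt, which is the standard ridge-regression solution. The sketch below uses the same assumed synthetic data as the earlier snippets and, for simplicity, penalizes w0 along with the other coefficients, unlike the convention mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

M = 9
Phi = np.vander(x, M + 1, increasing=True)
I = np.eye(M + 1)

for ln_lam in (-18.0, 0.0):
    lam = np.exp(ln_lam)
    # Setting the gradient of (1.4) to zero gives (Phi^T Phi + lam I) w = Phi^T t.
    w = np.linalg.solve(Phi.T @ Phi + lam * I, Phi.T @ t)
    print(f"ln lambda = {ln_lam:6.1f}: largest |w| = {np.abs(w).max():.2f}")
```

Even the tiny penalty at ln λ = −18 is typically enough to tame the enormous coefficients of the unregularized M = 9 fit, while ln λ = 0 shrinks them so strongly that the fit becomes poor, which is the trade-off illustrated in Figure 1.7.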