Figure 1.3  The error function (1.2) corresponds to (one half of) the sum of the squares of the displacements (shown by the vertical green bars) of each data point from the function y(x, w).

The error function is a nonnegative quantity that would be zero if, and only if, the function y(x, w) were to pass exactly through each training data point. The geometrical interpretation of the sum-of-squares error function is illustrated in Figure 1.3. We can solve the curve fitting problem by choosing the value of w for which E(w) is as small as possible. Because the error function is a quadratic function of the coefficients w, its derivatives with respect to the coefficients will be linear in the elements of w, and so the minimization of the error function has a unique solution, denoted by w⋆, which can be found in closed form (Exercise 1.1). The resulting polynomial is given by the function y(x, w⋆).

There remains the problem of choosing the order M of the polynomial, and as we shall see this will turn out to be an example of an important concept called model comparison or model selection. In Figure 1.4, we show four examples of the results of fitting polynomials having orders M = 0, 1, 3, and 9 to the data set shown in Figure 1.2.

We notice that the constant (M = 0) and first order (M = 1) polynomials give rather poor fits to the data and consequently rather poor representations of the function sin(2πx). The third order (M = 3) polynomial seems to give the best fit to the function sin(2πx) of the examples shown in Figure 1.4. When we go to a much higher order polynomial (M = 9), we obtain an excellent fit to the training data. In fact, the polynomial passes exactly through each data point and E(w⋆) = 0. However, the fitted curve oscillates wildly and gives a very poor representation of the function sin(2πx). This latter behaviour is known as over-fitting.

As we have noted earlier, the goal is to achieve good generalization by making accurate predictions for new data. We can obtain some quantitative insight into the dependence of the generalization performance on M by considering a separate test set comprising 100 data points generated using exactly the same procedure used to generate the training set points but with new choices for the random noise values included in the target values. For each choice of M, we can then evaluate the residual value of E(w⋆) given by (1.2) for the training data, and we can also evaluate E(w⋆) for the test data set.
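As a concrete illustration of this closed-form solution, the following sketch (not from the book) fits a polynomial by solving the normal equations with NumPy. The synthetic data, the noise level of 0.3, and the random seed are illustrative assumptions standing in for the data set of Figure 1.2.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 10, 3
x = np.linspace(0.0, 1.0, N)
# Assumed data-generating process: sin(2*pi*x) plus Gaussian noise.
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)

# Design matrix whose columns are the powers x^0, x^1, ..., x^M.
Phi = np.vander(x, M + 1, increasing=True)

# Because E(w) is quadratic, setting its gradient to zero gives the linear
# normal equations (Phi^T Phi) w* = Phi^T t, solved here directly.
w_star = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

y = Phi @ w_star                      # fitted values y(x_n, w*)
E = 0.5 * np.sum((y - t) ** 2)        # sum-of-squares error (1.2) at its minimum
print(w_star, E)
```

The point of the sketch is simply that a single call to a linear solver suffices, precisely because the derivatives of E(w) are linear in the elements of w.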
It is sometimes more convenient to use the root-mean-square (RMS) error defined by

    ERMS = √(2E(w⋆)/N)    (1.3)

in which the division by N allows us to compare different sizes of data sets on an equal footing, and the square root ensures that ERMS is measured on the same scale (and in the same units) as the target variable t.

Figure 1.4  Plots of polynomials having various orders M, shown as red curves, fitted to the data set shown in Figure 1.2.

Graphs of the training and test set RMS errors are shown, for various values of M, in Figure 1.5. The test set error is a measure of how well we are doing in predicting the values of t for new data observations of x.

We note from Figure 1.5 that small values of M give relatively large values of the test set error, and this can be attributed to the fact that the corresponding polynomials are rather inflexible and are incapable of capturing the oscillations in the function sin(2πx). Values of M in the range 3 ≤ M ≤ 8 give small values for the test set error, and these also give reasonable representations of the generating function sin(2πx), as can be seen, for the case of M = 3, from Figure 1.4.
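A minimal sketch of the train/test comparison behind Figure 1.5, under the same assumed synthetic-data setup as before. The helper names, the seeds, and the use of numpy.linalg.lstsq (for numerical stability at high M) are choices of this sketch, not specifications from the text.

```python
import numpy as np

def fit(x, t, M):
    # Least-squares polynomial fit of order M to the points (x_n, t_n).
    Phi = np.vander(x, M + 1, increasing=True)
    return np.linalg.lstsq(Phi, t, rcond=None)[0]

def e_rms(x, t, w):
    # Root-mean-square error (1.3): sqrt(2 E(w) / N).
    y = np.vander(x, len(w), increasing=True) @ w
    E = 0.5 * np.sum((y - t) ** 2)
    return np.sqrt(2 * E / len(x))

rng = np.random.default_rng(1)
x_train = np.linspace(0, 1, 10)
t_train = np.sin(2 * np.pi * x_train) + rng.normal(scale=0.3, size=10)
x_test = rng.uniform(0, 1, 100)
t_test = np.sin(2 * np.pi * x_test) + rng.normal(scale=0.3, size=100)

for M in range(10):
    w = fit(x_train, t_train, M)
    print(f"M={M}: train={e_rms(x_train, t_train, w):.3f}  "
          f"test={e_rms(x_test, t_test, w):.3f}")
```

Run on data of this kind, the training error decreases steadily with M while the test error typically rises sharply at M = 9, which is the qualitative behaviour plotted in Figure 1.5.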
Figure 1.5  Graphs of the root-mean-square error, defined by (1.3), evaluated on the training set and on an independent test set for various values of M.

For M = 9, the training set error goes to zero, as we might expect because this polynomial contains 10 degrees of freedom corresponding to the 10 coefficients w0, . . . , w9, and so can be tuned exactly to the 10 data points in the training set. However, the test set error has become very large and, as we saw in Figure 1.4, the corresponding function y(x, w⋆) exhibits wild oscillations.

This may seem paradoxical because a polynomial of given order contains all lower order polynomials as special cases. The M = 9 polynomial is therefore capable of generating results at least as good as the M = 3 polynomial. Furthermore, we might suppose that the best predictor of new data would be the function sin(2πx) from which the data was generated (and we shall see later that this is indeed the case). We know that a power series expansion of the function sin(2πx) contains terms of all orders, so we might expect that results should improve monotonically as we increase M.

We can gain some insight into the problem by examining the values of the coefficients w⋆ obtained from polynomials of various order, as shown in Table 1.1. We see that, as M increases, the magnitude of the coefficients typically gets larger. In particular for the M = 9 polynomial, the coefficients have become finely tuned to the data by developing large positive and negative values so that the corresponding polynomial function matches each of the data points exactly, but between data points (particularly near the ends of the range) the function exhibits the large oscillations observed in Figure 1.4. Intuitively, what is happening is that the more flexible polynomials with larger values of M are becoming increasingly tuned to the random noise on the target values.

Table 1.1  Table of the coefficients w⋆ for polynomials of various order. Observe how the typical magnitude of the coefficients increases dramatically as the order of the polynomial increases.

          M = 0    M = 1      M = 6          M = 9
w0⋆        0.19     0.82       0.31           0.35
w1⋆                -1.27       7.99         232.37
w2⋆                          -25.43       -5321.83
w3⋆                           17.37       48568.31
w4⋆                                      -231639.30
w5⋆                                       640042.26
w6⋆                                     -1061800.52
w7⋆                                      1042400.18
w8⋆                                      -557682.99
w9⋆                                       125201.43
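The growth of the coefficient magnitudes can be reproduced qualitatively with the same assumed synthetic setup; the exact numbers will differ from Table 1.1 because the noise realization and the fitting data are not the book's.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

for M in (0, 1, 6, 9):
    Phi = np.vander(x, M + 1, increasing=True)
    w = np.linalg.lstsq(Phi, t, rcond=None)[0]   # least-squares fit of order M
    print(f"M = {M}: largest |w| = {np.abs(w).max():.2f}")
```

For M = 9 the fit interpolates the 10 points exactly, and the largest coefficient magnitude typically jumps by several orders of magnitude, mirroring the pattern in the table.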
Figure 1.6  Plots of the solutions obtained by minimizing the sum-of-squares error function using the M = 9 polynomial for N = 15 data points (left plot) and N = 100 data points (right plot). We see that increasing the size of the data set reduces the over-fitting problem.

It is also interesting to examine the behaviour of a given model as the size of the data set is varied, as shown in Figure 1.6. We see that, for a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases. Another way to say this is that the larger the data set, the more complex (in other words more flexible) the model that we can afford to fit to the data. One rough heuristic that is sometimes advocated is that the number of data points should be no less than some multiple (say 5 or 10) of the number of adaptive parameters in the model. However, as we shall see in Chapter 3, the number of parameters is not necessarily the most appropriate measure of model complexity.

Also, there is something rather unsatisfying about having to limit the number of parameters in a model according to the size of the available training set. It would seem more reasonable to choose the complexity of the model according to the complexity of the problem being solved. We shall see that the least squares approach to finding the model parameters represents a specific case of maximum likelihood (discussed in Section 1.2.5), and that the over-fitting problem can be understood as a general property of maximum likelihood (Section 3.4). By adopting a Bayesian approach, the over-fitting problem can be avoided. We shall see that there is no difficulty from a Bayesian perspective in employing models for which the number of parameters greatly exceeds the number of data points. Indeed, in a Bayesian model the effective number of parameters adapts automatically to the size of the data set.
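A rough sketch of the effect shown in Figure 1.6: fitting the same M = 9 model to N = 15 and N = 100 points and measuring how far the fitted curve strays from the noise-free generating function sin(2πx) on a dense grid. The data generation, the helper name test_rms_for, and the grid size are assumptions of this sketch; the book reports no such numbers.

```python
import numpy as np

def test_rms_for(N, M=9, seed=3):
    # Fit an order-M polynomial to N noisy points, then compare the fitted
    # curve with the noise-free function sin(2*pi*x) on a dense grid.
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, N)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)
    w = np.linalg.lstsq(np.vander(x, M + 1, increasing=True), t, rcond=None)[0]
    x_grid = np.linspace(0, 1, 200)
    y_grid = np.vander(x_grid, M + 1, increasing=True) @ w
    return np.sqrt(np.mean((y_grid - np.sin(2 * np.pi * x_grid)) ** 2))

# The discrepancy typically drops sharply as N grows, for the same M.
print(test_rms_for(15), test_rms_for(100))
```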
For the moment, however, it is instructive to continue with the current approach and to consider how in practice we can apply it to data sets of limited size where we may wish to use relatively complex and flexible models. One technique that is often used to control the over-fitting phenomenon in such cases is that of regularization, which involves adding a penalty term to the error function (1.2) in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form

    Ẽ(w) = (1/2) Σ_{n=1}^{N} {y(xn, w) − tn}² + (λ/2) ∥w∥²    (1.4)

where ∥w∥² ≡ wᵀw = w0² + w1² + . . . + wM², and the coefficient λ governs the relative importance of the regularization term compared with the sum-of-squares error term. Note that often the coefficient w0 is omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable (Hastie et al., 2001), or it may be included but with its own regularization coefficient (we shall discuss this topic in more detail in Section 5.5.1). Again, the error function in (1.4) can be minimized exactly in closed form (Exercise 1.2). Techniques such as this are known in the statistics literature as shrinkage methods because they reduce the value of the coefficients. The particular case of a quadratic regularizer is called ridge regression (Hoerl and Kennard, 1970). In the context of neural networks, this approach is known as weight decay.

Figure 1.7  Plots of M = 9 polynomials fitted to the data set shown in Figure 1.2 using the regularized error function (1.4) for two values of the regularization parameter λ corresponding to ln λ = −18 and ln λ = 0. The case of no regularizer, i.e., λ = 0, corresponding to ln λ = −∞, is shown at the bottom right of Figure 1.4.

Figure 1.7 shows the results of fitting the polynomial of order M = 9 to the same data set as before but now using the regularized error function given by (1.4). We see that, for a value of ln λ = −18, the over-fitting has been suppressed and we now obtain a much closer representation of the underlying function sin(2πx). If, however, we use too large a value for λ then we again obtain a poor fit, as shown in Figure 1.7 for ln λ = 0. The corresponding coefficients from the fitted polynomials are given in Table 1.2, showing that regularization has the desired effect of reducing the magnitude of the coefficients.
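Minimizing (1.4) again leads to a linear system, now (ΦᵀΦ + λI)w = Φᵀt, which is the standard ridge-regression solution. The sketch below uses the same assumed synthetic data as the earlier snippets and, for simplicity, penalizes w0 along with the other coefficients, unlike the convention mentioned in the text.

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

M = 9
Phi = np.vander(x, M + 1, increasing=True)
I = np.eye(M + 1)

for ln_lam in (-18.0, 0.0):
    lam = np.exp(ln_lam)
    # Setting the gradient of (1.4) to zero gives (Phi^T Phi + lam I) w = Phi^T t.
    w = np.linalg.solve(Phi.T @ Phi + lam * I, Phi.T @ t)
    print(f"ln lambda = {ln_lam:6.1f}: largest |w| = {np.abs(w).max():.2f}")
```

Even the tiny penalty at ln λ = −18 is typically enough to tame the enormous coefficients of the unregularized M = 9 fit, while ln λ = 0 shrinks them so strongly that the fit becomes poor, which is the trade-off illustrated in Figure 1.7.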