1 Introduction

The problem of searching for patterns in data is a fundamental one and has a long and successful history. For instance, the extensive astronomical observations of Tycho Brahe in the 16th century allowed Johannes Kepler to discover the empirical laws of planetary motion, which in turn provided a springboard for the development of classical mechanics. Similarly, the discovery of regularities in atomic spectra played a key role in the development and verification of quantum physics in the early twentieth century. The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories.

Consider the example of recognizing handwritten digits, illustrated in Figure 1.1. Each digit corresponds to a 28×28 pixel image and so can be represented by a vector x comprising 784 real numbers. The goal is to build a machine that will take such a vector x as input and that will produce the identity of the digit 0, ..., 9 as the output. This is a nontrivial problem due to the wide variability of handwriting. It could be tackled using handcrafted rules or heuristics for distinguishing the digits based on the shapes of the strokes, but in practice such an approach leads to a proliferation of rules and of exceptions to the rules and so on, and invariably gives poor results.
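To make this input representation concrete, the following sketch (an illustration only, not from the original text; NumPy is assumed) flattens a 28×28 pixel grid into the 784-dimensional vector x described above:

    import numpy as np

    # A hypothetical 28x28 grayscale digit image, pixel intensities in [0, 1].
    image = np.random.rand(28, 28)

    # Flatten the two-dimensional pixel grid into a single vector x of
    # 784 real numbers, the input representation described in the text.
    x = image.reshape(-1)
    print(x.shape)  # (784,)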
Figure 1.1 Examples of hand-written digits taken from US zip codes.

Far better results can be obtained by adopting a machine learning approach in which a large set of N digits {x1, ..., xN} called a training set is used to tune the parameters of an adaptive model. The categories of the digits in the training set are known in advance, typically by inspecting them individually and hand-labelling them. We can express the category of a digit using target vector t, which represents the identity of the corresponding digit. Suitable techniques for representing categories in terms of vectors will be discussed later. Note that there is one such target vector t for each digit image x.
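One common way to realize such a target vector is a 1-of-K (one-hot) coding, a binary vector with a single 1 marking the digit's class. The book discusses suitable representations later, so the sketch below is an illustration rather than the text's own choice:

    import numpy as np

    def one_hot(digit, num_classes=10):
        """Encode a digit label 0..9 as a target vector t with a single 1."""
        t = np.zeros(num_classes)
        t[digit] = 1.0
        return t

    print(one_hot(3))  # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]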
The result of running the machine learning algorithm can be expressed as a function y(x) which takes a new digit image x as input and that generates an output vector y, encoded in the same way as the target vectors. The precise form of the function y(x) is determined during the training phase, also known as the learning phase, on the basis of the training data. Once the model is trained it can then determine the identity of new digit images, which are said to comprise a test set. The ability to categorize correctly new examples that differ from those used for training is known as generalization. In practical applications, the variability of the input vectors will be such that the training data can comprise only a tiny fraction of all possible input vectors, and so generalization is a central goal in pattern recognition.

For most practical applications, the original input variables are typically preprocessed to transform them into some new space of variables where, it is hoped, the pattern recognition problem will be easier to solve. For instance, in the digit recognition problem, the images of the digits are typically translated and scaled so that each digit is contained within a box of a fixed size. This greatly reduces the variability within each digit class, because the location and scale of all the digits are now the same, which makes it much easier for a subsequent pattern recognition algorithm to distinguish between the different classes. This pre-processing stage is sometimes also called feature extraction. Note that new test data must be pre-processed using the same steps as the training data.
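As a rough sketch of the translate-and-scale normalization just described (the helper below is hypothetical and uses crude nearest-neighbour resampling; a real system would use proper interpolation):

    import numpy as np

    def normalize_digit(image, box=20):
        """Translate and scale a digit so its ink fills a fixed-size box."""
        rows, cols = np.nonzero(image > 0)  # locate the ink
        crop = image[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
        # Nearest-neighbour resample of the crop onto a box-by-box grid.
        r = np.linspace(0, crop.shape[0] - 1, box).astype(int)
        c = np.linspace(0, crop.shape[1] - 1, box).astype(int)
        return crop[np.ix_(r, c)]

Applying the same function to training and test images satisfies the requirement, noted above, that test data be pre-processed with the same steps as the training data.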
Pre-processing might also be performed in order to speed up computation. For example, if the goal is real-time face detection in a high-resolution video stream, the computer must handle huge numbers of pixels per second, and presenting these directly to a complex pattern recognition algorithm may be computationally infeasible. Instead, the aim is to find useful features that are fast to compute, and yet that also preserve useful discriminatory information enabling faces to be distinguished from non-faces. These features are then used as the inputs to the pattern recognition algorithm. For instance, the average value of the image intensity over a rectangular subregion can be evaluated extremely efficiently (Viola and Jones, 2004), and a set of such features can prove very effective in fast face detection. Because the number of such features is smaller than the number of pixels, this kind of pre-processing represents a form of dimensionality reduction. Care must be taken during pre-processing because often information is discarded, and if this information is important to the solution of the problem then the overall accuracy of the system can suffer.
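The rectangular mean-intensity feature just mentioned can be evaluated in constant time per rectangle using an integral image of cumulative pixel sums. The sketch below is a minimal illustration of that idea, not the detector of Viola and Jones:

    import numpy as np

    def integral_image(img):
        """Cumulative row and column sums, so any rectangle sum needs
        only four array lookups."""
        return img.cumsum(axis=0).cumsum(axis=1)

    def mean_intensity(ii, top, left, bottom, right):
        """Average intensity over rows [top, bottom) and cols [left, right)."""
        total = ii[bottom - 1, right - 1]
        if top > 0:
            total -= ii[top - 1, right - 1]
        if left > 0:
            total -= ii[bottom - 1, left - 1]
        if top > 0 and left > 0:
            total += ii[top - 1, left - 1]
        return total / ((bottom - top) * (right - left))

    img = np.random.rand(64, 64)
    ii = integral_image(img)
    print(np.isclose(mean_intensity(ii, 8, 8, 24, 40), img[8:24, 8:40].mean()))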
Applications in which the training data comprises examples of the input vectors along with their corresponding target vectors are known as supervised learning problems. Cases such as the digit recognition example, in which the aim is to assign each input vector to one of a finite number of discrete categories, are called classification problems. If the desired output consists of one or more continuous variables, then the task is called regression. An example of a regression problem would be the prediction of the yield in a chemical manufacturing process in which the inputs consist of the concentrations of reactants, the temperature, and the pressure.

In other pattern recognition problems, the training data consists of a set of input vectors x without any corresponding target values. The goal in such unsupervised learning problems may be to discover groups of similar examples within the data, where it is called clustering, or to determine the distribution of data within the input space, known as density estimation, or to project the data from a high-dimensional space down to two or three dimensions for the purpose of visualization.
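As one concrete illustration of clustering, here is a sketch of the standard k-means algorithm (discussed much later in the book; this is an example, not this chapter's method):

    import numpy as np

    def kmeans(X, k, iters=100, rng=np.random.default_rng(0)):
        """Group the rows of X into k clusters of similar examples."""
        centers = X[rng.choice(len(X), k, replace=False)]
        for _ in range(iters):
            # Assign each point to its nearest cluster centre...
            labels = np.argmin(
                ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
            # ...then move each centre to the mean of its assigned points,
            # keeping a centre in place if no points were assigned to it.
            centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)])
        return labels, centers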
Finally, the technique of reinforcement learning (Sutton and Barto, 1998) is concerned with the problem of finding suitable actions to take in a given situation in order to maximize a reward. Here the learning algorithm is not given examples of optimal outputs, in contrast to supervised learning, but must instead discover them by a process of trial and error. Typically there is a sequence of states and actions in which the learning algorithm is interacting with its environment. In many cases, the current action not only affects the immediate reward but also has an impact on the reward at all subsequent time steps. For example, by using appropriate reinforcement learning techniques a neural network can learn to play the game of backgammon to a high standard (Tesauro, 1994). Here the network must learn to take a board position as input, along with the result of a dice throw, and produce a strong move as the output. This is done by having the network play against a copy of itself for perhaps a million games. A major challenge is that a game of backgammon can involve dozens of moves, and yet it is only at the end of the game that the reward, in the form of victory, is achieved. The reward must then be attributed appropriately to all of the moves that led to it, even though some moves will have been good ones and others less so. This is an example of a credit assignment problem. A general feature of reinforcement learning is the trade-off between exploration, in which the system tries out new kinds of actions to see how effective they are, and exploitation, in which the system makes use of actions that are known to yield a high reward. Too strong a focus on either exploration or exploitation will yield poor results. Reinforcement learning continues to be an active area of machine learning research. However, a detailed treatment lies beyond the scope of this book.
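A simple way to see the exploration/exploitation trade-off in code is the epsilon-greedy rule (a standard illustration, not a technique described in this chapter): with small probability the system explores a random action, and otherwise it exploits its current best estimate.

    import numpy as np

    rng = np.random.default_rng()

    def epsilon_greedy(value_estimates, epsilon=0.1):
        """Choose an action index given estimated rewards per action."""
        if rng.random() < epsilon:
            return int(rng.integers(len(value_estimates)))  # explore
        return int(np.argmax(value_estimates))              # exploit

Setting epsilon to 0 or 1 recovers the two failure modes noted above: pure exploitation or pure exploration.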
Figure 1.2 Plot of a training data set of N = 10 points, shown as blue circles, each comprising an observation of the input variable x along with the corresponding target variable t. The green curve shows the function sin(2πx) used to generate the data. Our goal is to predict the value of t for some new value of x, without knowledge of the green curve.

Although each of these tasks needs its own tools and techniques, many of the key ideas that underpin them are common to all such problems. One of the main goals of this chapter is to introduce, in a relatively informal way, several of the most important of these concepts and to illustrate them using simple examples. Later in the book we shall see these same ideas re-emerge in the context of more sophisticated models that are applicable to real-world pattern recognition applications. This chapter also provides a self-contained introduction to three important tools that will be used throughout the book, namely probability theory, decision theory, and information theory. Although these might sound like daunting topics, they are in fact straightforward, and a clear understanding of them is essential if machine learning techniques are to be used to best effect in practical applications.

1.1. Example: Polynomial Curve Fitting

We begin by introducing a simple regression problem, which we shall use as a running example throughout this chapter to motivate a number of key concepts. Suppose we observe a real-valued input variable x and we wish to use this observation to predict the value of a real-valued target variable t. For the present purposes, it is instructive to consider an artificial example using synthetically generated data because we then know the precise process that generated the data for comparison against any learned model. The data for this example is generated from the function sin(2πx) with random noise included in the target values, as described in detail in Appendix A.

Now suppose that we are given a training set comprising N observations of x, written x ≡ (x1, ..., xN)^T, together with corresponding observations of the values of t, denoted t ≡ (t1, ..., tN)^T. Figure 1.2 shows a plot of a training set comprising N = 10 data points. The input data set x in Figure 1.2 was generated by choosing values of xn, for n = 1, ..., N, spaced uniformly in the range [0, 1], and the target data set t was obtained by first computing the corresponding values of the function sin(2πx) and then adding a small level of random noise having a Gaussian distribution (the Gaussian distribution is discussed in Section 1.2.4) to each such point in order to obtain the corresponding value tn. By generating data in this way, we are capturing a property of many real data sets, namely that they possess an underlying regularity, which we wish to learn, but that individual observations are corrupted by random noise. This noise might arise from intrinsically stochastic (i.e. random) processes such as radioactive decay but more typically is due to there being sources of variability that are themselves unobserved.
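This data-generating process translates almost directly into code. A sketch, where the noise standard deviation of 0.3 is an illustrative choice (the book's exact settings are given in its Appendix A):

    import numpy as np

    N = 10
    rng = np.random.default_rng(1)

    # Inputs spaced uniformly over [0, 1], as in Figure 1.2.
    x = np.linspace(0, 1, N)

    # Targets: the underlying regularity sin(2*pi*x), corrupted by
    # Gaussian noise (illustrative standard deviation of 0.3).
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)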
Our goal is to exploit this training set in order to make predictions of the value t̂ of the target variable for some new value x̂ of the input variable. As we shall see later, this involves implicitly trying to discover the underlying function sin(2πx). This is intrinsically a difficult problem as we have to generalize from a finite data set. Furthermore, the observed data are corrupted with noise, and so for a given x̂ there is uncertainty as to the appropriate value for t̂. Probability theory, discussed in Section 1.2, provides a framework for expressing such uncertainty in a precise and quantitative manner, and decision theory, discussed in Section 1.5, allows us to exploit this probabilistic representation in order to make predictions that are optimal according to appropriate criteria.

For the moment, however, we shall proceed rather informally and consider a simple approach based on curve fitting. In particular, we shall fit the data using a polynomial function of the form

    y(x, w) = w_0 + w_1 x + w_2 x^2 + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j    (1.1)

where M is the order of the polynomial, and x^j denotes x raised to the power of j. The polynomial coefficients w0, ..., wM are collectively denoted by the vector w. Note that, although the polynomial function y(x, w) is a nonlinear function of x, it is a linear function of the coefficients w. Functions, such as the polynomial, which are linear in the unknown parameters have important properties and are called linear models and will be discussed extensively in Chapters 3 and 4.

The values of the coefficients will be determined by fitting the polynomial to the training data. This can be done by minimizing an error function that measures the misfit between the function y(x, w), for any given value of w, and the training set data points. One simple choice of error function, which is widely used, is given by the sum of the squares of the errors between the predictions y(xn, w) for each data point xn and the corresponding target values tn, so that we minimize

    E(w) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, w) - t_n \}^2    (1.2)

where the factor of 1/2 is included for later convenience. We shall discuss the motivation for this choice of error function later in this chapter. For the moment we simply note that it is a nonnegative quantity that would be zero if, and only if, the function y(x, w) were to pass exactly through each training data point.
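Because y(x, w) is linear in the coefficients, minimizing (1.2) is a linear least-squares problem. A minimal sketch of the fit (the helper names are illustrative, and the synthetic data is regenerated here so the block stands alone):

    import numpy as np

    rng = np.random.default_rng(1)
    x = np.linspace(0, 1, 10)
    t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

    def fit_polynomial(x, t, M):
        """Minimize the sum-of-squares error (1.2) for an order-M polynomial."""
        X = np.vander(x, M + 1, increasing=True)  # columns 1, x, ..., x^M
        w, *_ = np.linalg.lstsq(X, t, rcond=None)
        return w

    def y(x, w):
        """Evaluate the polynomial (1.1) at the points in x."""
        return np.vander(np.atleast_1d(x), len(w), increasing=True) @ w

    w = fit_polynomial(x, t, M=3)
    print(float(y(0.5, w)[0]))  # prediction at a new input value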