Mathematical notation

I have tried to keep the mathematical content of the book to the minimum necessary to achieve a proper understanding of the field. However, this minimum level is nonzero, and it should be emphasized that a good grasp of calculus, linear algebra, and probability theory is essential for a clear understanding of modern pattern recognition and machine learning techniques. Nevertheless, the emphasis in this book is on conveying the underlying concepts rather than on mathematical rigour.

I have tried to use a consistent notation throughout the book, although at times this means departing from some of the conventions used in the corresponding research literature. Vectors are denoted by lower case bold Roman letters such as x, and all vectors are assumed to be column vectors. A superscript T denotes the transpose of a matrix or vector, so that xT will be a row vector. Uppercase bold Roman letters, such as M, denote matrices. The notation (w1,...,wM) denotes a row vector with M elements, while the corresponding column vector is written as w = (w1,...,wM)T.

The notation [a, b] is used to denote the closed interval from a to b, that is, the interval including the values a and b themselves, while (a, b) denotes the corresponding open interval, that is, the interval excluding a and b. Similarly, [a, b) denotes an interval that includes a but excludes b. For the most part, however, there will be little need to dwell on such refinements as whether the end points of an interval are included or not.

The M × M identity matrix (also known as the unit matrix) is denoted IM, which will be abbreviated to I where there is no ambiguity about its dimensionality. It has elements Iij that equal 1 if i = j and 0 if i ≠ j.

A functional is denoted f[y] where y(x) is some function. The concept of a functional is discussed in Appendix D.

The notation g(x) = O(f(x)) denotes that |g(x)/f(x)| is bounded as x → ∞. For instance, if g(x) = 3x² + 2, then g(x) = O(x²).
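As a concrete illustration of these conventions, here is a minimal sketch in Python. It is not part of the book; the use of NumPy, the value M = 3, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

# Illustrative sketch (assumes NumPy) of the notational conventions above.
M = 3

# Column vector w = (w1, ..., wM)^T, stored as an (M, 1) array so that the
# distinction from the row vector w^T = (w1, ..., wM) is explicit.
w = np.array([[1.0], [2.0], [3.0]])
w_T = w.T                        # row vector, shape (1, M)

# The M x M identity (unit) matrix I_M, with elements I_ij = 1 if i = j, else 0.
I_M = np.eye(M)
assert np.allclose(I_M @ w, w)   # multiplying by I_M leaves w unchanged

# Numerical check of the O(.) example: for g(x) = 3x^2 + 2 we have
# g(x) = O(x^2), since the ratio g(x)/x^2 stays bounded (it approaches 3).
for x in (10.0, 100.0, 1000.0):
    print(x, (3 * x**2 + 2) / x**2)
```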
The expectation of a function f(x, y) with respect to a random variable x is denoted by Ex[f(x, y)]. In situations where there is no ambiguity as to which variable is being averaged over, this will be simplified by omitting the suffix, for instance E[x]. If the distribution of x is conditioned on another variable z, then the corresponding conditional expectation will be written Ex[f(x)|z]. Similarly, the variance is denoted var[f(x)], and for vector variables the covariance is written cov[x, y]. We shall also use cov[x] as a shorthand notation for cov[x, x]. The concepts of expectations and covariances are introduced in Section 1.2.2.

If we have N values x1,...,xN of a D-dimensional vector x = (x1,...,xD)T, we can combine the observations into a data matrix X in which the nth row of X corresponds to the row vector xnT. Thus the n, i element of X corresponds to the ith element of the nth observation xn. For the case of one-dimensional variables we shall denote such a matrix by x, which is a column vector whose nth element is xn. Note that x (which has dimensionality N) uses a different typeface to distinguish it from x (which has dimensionality D).
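The data-matrix and expectation conventions can likewise be sketched in code. Again this is an illustrative example rather than anything from the book; NumPy, the random seed, and the sample sizes N and D are all assumptions chosen for the demonstration.

```python
import numpy as np

# Illustrative sketch (assumes NumPy): N observations of a D-dimensional
# vector x stacked into a data matrix X whose nth row is x_n^T, so that
# X[n, i] is the ith element of the nth observation x_n.
rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.normal(size=(N, D))              # data matrix, shape (N, D)
x_n = X[1, :]                            # one observation x_n (here the second row)

# Sample-based estimates corresponding to E[x] and cov[x] = cov[x, x]
# (the definitions themselves are introduced in Section 1.2.2).
mean_est = X.mean(axis=0)                # D-dimensional estimate of E[x]
cov_est = np.cov(X, rowvar=False)        # D x D estimate of cov[x]

# For one-dimensional variables the N observations form a column vector x
# of length N (distinguished by a different typeface in the book).
x_scalar_data = X[:, 0]

print(mean_est.shape, cov_est.shape, x_scalar_data.shape)
```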
Contents

Preface vii
Mathematical notation xi
Contents xiii

1 Introduction 1
1.1 Example: Polynomial Curve Fitting 4
1.2 Probability Theory 12
1.2.1 Probability densities 17
1.2.2 Expectations and covariances 19
1.2.3 Bayesian probabilities 21
1.2.4 The Gaussian distribution 24
1.2.5 Curve fitting re-visited 28
1.2.6 Bayesian curve fitting 30
1.3 Model Selection 32
1.4 The Curse of Dimensionality 33
1.5 Decision Theory 38
1.5.1 Minimizing the misclassification rate 39
1.5.2 Minimizing the expected loss 41
1.5.3 The reject option 42
1.5.4 Inference and decision 42
1.5.5 Loss functions for regression 46
1.6 Information Theory 48
1.6.1 Relative entropy and mutual information 55
Exercises 58
2 Probability Distributions 67
2.1 Binary Variables 68
2.1.1 The beta distribution 71
2.2 Multinomial Variables 74
2.2.1 The Dirichlet distribution 76
2.3 The Gaussian Distribution 78
2.3.1 Conditional Gaussian distributions 85
2.3.2 Marginal Gaussian distributions 88
2.3.3 Bayes’ theorem for Gaussian variables 90
2.3.4 Maximum likelihood for the Gaussian 93
2.3.5 Sequential estimation 94
2.3.6 Bayesian inference for the Gaussian 97
2.3.7 Student’s t-distribution 102
2.3.8 Periodic variables 105
2.3.9 Mixtures of Gaussians 110
2.4 The Exponential Family 113
2.4.1 Maximum likelihood and sufficient statistics 116
2.4.2 Conjugate priors 117
2.4.3 Noninformative priors 117
2.5 Nonparametric Methods 120
2.5.1 Kernel density estimators 122
2.5.2 Nearest-neighbour methods 124
Exercises 127

3 Linear Models for Regression 137
3.1 Linear Basis Function Models 138
3.1.1 Maximum likelihood and least squares 140
3.1.2 Geometry of least squares 143
3.1.3 Sequential learning 143
3.1.4 Regularized least squares 144
3.1.5 Multiple outputs 146
3.2 The Bias-Variance Decomposition 147
3.3 Bayesian Linear Regression 152
3.3.1 Parameter distribution 152
3.3.2 Predictive distribution 156
3.3.3 Equivalent kernel 159
3.4 Bayesian Model Comparison 161
3.5 The Evidence Approximation 165
3.5.1 Evaluation of the evidence function 166
3.5.2 Maximizing the evidence function 168
3.5.3 Effective number of parameters 170
3.6 Limitations of Fixed Basis Functions 172
Exercises 173
4 Linear Models for Classification 179
4.1 Discriminant Functions 181
4.1.1 Two classes 181
4.1.2 Multiple classes 182
4.1.3 Least squares for classification 184
4.1.4 Fisher’s linear discriminant 186
4.1.5 Relation to least squares 189
4.1.6 Fisher’s discriminant for multiple classes 191
4.1.7 The perceptron algorithm 192
4.2 Probabilistic Generative Models 196
4.2.1 Continuous inputs 198
4.2.2 Maximum likelihood solution 200
4.2.3 Discrete features 202
4.2.4 Exponential family 202
4.3 Probabilistic Discriminative Models 203
4.3.1 Fixed basis functions 204
4.3.2 Logistic regression 205
4.3.3 Iterative reweighted least squares 207
4.3.4 Multiclass logistic regression 209
4.3.5 Probit regression 210
4.3.6 Canonical link functions 212
4.4 The Laplace Approximation 213
4.4.1 Model comparison and BIC 216
4.5 Bayesian Logistic Regression 217
4.5.1 Laplace approximation 217
4.5.2 Predictive distribution 218
Exercises 220

5 Neural Networks 225
5.1 Feed-forward Network Functions 227
5.1.1 Weight-space symmetries 231
5.2 Network Training 232
5.2.1 Parameter optimization 236
5.2.2 Local quadratic approximation 237
5.2.3 Use of gradient information 239
5.2.4 Gradient descent optimization 240
5.3 Error Backpropagation 241
5.3.1 Evaluation of error-function derivatives 242
5.3.2 A simple example 245
5.3.3 Efficiency of backpropagation 246
5.3.4 The Jacobian matrix 247
5.4 The Hessian Matrix 249
5.4.1 Diagonal approximation 250
5.4.2 Outer product approximation 251
5.4.3 Inverse Hessian 252