Mathematical notation

I have tried to keep the mathematical content of the book to the minimum necessary to achieve a proper understanding of the field. However, this minimum level is nonzero, and it should be emphasized that a good grasp of calculus, linear algebra, and probability theory is essential for a clear understanding of modern pattern recognition and machine learning techniques. Nevertheless, the emphasis in this book is on conveying the underlying concepts rather than on mathematical rigour.

I have tried to use a consistent notation throughout the book, although at times this means departing from some of the conventions used in the corresponding research literature. Vectors are denoted by lower case bold Roman letters such as x, and all vectors are assumed to be column vectors. A superscript T denotes the transpose of a matrix or vector, so that xT will be a row vector. Uppercase bold Roman letters, such as M, denote matrices. The notation (w1,...,wM) denotes a row vector with M elements, while the corresponding column vector is written as w = (w1,...,wM)T.

The notation [a, b] is used to denote the closed interval from a to b, that is, the interval including the values a and b themselves, while (a, b) denotes the corresponding open interval, that is, the interval excluding a and b. Similarly, [a, b) denotes an interval that includes a but excludes b. For the most part, however, there will be little need to dwell on such refinements as whether the end points of an interval are included or not.

The M × M identity matrix (also known as the unit matrix) is denoted IM, which will be abbreviated to I where there is no ambiguity about its dimensionality. It has elements Iij that equal 1 if i = j and 0 if i ≠ j.

A functional is denoted f[y] where y(x) is some function. The concept of a functional is discussed in Appendix D.

The notation g(x) = O(f(x)) denotes that |g(x)/f(x)| is bounded as x → ∞. For instance, if g(x) = 3x² + 2, then g(x) = O(x²).
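As a concrete illustration of these conventions, here is a minimal sketch in Python. It is not part of the book; the use of NumPy, the value M = 3, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

# Illustrative sketch (assumes NumPy) of the notational conventions above.
M = 3

# Column vector w = (w1, ..., wM)^T, stored as an (M, 1) array so that the
# distinction from the row vector w^T = (w1, ..., wM) is explicit.
w = np.array([[1.0], [2.0], [3.0]])
w_T = w.T                        # row vector, shape (1, M)

# The M x M identity (unit) matrix I_M, with elements I_ij = 1 if i = j, else 0.
I_M = np.eye(M)
assert np.allclose(I_M @ w, w)   # multiplying by I_M leaves w unchanged

# Numerical check of the O(.) example: for g(x) = 3x^2 + 2 we have
# g(x) = O(x^2), since the ratio g(x)/x^2 stays bounded (it approaches 3).
for x in (10.0, 100.0, 1000.0):
    print(x, (3 * x**2 + 2) / x**2)
```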
The expectation of a function f(x, y) with respect to a random variable x is denoted by Ex[f(x, y)]. In situations where there is no ambiguity as to which variable is being averaged over, this will be simplified by omitting the suffix, for instance E[x]. If the distribution of x is conditioned on another variable z, then the corresponding conditional expectation will be written Ex[f(x)|z]. Similarly, the variance is denoted var[f(x)], and for vector variables the covariance is written cov[x, y]. We shall also use cov[x] as a shorthand notation for cov[x, x]. The concepts of expectations and covariances are introduced in Section 1.2.2.

If we have N values x1,...,xN of a D-dimensional vector x = (x1,...,xD)T, we can combine the observations into a data matrix X in which the nth row of X corresponds to the row vector xnT. Thus the n, i element of X corresponds to the ith element of the nth observation xn. For the case of one-dimensional variables we shall denote such a matrix by x, which is a column vector whose nth element is xn. Note that x (which has dimensionality N) uses a different typeface to distinguish it from x (which has dimensionality D).
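The data-matrix and expectation conventions can likewise be sketched in code. Again this is an illustrative example rather than anything from the book; NumPy, the random seed, and the sample sizes N and D are all assumptions chosen for the demonstration.

```python
import numpy as np

# Illustrative sketch (assumes NumPy): N observations of a D-dimensional
# vector x stacked into a data matrix X whose nth row is x_n^T, so that
# X[n, i] is the ith element of the nth observation x_n.
rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.normal(size=(N, D))              # data matrix, shape (N, D)
x_n = X[1, :]                            # one observation x_n (here the second row)

# Sample-based estimates corresponding to E[x] and cov[x] = cov[x, x]
# (the definitions themselves are introduced in Section 1.2.2).
mean_est = X.mean(axis=0)                # D-dimensional estimate of E[x]
cov_est = np.cov(X, rowvar=False)        # D x D estimate of cov[x]

# For one-dimensional variables the N observations form a column vector x
# of length N (distinguished by a different typeface in the book).
x_scalar_data = X[:, 0]

print(mean_est.shape, cov_est.shape, x_scalar_data.shape)
```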
Contents

Preface vii
Mathematical notation xi
Contents xiii

1 Introduction 1
1.1 Example: Polynomial Curve Fitting 4
1.2 Probability Theory 12
1.2.1 Probability densities 17
1.2.2 Expectations and covariances 19
1.2.3 Bayesian probabilities 21
1.2.4 The Gaussian distribution 24
1.2.5 Curve fitting re-visited 28
1.2.6 Bayesian curve fitting 30
1.3 Model Selection 32
1.4 The Curse of Dimensionality 33
1.5 Decision Theory 38
1.5.1 Minimizing the misclassification rate 39
1.5.2 Minimizing the expected loss 41
1.5.3 The reject option 42
1.5.4 Inference and decision 42
1.5.5 Loss functions for regression 46
1.6 Information Theory 48
1.6.1 Relative entropy and mutual information 55
Exercises 58
2 Probability Distributions 67
2.1 Binary Variables 68
2.1.1 The beta distribution 71
2.2 Multinomial Variables 74
2.2.1 The Dirichlet distribution 76
2.3 The Gaussian Distribution 78
2.3.1 Conditional Gaussian distributions 85
2.3.2 Marginal Gaussian distributions 88
2.3.3 Bayes’ theorem for Gaussian variables 90
2.3.4 Maximum likelihood for the Gaussian 93
2.3.5 Sequential estimation 94
2.3.6 Bayesian inference for the Gaussian 97
2.3.7 Student’s t-distribution 102
2.3.8 Periodic variables 105
2.3.9 Mixtures of Gaussians 110
2.4 The Exponential Family 113
2.4.1 Maximum likelihood and sufficient statistics 116
2.4.2 Conjugate priors 117
2.4.3 Noninformative priors 117
2.5 Nonparametric Methods 120
2.5.1 Kernel density estimators 122
2.5.2 Nearest-neighbour methods 124
Exercises 127

3 Linear Models for Regression 137
3.1 Linear Basis Function Models 138
3.1.1 Maximum likelihood and least squares 140
3.1.2 Geometry of least squares 143
3.1.3 Sequential learning 143
3.1.4 Regularized least squares 144
3.1.5 Multiple outputs 146
3.2 The Bias-Variance Decomposition 147
3.3 Bayesian Linear Regression 152
3.3.1 Parameter distribution 152
3.3.2 Predictive distribution 156
3.3.3 Equivalent kernel 159
3.4 Bayesian Model Comparison 161
3.5 The Evidence Approximation 165
3.5.1 Evaluation of the evidence function 166
3.5.2 Maximizing the evidence function 168
3.5.3 Effective number of parameters 170
3.6 Limitations of Fixed Basis Functions 172
Exercises 173
4 Linear Models for Classification 179
4.1 Discriminant Functions 181
4.1.1 Two classes 181
4.1.2 Multiple classes 182
4.1.3 Least squares for classification 184
4.1.4 Fisher’s linear discriminant 186
4.1.5 Relation to least squares 189
4.1.6 Fisher’s discriminant for multiple classes 191
4.1.7 The perceptron algorithm 192
4.2 Probabilistic Generative Models 196
4.2.1 Continuous inputs 198
4.2.2 Maximum likelihood solution 200
4.2.3 Discrete features 202
4.2.4 Exponential family 202
4.3 Probabilistic Discriminative Models 203
4.3.1 Fixed basis functions 204
4.3.2 Logistic regression 205
4.3.3 Iterative reweighted least squares 207
4.3.4 Multiclass logistic regression 209
4.3.5 Probit regression 210
4.3.6 Canonical link functions 212
4.4 The Laplace Approximation 213
4.4.1 Model comparison and BIC 216
4.5 Bayesian Logistic Regression 217
4.5.1 Laplace approximation 217
4.5.2 Predictive distribution 218
Exercises 220

5 Neural Networks 225
5.1 Feed-forward Network Functions 227
5.1.1 Weight-space symmetries 231
5.2 Network Training 232
5.2.1 Parameter optimization 236
5.2.2 Local quadratic approximation 237
5.2.3 Use of gradient information 239
5.2.4 Gradient descent optimization 240
5.3 Error Backpropagation 241
5.3.1 Evaluation of error-function derivatives 242
5.3.2 A simple example 245
5.3.3 Efficiency of backpropagation 246
5.3.4 The Jacobian matrix 247
5.4 The Hessian Matrix 249
5.4.1 Diagonal approximation 250
5.4.2 Outer product approximation 251
5.4.3 Inverse Hessian 252