IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 6, NO. 4, JULY 1995

Learning in Linear Neural Networks: A Survey

Pierre F. Baldi and Kurt Hornik, Member, IEEE

Manuscript received September 25, 1992; revised June 19, 1994. This work was supported in part by grants from NSF, AFOSR, and ONR. P. F. Baldi is with the Jet Propulsion Laboratory and Division of Biology, California Institute of Technology, Pasadena, CA 91109 USA. K. Hornik is with the Institut für Statistik und Wahrscheinlichkeitstheorie, Technische Universität Wien, A-1040 Vienna, Austria. IEEE Log Number 9409158.

Abstract: Networks of linear units are the simplest kind of networks, where the basic questions related to learning, generalization, and self-organization can sometimes be answered analytically. We survey most of the known results on linear networks, including: 1) backpropagation learning and the structure of the error function landscape, 2) the temporal evolution of generalization, and 3) unsupervised learning algorithms and their properties. The connections to classical statistical ideas, such as principal component analysis (PCA), are emphasized, as well as several simple but challenging open questions. A few new results are also spread across the paper, including an analysis of the effect of noise on backpropagation networks and a unified view of all unsupervised algorithms.

[Fig. 1. The basic network in the autoassociative case (m = n): n input units, p hidden units, and n output units, with weight matrices A (input to hidden) and B (hidden to output).]

I. INTRODUCTION

THIS paper addresses the problems of supervised and unsupervised learning in layered networks of linear units and, together with a few new results, reviews most of the recent literature on the subject. One may expect the topic to be fairly restricted, yet it is in fact quite rich and far from being exhausted. Since the first approximations of biological neurons using threshold gates [1], the nonlinear aspects of neural computations and hardware have often been emphasized and linear networks dismissed as uninteresting for being able to express linear input-output maps only. Furthermore, multiple layers of linear units can always be collapsed by multiplying the corresponding weight matrices. So why bother? Nonlinear computations are obviously extremely important, but these arguments should be regarded with suspicion; by stressing the input-output relations only, they miss the subtle problems of dynamics, structure, and organization that normally arise during learning and plasticity, even in simple linear systems. There are other reasons why linear networks deserve careful attention. General results in the nonlinear case are often absent or difficult to derive analytically, whereas the linear case can often be analyzed in mathematical detail. As in the theory of differential equations, the linear setting should be regarded as the first simple case to be studied. More complex situations can often be investigated by linearization, although this has not been attempted systematically in neural networks, for instance in the analysis of backpropagation learning. In backpropagation, learning is often started with zero or small random initial weights and biases. Thus, at least during the initial phase of training, the network is operating in its linear regime. Even when training is completed, one often finds several units in the network which are operating in their linear range.
From the standpoint of theoretical biology, it has been argued that certain classes of neurons may be operating most of the time in a linear or quasi-linear regime, and linear input-output relations seem to hold for certain specific biological circuits (see [2] for an example). Finally, the study of linear networks leads to new interesting questions, insights, and paradigms which could not have been guessed in advance, and to new ways of looking at certain classical statistical techniques.

To begin with, we shall consider a linear network with an n-p-m architecture comprising one input layer, one hidden layer, and one output layer with n, p, and m units, respectively (Fig. 1). The more general case, with, for instance, multiple hidden layers, can be reduced to this simple setting as we shall see. A will usually denote the p × n matrix connecting the input to the middle layer and B the m × p matrix of connection weights from the middle layer to the output. Thus, for instance, b_ij represents the strength of the coupling between the jth hidden unit and the ith output unit (double indexes are always in the post-presynaptic order). The network therefore computes the linear function y = BAx. In the usual learning-from-examples setting, we assume that a set of n-dimensional input patterns x_t (1 ≤ t ≤ T) is given together with a corresponding set of m-dimensional target output patterns y_t (1 ≤ t ≤ T) (all vectors are assumed to be column vectors). X = [x_1, ..., x_T] and Y = [y_1, ..., y_T] are the n × T and m × T matrices having the patterns as their columns. Because of the need for target outputs, this form of learning will also be called supervised. For simplicity, unless otherwise stated, all the patterns are assumed to be centered (i.e., ⟨x⟩ = ⟨y⟩ = 0). The symbol "⟨·⟩" will be used for averages over the set of patterns or sometimes over the pattern distribution, depending on the context. The approximation of one by the other is a central problem in statistics, but is not our main concern here. The environment is supposed to be stationary, but the results could be extended to a slowly varying environment to deal with plasticity issues. Throughout this paper, learning will often be based on the minimization of an error function E depending on the synaptic weights.
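As a concrete illustration of this setup (an editorial addition, not part of the original survey), the following NumPy sketch builds an n-p-m linear network on synthetic, centered patterns, evaluates the squared-error cost ⟨||y - BAx||²⟩ used throughout the paper, and checks that stacking further linear layers collapses into a single matrix product. All sizes and variable names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p, m, T = 8, 3, 5, 200           # input, hidden, output sizes; number of patterns
X = rng.standard_normal((n, T))     # input patterns x_t as columns
Y = rng.standard_normal((m, T))     # target patterns y_t as columns
X -= X.mean(axis=1, keepdims=True)  # center the patterns so that <x> = <y> = 0
Y -= Y.mean(axis=1, keepdims=True)

A = rng.standard_normal((p, n))     # input-to-hidden weights (p x n)
B = rng.standard_normal((m, p))     # hidden-to-output weights (m x p)

def E(A, B, X, Y):
    """Backpropagation cost <||y - B A x||^2>, averaged over the patterns."""
    R = Y - B @ A @ X
    return np.mean(np.sum(R ** 2, axis=0))

print("E(A, B) =", E(A, B, X, Y))

# Extra linear layers collapse into a single matrix of rank at most p:
C = rng.standard_normal((4, m))     # a further linear layer, for illustration
W = C @ B @ A
assert np.allclose(C @ (B @ (A @ X)), W @ X)
```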
In the main case of backpropagation, the error function is

E(A, B) = ⟨||y - BAx||²⟩    (1)

where ||u|| represents the Euclidean norm of the vector u. When no target outputs are provided, the learning (which then must be based on criteria to be specified, such as the maximization of the output variance) is unsupervised. An important special case of unsupervised learning is the case of autoassociation, when the input is used as a teacher (i.e., y_t = x_t). This is also called autoencoding or identity mapping in the literature.

Learning rules are algorithms for slowly altering the connection weights to achieve a desirable goal such as the minimization of an error function. Often, three different versions of the same rule have been given: the "on-line" version where the modification is calculated after the presentation of each pattern, the "off-line" version where the modifications are averaged over the cycle of all patterns, and the "continuous" version where the discrete changes induced by the "off-line" algorithm are approximated continuously by a differential equation governing the evolution of the weights in time. In some cases, the three formulations can be shown to lead to essentially the same results.

It will be convenient to use the notation Σ_uv = ⟨uv'⟩, where the prime denotes transposition of matrices. If both ⟨u⟩ and ⟨v⟩ are zero, Σ_uv is the covariance matrix of u and v. Σ_xx, for instance, is a real n × n symmetric nonnegative definite matrix. Hence, its eigenvalues can be ordered as λ_1 ≥ ... ≥ λ_n ≥ 0. For mathematical simplicity, we shall often assume that in fact λ_1 > ... > λ_n > 0. This should not be regarded as a very restrictive assumption, since this condition can always be enforced by, at worst, perturbing the data by infinitesimal amounts and attributing these perturbations to "noise." Many conclusions are only slightly different when some eigenvalues coincide.

A good familiarity with linear algebra and basic calculus on the part of the reader should be sufficient to follow the paper. All the statistical techniques required to understand some of the results are briefly reviewed in the second section. These include least-squares regression, principal component analysis, and discriminant analysis. In Section III, we treat the case of supervised learning with backpropagation and the corresponding autoassociative special case. We study the landscape of the error function E of (1), its connections to the previously mentioned statistical techniques, and several consequences and generalizations, including noisy and deep networks. In Section IV, we study the problems of validation, generalization, and overfitting in a simple one-layer network trained to learn the identity map. Under some assumptions, we give a complete description of the evolution of the validation error as a function of training time. Section V covers a variety of unsupervised learning algorithms, based on variance maximization/minimization by Hebbian or anti-Hebbian learning or other error functions. Some of the more technical proofs are deferred to the Appendix.

II. MATHEMATICAL BACKGROUND

A. Optimization of Quadratic Forms over Spheres

Let S be a symmetric n × n matrix. Then all the eigenvalues λ_i of S are real and can be ordered in the form λ_1 ≥ λ_2 ≥ ... ≥ λ_n with corresponding normalized eigenvectors u_1, ..., u_n.
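The quadratic-form result of Section II-A is easy to verify numerically. The sketch below is an illustrative addition with a synthetic symmetric matrix S: it checks that a'Sa over the sphere of radius ρ is maximized at ±ρu_1 with value λ_1ρ², minimized at ±ρu_n, and never exceeded at random points of the sphere.

```python
import numpy as np

rng = np.random.default_rng(1)

n, rho = 6, 2.0
M = rng.standard_normal((n, n))
S = (M + M.T) / 2                       # an arbitrary symmetric matrix

lam, U = np.linalg.eigh(S)              # eigh sorts eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]          # reorder so that lam[0] >= ... >= lam[-1]

def E(a):
    return a @ S @ a                    # the quadratic form a'Sa

print("E(rho*u1) =", E(rho * U[:, 0]), "  lambda_1*rho^2 =", lam[0] * rho ** 2)
print("E(rho*un) =", E(rho * U[:, -1]), "  lambda_n*rho^2 =", lam[-1] * rho ** 2)

# No point of the sphere of radius rho exceeds lambda_1 * rho^2:
for _ in range(1000):
    a = rng.standard_normal(n)
    a *= rho / np.linalg.norm(a)
    assert E(a) <= lam[0] * rho ** 2 + 1e-9
```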
Consider the problem of maximizing the quadratic form E(a) = a'Sa over the sphere of radius ρ centered at the origin (||a|| ≤ ρ). In geometry, it is well known (see, for instance, [3]) that the maximum of E is then reached on the surface of the sphere in the direction of the first eigenvector, that is, at the points ±ρu_1, where E(±ρu_1) = λ_1ρ². If λ_1 > λ_2, ±ρu_1 are the only two solutions. Similarly, the maximum of E over the intersection of the sphere with the linear space orthogonal to u_1 is reached at ±ρu_2, and so forth. Finally, the minimum of E over the entire sphere is obtained at ±ρu_n. All these properties are easily derived by decomposing a as a = ∑_i α_i u_i and noticing that E(a) = ∑_i λ_i α_i².

B. Singular Value Decomposition

Let Z be an arbitrary k × l matrix with rank r. Then there exist numbers σ_1 ≥ ... ≥ σ_r > 0, the singular values of Z, an orthogonal k × k matrix U, and an orthogonal l × l matrix V such that S = U'ZV is a k × l diagonal matrix of the form

S = [ D  0 ]
    [ 0  0 ]

where D = diag(σ_1, ..., σ_r) is the diagonal matrix with entries σ_1, ..., σ_r. The decomposition

Z = USV'

is called the singular value decomposition (SVD) of Z (it is not necessarily unique). The matrices U and V in the SVD have the following meaning. As Z'ZV = VS'U'USV'V = VS'S = V diag(σ_1², ..., σ_r², 0, ..., 0), the columns of V are unit-length, mutually perpendicular eigenvectors of Z'Z, and σ_1², ..., σ_r² are the nonzero eigenvalues of Z'Z. Similarly, the columns of U are unit-length, mutually perpendicular eigenvectors of ZZ'. With the aid of the SVD, the pseudoinverse of Z can easily be given explicitly. If we write

S+ = [ D^{-1}  0 ]
     [ 0       0 ]

then Z+ = VS+U' is the pseudoinverse of Z (see, for instance, [4] and [5] for more details on SVD and pseudoinverses).

C. Orthogonal Projections

If ℒ is a linear subspace, we denote by P_ℒx the orthogonal projection of a vector x onto ℒ and by P_ℒ⊥x = Q_ℒx = x - P_ℒx its projection onto the orthogonal complement of ℒ. If ℒ has dimension l and is spanned by the linearly independent vectors z_1, ..., z_l, then P_ℒx = P_Z x, where Z = [z_1, ..., z_l] and P_Z = Z(Z'Z)^{-1}Z'. In particular, if the vectors z_i are mutually perpendicular unit vectors, the projection of x is simply ZZ'x = z_1z_1'x + ... + z_lz_l'x.
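The SVD-based expressions for the pseudoinverse and for orthogonal projections can also be checked directly. The sketch below is illustrative (it relies on numpy.linalg.svd and numpy.linalg.pinv on a synthetic rank-deficient matrix): it builds Z+ = VS+U' and verifies that ZZ+ is the orthogonal projection onto the column space of Z, equal to u_1u_1' + ... + u_ru_r'.

```python
import numpy as np

rng = np.random.default_rng(2)

k, l, r = 7, 5, 3
Z = rng.standard_normal((k, r)) @ rng.standard_normal((r, l))   # k x l matrix of rank r

U, s, Vt = np.linalg.svd(Z)             # Z = U @ S @ Vt, singular values decreasing

# Pseudoinverse Z+ = V S+ U', inverting only the (numerically) nonzero singular values
tol = max(k, l) * np.finfo(float).eps * s[0]
s_plus = np.array([1.0 / x if x > tol else 0.0 for x in s])
S_plus = np.zeros((l, k))
S_plus[: len(s), : len(s)] = np.diag(s_plus)
Z_plus = Vt.T @ S_plus @ U.T
assert np.allclose(Z_plus, np.linalg.pinv(Z))

# P = Z Z+ is the orthogonal projection onto the column space of Z
P = Z @ Z_plus
assert np.allclose(P, P.T) and np.allclose(P @ P, P)     # symmetric and idempotent
assert np.allclose(P @ Z, Z)                             # leaves the columns of Z fixed
assert np.allclose(P, U[:, :r] @ U[:, :r].T)             # equals u1 u1' + ... + ur ur'
```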
If the matrix Z is not full rank, P_Z can still be written as

P_Z = ZZ+ = u_1u_1' + ... + u_ru_r'

where u_1, ..., u_r are the first r columns of the U matrix in the SVD of Z (and r is the rank of Z).

Consider now the problem of finding a vector w which minimizes F(w) = ||c - Mw||² for a given vector c and a matrix M. In other words, we are looking for the vector in the image space of M (the space spanned by the columns of M) which is closest to c. Clearly, by the projection theorem, this is the orthogonal projection of c onto the image of M. In particular, if M is of full rank, then at the optimum w we must have M(M'M)^{-1}M'c = Mw or, equivalently, w = (M'M)^{-1}M'c. The Hessian of F is 2M'M; hence, if M is full rank, the Hessian is positive definite and the problem is strictly convex without any other critical points.

D. Least-Squares Regression

The problem of linear regression is the following. Given a set of n-dimensional input vectors x_1, ..., x_T and a set of m-dimensional target vectors y_1, ..., y_T, find an m × n matrix A which minimizes E(A) = ⟨||y - Ax||²⟩. In other words, linear regression is exactly the usual learning problem in a linear network without any hidden units. Since the output units are completely uncoupled, the connection weights for each of them can be synthesized separately and therefore one needs only to consider the case m = 1, where we write A = a'. In this case, the problem has a simple geometrical interpretation: find a hyperplane through the origin in (n + 1)-dimensional space which best fits (in the least-squares sense) a cloud of T points with coordinates (x_1', y_1)', ..., (x_T', y_T)'. Now

E(a) = ⟨(y - a'x)²⟩ = a'Σ_xx a - 2Σ_yx a + ⟨y²⟩

and the gradient of E with respect to a is

∇E = 2Σ_xx a - 2Σ_xy.

E is continuous, differentiable, and bounded below by zero, and therefore it must reach its minimum for a vector a satisfying Σ_xx a = Σ_xy. If Σ_xx is positive definite, then there is a unique solution given by

a = Σ_xx^{-1} Σ_xy    (2)

and, in addition, E is strictly convex (with Hessian 2Σ_xx) and so without any local minima (or even without any other critical point). The landscape is therefore as simple as possible, and this remains true even if some of the connections are forced in advance to take some fixed values, typically zero in the case of "local" connectivity (this introduces linear, thus convex, restrictions on the set of possible weights). When m > 1, everything goes through mutatis mutandis. In the case where Σ_xx is positive definite, the unique optimal A is called the slope matrix of the regression of y on x and is given by

A = Σ_yx Σ_xx^{-1}

which generalizes (2), taking into account that A = a' in one dimension. (Formally, to reduce the m-dimensional case to the one-dimensional one, it is sufficient to notice that E can be rewritten as E(A) = ||vec(Y) - (X' ⊗ I)vec(A)||²/T, where ⊗ denotes the Kronecker product of two matrices and "vec" is an operator which transforms a matrix into a vector by stacking its columns one above the other. See [6] for details.) In particular, even if the connectivity between the input and the output is local, the problem remains convex and without local minima and therefore, in principle, easy to learn by a gradient-descent type of mechanism. Finally, notice that if for an input x_t we approximate the corresponding output pattern y_t by its linear estimate ŷ_t = Σ_yx Σ_xx^{-1} x_t, then the covariance matrix of the estimates is given by Σ = ⟨ŷŷ'⟩ = Σ_yx Σ_xx^{-1} Σ_xy.
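A brief numerical sketch of the regression result (an editorial addition with synthetic data): the slope matrix Σ_yx Σ_xx^{-1}, estimated from sample covariances of centered patterns, coincides with the ordinary least-squares solution returned by a generic solver.

```python
import numpy as np

rng = np.random.default_rng(3)

n, m, T = 6, 4, 500
X = rng.standard_normal((n, T))                       # inputs as columns
A_true = rng.standard_normal((m, n))
Y = A_true @ X + 0.1 * rng.standard_normal((m, T))    # noisy linear targets

X -= X.mean(axis=1, keepdims=True)                    # center the patterns
Y -= Y.mean(axis=1, keepdims=True)

Sxx = X @ X.T / T                                     # sample covariance Sigma_xx
Syx = Y @ X.T / T                                     # sample cross-covariance Sigma_yx

A_hat = Syx @ np.linalg.inv(Sxx)                      # slope matrix Sigma_yx Sigma_xx^{-1}

# The same solution from an ordinary least-squares solver (rows of A fitted jointly)
A_lstsq = np.linalg.lstsq(X.T, Y.T, rcond=None)[0].T
assert np.allclose(A_hat, A_lstsq)

print("mean squared error:", np.mean(np.sum((Y - A_hat @ X) ** 2, axis=0)))
```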
E. Principal Component Analysis

Suppose we are given a collection of T objects. For each object x_t, the measurements of the same n characteristics x_{1,t}, ..., x_{n,t} are available. Assume it is desired to extract some "structure" or "main features" from this collection of data. For efficient classification, it is obviously useful to compress the high-dimensional input data into something low dimensional without discarding too much relevant information. Of course, there are several different techniques for feature extraction and data compression. One of the simplest and most general-purpose ones is a statistical method known as principal component analysis (PCA).

By possibly subtracting the average ⟨x⟩, we can think of the data set x_{i,t} (1 ≤ i ≤ n, 1 ≤ t ≤ T) as a cloud of T points in n-dimensional Euclidean space centered around the origin. To capture the main features of the data set, PCA is looking for directions along which the dispersion or variance of the point cloud is maximal, that is, looking for a subspace ℒ such that the projection of the points x_t onto ℒ has maximal variance. If ℒ is the line spanned by the unit vector a, the projection P_ℒx is given by P_a x = aa'x, with squared length ||P_a x||² = (a'x)² = a'xx'a. Hence, the average dispersion of the data set in the direction of the line is ⟨||P_a x||²⟩ = ⟨a'xx'a⟩ = a'⟨xx'⟩a = a'Σ_xx a, where Σ_xx = ⟨xx'⟩ is the data covariance matrix. PCA looks for a unit vector a* which maximizes a'Σ_xx a over the set of all unit vectors. If λ_1 > ... > λ_n > 0 are the eigenvalues of Σ_xx with eigenvectors u_1, ..., u_n, then, by the previous result on quadratic forms in Section II-A, we know that a* = u_1 (or equivalently -u_1) is the answer.

To sum up, PCA starts by finding the direction in which the dispersion of the cloud is maximal, which is the direction u_1 of the first eigenvector of the data covariance matrix. The first "feature" which is extracted is the first principal component u_1'x_t. The component of the data "explained" by the first principal component is the projection onto the line spanned by u_1. What remains unexplained is the dispersion of the residual x_t - P_{u_1}x_t, which is just the projection Q_{u_1}x_t of x_t onto the orthogonal complement of u_1. In a second step, we proceed as before, but with the points x_t replaced by Q_{u_1}x_t. That is, we look for straight lines ℒ perpendicular to the line spanned by u_1 such that the projections of the points Q_{u_1}x_t onto ℒ have maximal variance. This amounts to finding a unit vector b*, perpendicular to u_1, which maximizes b'Σ_xx b over all unit vectors perpendicular to u_1.
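The two-step construction just described can be mirrored in a few lines of NumPy (an illustrative addition with synthetic data): the first principal direction is the top eigenvector of the sample covariance, and after deflating along u_1, the direction of maximal residual variance is the second eigenvector u_2.

```python
import numpy as np

rng = np.random.default_rng(4)

n, T = 5, 1000
M = rng.standard_normal((n, n))               # mixing matrix to create correlated data
X = M @ rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)            # center the cloud

Sxx = X @ X.T / T                             # sample covariance Sigma_xx
lam, U = np.linalg.eigh(Sxx)
lam, U = lam[::-1], U[:, ::-1]                # lam[0] >= lam[1] >= ...

u1 = U[:, 0]                                  # first principal direction
z1 = u1 @ X                                   # first principal component u1' x_t
print("variance of first component:", z1.var(), "  lambda_1:", lam[0])

# Deflation: project onto the orthogonal complement of u1 and check that the
# direction of maximal residual variance is the second eigenvector u2.
Q = np.eye(n) - np.outer(u1, u1)
R = Q @ X                                     # residuals Q_{u1} x_t
lam_r, U_r = np.linalg.eigh(R @ R.T / T)
u2 = U_r[:, -1]                               # top eigenvector of residual covariance
assert min(np.linalg.norm(u2 - U[:, 1]), np.linalg.norm(u2 + U[:, 1])) < 1e-6
```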
Again, by the previous result, we know the answer is b* = ±u_2, and so forth. At the kth step, we look for lines ℒ_k perpendicular to the space spanned by u_1, ..., u_{k-1} such that the projections of the points x_t onto ℒ_k have maximal dispersion. This is achieved by choosing ℒ_k as the line spanned by u_k.

After the completion of p steps, we extract the first p principal components u_1'x_t, ..., u_p'x_t and reduce x_t to its projection onto the hyperplane spanned by the first p eigenvectors. One may be interested in asking whether this is the best possible data reduction of the kind under consideration, that is, the best possible projection of the data onto a p-dimensional hyperplane ℋ in the sense that the projections of the data onto ℋ have maximal variance. After all, a better result might have been achieved by choosing the hyperplane in a single step. This, however, is not the case.

Among all p-dimensional hyperplanes ℋ, the one spanned by the first p principal vectors u_1, ..., u_p is the hyperplane such that ⟨||P_ℋx||²⟩ is maximal. Equivalently, it is the hyperplane ℋ which minimizes the average projection error ⟨||x - P_ℋx||²⟩.

It is therefore possible to incrementally build the PCA feature extractor. Since ℋ is the best p-dimensional hyperplane we can fit to the n-dimensional point cloud, the "flatter" the cloud the better the fit. It is worth investigating how good the fit is, that is, how much of the variance in the data set actually is explained by the first p principal components. This is easily computed, for the variance of the ith component is given by

⟨||P_{u_i}x||²⟩ = ⟨(u_i'x)²⟩ = u_i'Σ_xx u_i = λ_i u_i'u_i = λ_i.

The total variance being equal to the sum of all the eigenvalues of Σ_xx, the proportion of total variance explained by the first p principal components equals (λ_1 + ... + λ_p)/(λ_1 + ... + λ_n).

In fact, PCA performs "best data compression" among a wider class of methods. Let us write U_p = [u_1, ..., u_p] for the matrix having the first p normalized eigenvectors of Σ_xx as its columns and let us stack the first p features u_1'x_t, ..., u_p'x_t extracted by PCA into a column vector z_t. Then z_t = U_p'x_t and P_{U_p}x_t = U_pU_p'x_t = U_pz_t. Hence, PCA is one method that linearly compresses n-dimensional inputs x_t into p-dimensional vectors z_t for some p < n, that is, z = Ax for a suitable p × n matrix A. Linear reconstruction of the data can then be achieved by approximating x_t by Bz_t = BAx_t for some suitable n × p matrix B.

Among all p × n matrices A and n × p matrices B, optimal linear data compression (in the sense that the average reconstruction error ⟨||x - BAx||²⟩ is minimized) is achieved if and only if the global map W = BA equals the orthogonal projection P_{U_p} onto the hyperplane spanned by the first p eigenvectors of Σ_xx.

Finally, computing the covariance of two principal components gives that for i ≠ j

⟨(u_i'x)(u_j'x)⟩ = ⟨u_i'xx'u_j⟩ = u_i'⟨xx'⟩u_j = u_i'(λ_j u_j) = 0.

Thus different components are uncorrelated, and we can think of the transformation of x_t into the vector of n principal components [u_1'x_t, ..., u_n'x_t]' as an orthogonal transformation of the Euclidean space, such that in the new system of coordinates, the components of the points in the cloud are uncorrelated and with decreasing variance. Again, if only the first few coordinates in the new system vary significantly, we may approximately locate points by giving only these few coordinates.
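The following sketch (an illustrative addition with synthetic data) checks two of the statements above: the proportion of variance captured by the first p components equals (λ_1 + ... + λ_p)/(λ_1 + ... + λ_n), and the reconstruction W = U_pU_p' is not beaten by randomly chosen rank-p maps BA, its error being the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(5)

n, p, T = 6, 2, 2000
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)

Sxx = X @ X.T / T
lam, U = np.linalg.eigh(Sxx)
lam, U = lam[::-1], U[:, ::-1]
Up = U[:, :p]                                   # first p eigenvectors

# Proportion of total variance explained by the first p components
explained = lam[:p].sum() / lam.sum()
Z = Up.T @ X                                    # p-dimensional codes z_t = Up' x_t
print("explained:", explained, "  empirical:", (Z ** 2).sum() / (X ** 2).sum())

def recon_error(W):
    """Average reconstruction error <||x - W x||^2> for a linear map W."""
    return np.mean(np.sum((X - W @ X) ** 2, axis=0))

best = recon_error(Up @ Up.T)                   # W = BA = P_{U_p}
for _ in range(200):                            # random rank-p maps never do better
    A = rng.standard_normal((p, n))
    B = rng.standard_normal((n, p))
    assert recon_error(B @ A) >= best - 1e-9
print("optimal reconstruction error:", best,
      "  sum of discarded eigenvalues:", lam[p:].sum())
```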
PCA can also be examined from an information-theoretic standpoint and shown to be optimal, under simple assumptions, for a different measure. More precisely, consider a transmission channel (in our case, one can think of the network connecting the input units to the hidden units) with n-dimensional centered input vectors having a Gaussian distribution with covariance matrix Σ_xx = ⟨xx'⟩. The outputs of the channel are constrained to be p-dimensional vectors of the form y = Lx, for some p × n matrix L (and, without any loss of generality, we can assume that L has rank p, p < n). Hence, y is also Gaussian with covariance matrix LΣ_xxL'. Classically, the differential entropy of x is given by (see, for instance, [7] for more details)

H(x) = -∫ p(x) log p(x) dx = (1/2) log [(2πe)^n det(Σ_xx)]

where p(x) is the Gaussian density function, and similarly

H(y) = (1/2) log [(2πe)^p det(LΣ_xxL')].

The conditional distribution of x given y (see, for instance, [8]) is normal with mean

μ_{x·y} = Σ_xy Σ_yy^{-1} y

and covariance matrix

Σ_{xx·y} = Σ_xx - Σ_xy Σ_yy^{-1} Σ_yx.

It can be shown that H(x|y), the conditional entropy of x given y (i.e., the entropy of the conditional distribution), is given by

H(x|y) = (1/2) log ((2πe)^{n-p} γ_1 ... γ_{n-p})

where γ_1 ≥ ... ≥ γ_{n-p} > 0 are the nonzero eigenvalues of Σ_{xx·y}. As the entropy is one way of measuring our uncertainty, it is desirable to choose L so as to minimize H(x|y). One can show that the optimal L is of the form L = CU_p', where C is an invertible p × p matrix and U_p = [u_1, ..., u_p]. In particular, this choice also maximizes the information that y conveys about x, measured by the mutual information I(x, y) defined to be

I(x, y) = H(x) - H(x|y)

with value

I_PCA(x, y) = (1/2) log ((2πe)^p λ_1 ... λ_p).

Thus, at least in the Gaussian setting, up to trivial transformations, the optimal linear map maximizing the mutual information is the principal component analyzer. Finally, PCA can also be connected to optimal inference methods (see [9]).
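Under the Gaussian assumptions above, these entropy formulas can be evaluated numerically. The sketch below is an editorial addition: the covariance and the comparison maps are synthetic, and the nonzero eigenvalues of Σ_{xx·y} are taken as its largest n - p eigenvalues. It checks that L = U_p' attains I_PCA(x, y) = (1/2) log((2πe)^p λ_1 ... λ_p) and that randomly drawn rank-p maps convey no more information.

```python
import numpy as np

rng = np.random.default_rng(6)

n, p = 5, 2
M = rng.standard_normal((n, n))
Sxx = M @ M.T + np.eye(n)                     # a positive definite covariance
lam, U = np.linalg.eigh(Sxx)
lam, U = lam[::-1], U[:, ::-1]

def mutual_information(L):
    """I(x,y) = H(x) - H(x|y) for Gaussian x and y = Lx, with L of full rank p."""
    Syy = L @ Sxx @ L.T
    Sxy = Sxx @ L.T
    Scond = Sxx - Sxy @ np.linalg.inv(Syy) @ Sxy.T            # Sigma_{xx.y}, rank n-p
    gamma = np.sort(np.linalg.eigvalsh(Scond))[::-1][: n - p]  # nonzero eigenvalues
    H_x = 0.5 * np.log((2 * np.pi * np.e) ** n * np.prod(lam))
    H_x_given_y = 0.5 * np.log((2 * np.pi * np.e) ** (n - p) * np.prod(gamma))
    return H_x - H_x_given_y

I_pca = mutual_information(U[:, :p].T)        # L = Up'
print("I_PCA:", I_pca,
      "  formula:", 0.5 * np.log((2 * np.pi * np.e) ** p * np.prod(lam[:p])))

for _ in range(200):                          # random rank-p maps convey no more
    L = rng.standard_normal((p, n))
    assert mutual_information(L) <= I_pca + 1e-9
```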
To illustrate the PCA feature extraction technique, consider the "open/closed book" data set in Mardia et al. [10, Table 1.2.1, p. 3f]. The data consist of the scores of T = 88 students on examinations in mechanics, vectors, algebra, analysis, and statistics (i.e., n = 5, where the first two exams were closed book and the other three were open book). For each exam, the best possible score was 100. It is found that the average score ⟨x⟩ equals (39.0, 50.6, 50.6, 46.7, 42.3)' and that the eigenvalues of the covariance matrix Σ_xx are given by λ_1 = 679.2, λ_2 = 199.8, λ_3 = 102.6, λ_4 = 83.7, and λ_5 = 31.8. Hence, the first two principal components already explain 80% of the variance in the data (and 91% is achieved with the first three). The first two eigenvectors are u_1 = (0.51, 0.37, 0.35, 0.45, 0.53)' and u_2 = (0.75, 0.21, -0.08, -0.30, -0.55)'. These findings can easily be interpreted. The authors conclude that "...the first principal component gives positive weight to all the variables and thus represents an average grade. On the other hand, the second principal component represents a contrast between the open-book and closed-book examinations..." For example, the scores and first two principal components of the two best students are (77, 82, 67, 67, 81), 66.4, and 6.4 and (63, 78, 80, 70, 81), 63.7, and -6.4. Even without looking at the individual test scores, by considering only the first two principal components one would conclude that the overall performances of the two students are very similar, but the first student did better on closed-book and the second one better on open-book exams.

In conclusion, PCA is optimal in the least-mean-square sense and can serve two purposes: data compression by projecting high-dimensional data into a lower-dimensional space, and feature extraction by revealing, through the principal components, relevant but unexpected structure hidden in the data (although an interpretation of these features in terms of the original variables may not always be straightforward).

F. Mean Square Classifier and Discriminant Analysis

Consider now the problem where the patterns x_t must be classified into m classes C_1, ..., C_m, with, in general, m ≪ n. Thus for every input pattern x_t, there is a binary target output pattern y_t = (0, ..., 1, ..., 0)', where y_{i,t} = 1 if and only if x_t belongs to C_i. One possible classification method consists in finding an m × n matrix L such that ⟨||y - Lx||²⟩ is minimal. Needless to say, this is a special case of least-squares regression, and, as we have seen, under the usual assumptions the optimal L is given by L = Σ_yx Σ_xx^{-1} and is called the mean-square classifier.
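A minimal sketch of the mean-square classifier follows (an illustrative addition with synthetic clustered data, not from the paper): targets are one-hot class indicators, L = Σ_yx Σ_xx^{-1} is estimated from centered data, and a pattern is assigned to the class with the largest output. Adding the means back is one simple way to handle the centering assumption for raw inputs.

```python
import numpy as np

rng = np.random.default_rng(7)

n, m, T = 10, 3, 600
# Synthetic data: three Gaussian clusters with different means
means = 3.0 * rng.standard_normal((m, n))
labels = rng.integers(0, m, size=T)
X = means[labels].T + rng.standard_normal((n, T))       # patterns as columns
Y = np.eye(m)[labels].T                                 # one-hot targets as columns

x_bar = X.mean(axis=1, keepdims=True)                   # the text assumes centered
y_bar = Y.mean(axis=1, keepdims=True)                   # patterns, so center both
Xc, Yc = X - x_bar, Y - y_bar

Sxx = Xc @ Xc.T / T
Syx = Yc @ Xc.T / T
L = Syx @ np.linalg.inv(Sxx)                            # mean-square classifier

# Classify by the largest output (adding the means back for uncentered inputs)
scores = L @ (X - x_bar) + y_bar
pred = scores.argmax(axis=0)
print("training accuracy:", (pred == labels).mean())
```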
In many applications n is very large compared to m, and therefore it becomes useful to first reduce the dimensionality of the input data. One is thus led to find a linear subspace of dimension p such that, when projected onto this subspace, the patterns x_t fall as much as possible into well-defined, separated clusters facilitating the classification. This problem of finding an optimal projection is similar to the one encountered in PCA. Because of the clustering, however, a new measure must be introduced to compare different projections. Consider a projection z = C'x, where C is an n × p matrix. The total dispersion (variation) in the x-sample can be decomposed into the sum of within-class dispersions and between-class dispersions. When the x's are centered, the total dispersion is Σ_xx, and the dispersion between classes can be shown to be Σ_xy Σ_yy^{-1} Σ_yx. Upon projecting the patterns, the corresponding total and between-classes dispersions of the z_t patterns become C'Σ_xx C and C'Σ_xy Σ_yy^{-1} Σ_yx C.

A projection is optimal if the between-classes variation of the z's is as large as possible relative to the total variation. Different cost functions can be introduced at this stage. If the size of a variation matrix is measured by its determinant (the determinant of a matrix measures the volume of the image of a unit cube under the corresponding linear map), then we are led to the problem of finding an n × p matrix C maximizing the ratio

E(C) = det(C'Σ_xy Σ_yy^{-1} Σ_yx C) / det(C'Σ_xx C).    (3)

The solution is well known.

All optimal matrices, the so-called discriminant analysis (DA) matrices, are of the form H_pR, where R is an arbitrary p × p invertible matrix and H_p has the first p eigenvectors of Σ_xx^{-1} Σ_xy Σ_yy^{-1} Σ_yx as its columns.

It is not easy to see what the solutions look like in general. There is one case, however, where all optimal solutions can easily be described.

When p = r = rank(Σ_xy), an n × p matrix C is a DA matrix if and only if the space spanned by the columns of C coincides with the space spanned by the rows of the mean-square classifier Σ_yx Σ_xx^{-1}.

See, for instance, Kshirsagar [8] for more details on DA.

III. BACKPROPAGATION

A. The Landscape Properties of E

We now consider the setting described in the Introduction, where the learning procedure is based on the minimization of the cost function E(A, B). A complete description of the landscape properties of E is given in Baldi and Hornik [6]. We shall briefly review the most salient features. E is best described in terms of its critical points, that is, the points where ∂E/∂a_ij = ∂E/∂b_ij = 0. It is first important to observe that if C is any p × p invertible matrix, then E(A, B) = E(CA, BC^{-1}). Therefore, at any point, E really depends on the global map W = BA rather than on A and B. For instance, there is an infinite family of pairs of matrices (A, B) corresponding to any critical point. Unlike the simple case of linear regression, however, W cannot be chosen arbitrarily: the network architecture constrains W to have at most rank p.

The remarkable property of the landscape of E is the absence of local minima, in spite of the fact that E is not convex (nor is the set of all matrices of rank at most p). E is characterized by a unique global minimum (up to multiplication by a matrix C). All other critical points are saddle points. The structure of the critical points can be described completely. More precisely, assume for simplicity that p ≤ m ≤ n and that Σ = Σ_yx Σ_xx^{-1} Σ_xy, the covariance matrix of the linear estimates ŷ_t (see Section II-D), is full rank with m distinct eigenvalues λ_1 > ... > λ_m and corresponding orthonormal eigenvectors u_1, ..., u_m. If ℐ = {i_1, ..., i_p} with 1 ≤ i_1 < ... < i_p ≤ m is any ordered p-index set, let U_ℐ denote the matrix [u_{i_1}, ..., u_{i_p}]. Then two full-rank matrices