A and B define a critical point of E if and only if there exist an ordered p-index set I and an invertible p × p matrix C such that

A = C U_I' Σ_yx Σ_xx^{-1}    (4)
B = U_I C^{-1}.    (5)

For such a critical point we have

W = P_{U_I} Σ_yx Σ_xx^{-1}    (6)

and

E(A, B) = trace(Σ_yy) - ∑_{i∈I} λ_i.

Therefore, a critical W of rank p is always the product of the ordinary least-squares regression matrix followed by an orthogonal projection onto the subspace spanned by p eigenvectors of Σ. The critical map W associated with the index set {1, ..., p} is the unique local and global minimum of E. The remaining (n choose p) - 1 p-index sets correspond to saddle points. All additional critical points defined by matrices A and B which are not of full rank are also saddle points and can be characterized in terms of orthogonal projections onto subspaces spanned by q eigenvectors, with q < p.

In the autoassociative case, (4)-(6) become

A = C U_I'    (7)
B = U_I C^{-1}    (8)
W = P_{U_I}    (9)

and therefore the unique locally and globally optimal map W is the orthogonal projection onto the space spanned by the first p eigenvectors of Σ_xx.

This analysis links backpropagation in linear networks to several classical statistical techniques. In particular, at the global minimum of E, if C = I_p then the activities in the hidden layer are given by u_1'ŷ_t, ..., u_p'ŷ_t, the principal components of the least-squares estimators ŷ_t (see, for instance, [8]). In the autoassociative mode, these activities are given by u_1'x_t, ..., u_p'x_t, and correspond to the coordinates of the vector x_t along the first p eigenvectors of Σ_xx, as in the usual PCA. In general, if the initial conditions are random, one should not expect the backpropagation algorithm to converge to an optimum satisfying C = I_p. In the autoassociative case, this means that the rows of the final A and u_1, ..., u_p will span the same space, but A' ≠ [u_1, ..., u_p]. Although at first sight this may seem a drawback, it must be regarded as a property leading to more robust networks. Indeed, in a physical implementation where the compressed version of the data in the hidden layer is to be sent to further processing layers, it may not be desirable that one of the units, extracting the principal component, has a variance much larger than the other units (it is known, for instance, that in the case of random symmetric matrices, λ_2 ≪ λ_1 almost always; see [11]). A more balanced strategy, where all the variances in the hidden layer are comparable, is by far preferable and is commonly observed in simulations.

Since the optimal solution can be expressed analytically, it can also be obtained effectively with numerical analysis techniques without resorting to any descent procedure. As pointed out in the Introduction, however, this is not the most relevant point of view here, where the emphasis is on the learning behavior and emergent organizational principles of simple adaptive networks.
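The closed-form solution can be checked directly against gradient learning. The sketch below is illustrative only (the synthetic data, the dimensions n = 8 and p = 3, the learning rate, and variable names such as Sigma_xx and W_opt are my own choices, not the paper's): it builds the optimal autoassociative map of (9) as the projection onto the top p eigenvectors of the sample covariance, trains an n-p-n linear network by plain gradient descent, and checks that BA approaches the same projection while the rows of the learned A merely span the top-p eigenspace, i.e., the matrix C of (7) and (8) is generally not I_p.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic centered training data: n-dimensional patterns x_t, T samples.
n, p, T = 8, 3, 2000
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
scales = np.array([3.0, 2.5, 2.0, 1.0, 0.7, 0.5, 0.3, 0.2])
X = (rng.standard_normal((T, n)) * scales) @ Q.T      # rows are the patterns x_t
X -= X.mean(axis=0)

Sigma_xx = X.T @ X / T                                # sample covariance
lam, U = np.linalg.eigh(Sigma_xx)
lam, U = lam[::-1], U[:, ::-1]                        # eigenvalues in decreasing order
U_p = U[:, :p]

W_opt = U_p @ U_p.T                                   # optimal map of eq. (9): projection on span(u_1..u_p)
E_opt = np.trace(Sigma_xx) - lam[:p].sum()            # analytic minimum of E

# Gradient descent on E(A, B) = <||x - BAx||^2> for the n-p-n linear network.
A = 0.1 * rng.standard_normal((p, n))
B = 0.1 * rng.standard_normal((n, p))
eta = 0.02
for _ in range(20000):
    G = (B @ A - np.eye(n)) @ Sigma_xx                # gradient of E with respect to W = BA (up to a factor 2)
    A -= eta * B.T @ G
    B -= eta * G @ A.T

W = B @ A
E = np.trace((np.eye(n) - W) @ Sigma_xx @ (np.eye(n) - W).T)
print("||BA - W_opt|| :", np.linalg.norm(W - W_opt))
print("E(A,B) - E_opt :", E - E_opt)

# The rows of A lie in span(u_1, ..., u_p) but are not the eigenvectors themselves (C != I_p).
print("distance of rows of A from the top-p eigenspace:", np.linalg.norm(A - A @ U_p @ U_p.T))
print("||A - U_p'|| (generally not small)             :", np.linalg.norm(A - U_p.T))
```

With random initial conditions the learned hidden representation is a rotated and rescaled version of the leading principal components, which is the "balanced" behavior described above.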
One of the central issues in learning from examples is the problem of generalization, that is, how does the network perform when exposed to a pattern never seen previously? In this setting, a precise quantitative answer can be given to this question. For instance, in the autoassociative case, the distortion of a new pattern is given by its distance to the subspace generated by the first p eigenvectors of Σ_xx.

In the special case where rank(Σ_yx) = r = p, Gallinari et al. [12] have shown that if an n-p-m architecture is trained to classify n-dimensional inputs into m (m < n) classes, then the corresponding network performs discriminant analysis in the sense that, for an optimal W = BA, A' is a DA matrix. In other words, under these assumptions, the projection realized by A' maximizes the ratio given in (3). In this context, however, either p = r = m, in which case the architecture is n-m-m and there is no bottleneck, or r < m, and then full classification into m categories is not supported by the available data and there is no proper data compression (only filtering out of linear dependencies). In any case, all the network ever learns is to be a mean-square classifier, and this can be achieved without any hidden layer.

B. Deep Networks, Local Connectivity, Nonlinearities, and Bias

In Baldi [13], the case of deep networks with multiple hidden layers is briefly examined. It is easy to see that, in this case, the main constraint on the network comes from its bottleneck, that is, from the hidden layer with smallest size p (clearly, p could be attained in more than one hidden layer). Although the expression for the critical points may now become more involved, the main features of the landscape are unchanged: a multiplicity of saddle points, an absence of local minima, and a unique optimal input/output map satisfying (6) with I = {1, ..., p}.

The bottleneck layer imposes a rank restriction on the map computed by the network. Additional important constraints can be introduced on the geometry of the connections. Often connections are assumed to be local, in the sense that a unit in one layer receives projections only from a restricted subset of elements in the previous layer, for instance according to a Gaussian distribution. These geometrical constraints play an essential role in self-organizing maps and in several models of "linear" cortical development; see for instance Linsker [14]-[16] and Miller et al. [17]. These topics deserve separate treatment and will not be addressed here. As mentioned in the previous section, however, in the case of a locally connected linear network without any hidden layer, the landscape of the usual quadratic error is again completely devoid of local minima. Learning by descent methods should then be efficient. The landscape properties of the LMS (least mean square) error function of a linear locally connected multilayer network have not been carefully studied yet, and the previous results only give lower bounds. In particular, the question whether the error function has any local minimum remains open despite its disarming simplicity.
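To make the locally connected case without a hidden layer concrete, here is a minimal sketch (my own illustration; the banded connectivity mask of radius 1, the data, and the learning rate are assumptions, not constructions from the paper): the connectivity constraint is enforced simply by masking the gradient of the quadratic error, and descent reaches the row-wise constrained least-squares solution, consistent with the absence of local minima in this setting.

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, radius = 12, 3000, 1                     # each output unit sees inputs i-1, i, i+1 (assumed locality)

idx = np.arange(n)
mask = (np.abs(idx[:, None] - idx[None, :]) <= radius).astype(float)   # banded 0/1 connectivity

X = rng.standard_normal((T, n))
X -= X.mean(axis=0)
Sigma = X.T @ X / T

# Gradient descent on E(W) = <||x - Wx||^2>, restricted to the local (masked) weights.
W = np.zeros((n, n))
eta = 0.05
for _ in range(5000):
    W -= eta * ((W - np.eye(n)) @ Sigma) * mask

# Exact constrained least-squares solution, computed row by row on each unit's receptive field.
W_exact = np.zeros((n, n))
for i in range(n):
    S = np.flatnonzero(mask[i])
    W_exact[i, S] = np.linalg.solve(Sigma[np.ix_(S, S)], Sigma[S, i])

print("||W - W_exact|| :", np.linalg.norm(W - W_exact))
print("residual error  :", np.trace((np.eye(n) - W) @ Sigma @ (np.eye(n) - W).T))
```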
In the case of nonlinear units, few analytical results are known, but certainly local minima do appear. An important remark, however, has been made by Bourlard and Kamp [18]. In the autoassociative mode, it is natural to use linear units in the output layer. Under these conditions, nothing is to be gained by using nonlinear elements in the hidden layer. This is basically because the network is trying to approximate a linear map: the identity function. This result can be extended to any linear map. That is, if the set of pairs (x_t, y_t) of examples is such that y_t = F(x_t) for every t with linear F, then nonlinear units in the hidden layer can lead to an approximation of F which is at best equivalent to the approximation obtainable by using linear units exclusively. Reports of simulations in the literature confirm this point and sometimes seem to indicate that the solution found using nonlinear elements is "close" to PCA (Cottrell et al. [19]).

Finally, if it is not desirable to assume the existence of a preprocessing stage where the data are centered, then the theory can easily be extended to the case of linear units with bias (see, for instance, [18] and [20] for more details).

C. Noise Analysis

How robust are the previous results against the effects of noise? Different sorts of noise can be introduced, for instance at the level of the synaptic weights or of the activation functions. To fix the ideas, assume in our case that the activation functions in both the hidden layer and the output layer are "noisy." Hence, for an input x, the output of the hidden layer is w = Ax + n and the activity in the output units is z = Bw + e = BAx + Bn + e. Assume that the noise terms n and e have mean zero, covariance matrices Σ_nn and Σ_ee, and that they are uncorrelated with each other and with the patterns x and y. It is also reasonable to assume for simplicity that Σ_nn is full rank. We are now interested in the problem of minimizing

Ẽ(A, B) = ⟨||y - (BAx + Bn + e)||²⟩ = E(A, B) + trace(B Σ_nn B') + trace(Σ_ee).    (10)

Observe that the term trace(Σ_ee) is just an additive constant and has no influence on the variation of Ẽ with A and B. For any positive μ, Ẽ(μA, B/μ) - Ẽ(A, B) = trace(B Σ_nn B')(1 - μ²)/μ². Thus, if B ≠ 0 and μ > 1, then Ẽ(μA, B/μ) < Ẽ(A, B). As a result, without any additional constraints, there is no pair (A, B) which minimizes Ẽ. This is intuitively clear, as the network will try to make A as large as possible and/or B as small as possible so that the signal dominates the noise. It is therefore necessary to restrict the power of the signal Ax. One way of accomplishing this is to introduce "soft constraints" by adding penalty terms like ⟨||Ax||²⟩ or trace(AA') to Ẽ. Some results in this direction have been obtained in Plumbley [21].

The other possibility, which we shall consider here in more detail, is to explicitly restrict A to some compact subset 𝒜 of the set of all p × n matrices, for instance, a sphere centered at zero (the case of "hard constraints"). This leads to the problem of minimizing (10) with A ∈ 𝒜 and B arbitrary, which clearly has a well-defined solution. An optimal A must lie on the boundary ∂𝒜 of 𝒜 (if not, we could find a μ > 1 such that μA ∈ 𝒜).
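The scaling argument is easy to check numerically. The sketch below (an illustration only; the diagonal Σ_xx, the choice Σ_nn = I, and the random A and B are assumptions) evaluates Ẽ(μA, B/μ) of (10) for increasing μ: the reconstruction term E(A, B) is unchanged because BA is unchanged, while the noise penalty trace((B/μ)Σ_nn(B/μ)') shrinks like 1/μ², so Ẽ keeps decreasing and no minimizing pair exists without a constraint on A.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 6, 2

# Population quantities for the autoassociative noisy problem (assumed for illustration).
Sigma_xx = np.diag([4.0, 3.0, 2.0, 1.0, 0.5, 0.25])
Sigma_nn = np.eye(p)            # hidden-layer noise covariance
trace_ee = 0.1                  # trace of the output-noise covariance (an additive constant)

A = rng.standard_normal((p, n))
B = rng.standard_normal((n, p))

def E_noisy(A, B):
    """E~(A, B) = E(A, B) + trace(B Sigma_nn B') + trace(Sigma_ee), eq. (10)."""
    W = B @ A
    E = np.trace((np.eye(n) - W) @ Sigma_xx @ (np.eye(n) - W).T)   # <||x - BAx||^2>
    return E + np.trace(B @ Sigma_nn @ B.T) + trace_ee

for mu in [1.0, 2.0, 10.0, 100.0]:
    print(f"mu = {mu:6.1f}   E~(mu*A, B/mu) = {E_noisy(mu * A, B / mu):.6f}")
# E~ keeps decreasing with mu: without a constraint on A, no minimizing pair (A, B) exists.
```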
Let us write Σ_nn = σR, where σ > 0 measures the noise level and R is some structure matrix (the simplest case is R = I, but if the units are physically close it may be unnatural to assume that the individual components of the noise are uncorrelated). The explicit dependence of Ẽ on σ can be taken into account by writing

Ẽ(A, B) = Ẽ_σ(A, B) = E(A, B) + σ trace(B R B') + trace(Σ_ee).    (11)

As soon as σ ≤ 1 (for example), it is straightforward to see that the solutions of the problem of minimizing Ẽ with A ∈ 𝒜 are identical to the solutions of minimizing Ẽ with A ∈ 𝒜 and B ∈ ℬ, where ℬ is some fixed compact set independent of σ. By (11), as σ → 0, Ẽ(A, B) converges uniformly to E(A, B) + trace(Σ_ee) over the compact set 𝒜 × ℬ. Since these two functions differ only by an additive constant, the solutions of the noisy constrained problem approach the set of all pairs of matrices (A, B) satisfying (4) and (5) with, in addition, A ∈ 𝒜 (this automatically forces B to be in ℬ, by restricting the matrices C). In other words, if 𝒜_σ is the set of all matrices A ∈ 𝒜 which are optimal for noise level σ, then

lim_{σ→0} 𝒜_σ = {A ∈ ∂𝒜 : A = C U_I' Σ_yx Σ_xx^{-1} with invertible C}

(provided of course that the latter set is nonempty). That is, as is intuitively clear, if σ is very small, the solutions of the constrained noisy problem are essentially the same as the solutions of the nonnoisy problem.

In the Appendix we show that, as σ → ∞,

min_B Ẽ_σ(A, B) = trace(Σ_yy + Σ_ee) - σ^{-1} trace(M A' R^{-1} A) + O(σ^{-2})

uniformly over 𝒜, where M = Σ_xy Σ_yx. Hence, if σ is very large, minimizing Ẽ_σ over 𝒜 is essentially equivalent to maximizing trace(M A' R^{-1} A) over 𝒜. The solution set 𝒜_∞ of this "asymptotic problem" depends significantly on the choice of the constraint set 𝒜. For instance, if 𝒜 = {A : ||A||_F² = trace(AA') ≤ p} (||A||_F is the so-called Frobenius norm of A), then 𝒜_∞ consists of all A of the form √p v u', where u and v are normalized principal eigenvectors of M and R^{-1}, respectively. On the other hand, if 𝒜 = {A : AA' = I_p} (i.e., the rows of A are orthonormal) and R = I, then the rows of the optimal A span the space of the first p principal eigenvectors of M (for details, see the Appendix). Now notice that AA' = I_p implies trace(AA') = p. Hence, in the high-noise case, full PCA of M is inferior to extraction of the first principal component of M only. Or in other words, it is better not to force the rows of A to orthonormality, but to allow them to cooperate (build "balanced" representations) instead. In this sense, if σ is very large and 𝒜 is "rich" enough, the solutions of the constrained noisy problem are of maximum redundancy, where all the hidden units try to do the same thing.
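The comparison between the two constraint sets in the high-noise regime reduces to arithmetic on the eigenvalues of M. A small sketch (with an assumed positive semidefinite M and R = I; in the text M = Σ_xy Σ_yx): the orthonormal-rows choice A = [u_1, ..., u_p]' gives trace(M A'A) = λ_1 + ... + λ_p, while the rank-one choice of Frobenius norm √p concentrated on u_1 gives p λ_1, which is always at least as large.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 8, 3

# An assumed positive semidefinite M (in the text, M = Sigma_xy Sigma_yx) and R = I.
G = rng.standard_normal((n, n))
M = G @ G.T
lam, U = np.linalg.eigh(M)
lam, U = lam[::-1], U[:, ::-1]          # eigenvalues of M in decreasing order

crit = lambda A: np.trace(M @ A.T @ A)  # trace(M A' R^{-1} A) with R = I

A_orth = U[:, :p].T                     # orthonormal rows: AA' = I_p, "full PCA of M"
v = np.zeros(p); v[0] = 1.0
A_frob = np.sqrt(p) * np.outer(v, U[:, 0])   # rank one, trace(AA') = p: all units share u_1

print("orthonormal rows       :", crit(A_orth), "= lam_1 + ... + lam_p =", lam[:p].sum())
print("rank one, ||A||_F^2 = p:", crit(A_frob), "= p * lam_1 =", p * lam[0])
# p*lam_1 >= lam_1 + ... + lam_p: under the trace constraint the redundant solution wins at high noise.
```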
Significantly refined results for the autoassociative case (where M = Σ_xx²) have been given in Diamantaras and Hornik [22]. They show that for arbitrary invertible Σ_nn and orthogonally right-invariant 𝒜 (i.e., AY ∈ 𝒜 if A ∈ 𝒜 and Y is orthogonal), Ẽ is minimized for matrices A of the form CU' for suitable C (as usual, the columns of U are the eigenvectors of Σ_xx). Under the constraint AA' = I_p, the minima are attained at A = ∑_{i=1}^p v_i u_i', where v_1, ..., v_p are normalized eigenvectors of R^{-1} with the corresponding eigenvalues arranged in decreasing order. Under the Frobenius norm constraint trace(AA') ≤ p, the minima are attained at A of the form ∑_{i=1}^r √p γ_i v_i u_i', where the γ_i and the rank r depend on the eigenvalues of Σ_xx and Σ_nn. In particular, if Σ_nn = σR as before, then r = r(σ) is nonincreasing, with r(σ) = p for all σ sufficiently small and r(σ) = 1 for all σ sufficiently large. This result formalizes the intuition that the units should increase "cooperation" along with the noise level.
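A rough numerical illustration of this crossover (a sketch of mine, not the construction used in [22]): for the autoassociative problem with Σ_nn = σI, the optimal B for a fixed encoder A has the closed form B* = Σ_xx A'(AΣ_xxA' + σI)^{-1} (an ordinary regularized least-squares step, derived here only for the illustration). Comparing min_B Ẽ_σ for the orthonormal PCA encoder A = U_p' and for a "cooperative" rank-one encoder of Frobenius norm √p, the PCA encoder wins at small σ and the redundant one at large σ.

```python
import numpy as np

n, p = 6, 2
lam = np.array([4.0, 3.0, 2.0, 1.0, 0.5, 0.25])
Sigma_xx = np.diag(lam)                      # eigenvectors u_i are the standard basis vectors
U = np.eye(n)

def min_over_B(A, sigma):
    """min_B E~_sigma(A, B) for the autoassociative case with Sigma_nn = sigma * I.
    Uses B* = Sigma_xx A' (A Sigma_xx A' + sigma I)^{-1} (regularized least squares)."""
    B = Sigma_xx @ A.T @ np.linalg.inv(A @ Sigma_xx @ A.T + sigma * np.eye(p))
    W = B @ A
    E = np.trace((np.eye(n) - W) @ Sigma_xx @ (np.eye(n) - W).T)
    return E + sigma * np.trace(B @ B.T)     # output-noise constant trace(Sigma_ee) omitted

A_pca = U[:, :p].T                           # orthonormal rows: the first p eigenvectors
v = np.ones(p) / np.sqrt(p)
A_coop = np.sqrt(p) * np.outer(v, U[:, 0])   # rank one, ||A||_F^2 = p: all units extract u_1

for sigma in [0.01, 0.1, 1.0, 10.0, 100.0]:
    print(f"sigma = {sigma:7.2f}   PCA encoder: {min_over_B(A_pca, sigma):8.4f}"
          f"   cooperative encoder: {min_over_B(A_coop, sigma):8.4f}")
# At small sigma the PCA encoder is better; at large sigma the redundant (cooperative) one wins.
```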
Further generalizations are possible by considering non-MSE measures of the "size" of the linear reconstruction errors w = y - (BAx + Bn + e); see Hornik [23]. In particular, in the Gaussian case, the determinant of ⟨ww'⟩ measures the amount of transmitted information, and its constrained maximization is intimately related to the INFOMAX principle of Linsker [9].

IV. GENERALIZATION

This section is written with Y. Chauvin and is a modified version of the article "Temporal Evolution of Generalization During Learning in Linear Networks" [24]. The material, copyrighted by MIT Press, was included here with permission from the publisher.

A. Formal Setting

In practice, the question to be answered is how should one allocate limited resources and parameters, such as network size and architecture, initial conditions, training time, and available examples, to optimize generalization performance? One conventional approach is to consider the problem of learning as a surface fitting problem. Accordingly, neural networks should be very constrained, with a minimal number of parameters, to avoid the classical overfitting problem. In practice, however, not too much is known about overfitting and its onset, both as a function of network parameters and training time. Furthermore, the conventional view can be challenged. It may be the case, for instance, that a suitable strategy consists rather in using networks with a few extra parameters. These larger networks must be used in conjunction with nontrivial priors in a Bayesian framework and/or trained for shorter times, based on a careful monitoring of the validation error ("early stopping").

Partial initial results on generalization problems have been obtained in recent years in terms of VC (Vapnik-Chervonenkis) dimension and statistical mechanics (see, for instance, [25]-[27]). Here, we propose a different and complementary approach consisting of a detailed analysis of generalization in simple feedforward linear networks. Even in this simple framework, the questions are far from trivial. Thus we have restricted the problem even further: learning the identity map in a single-layer feedforward linear network. With suitable assumptions on the noise, this setting turns out to be insightful and to yield analytical results which are relevant to what one observes in more complicated situations. Here, we first define our framework and derive the basic equations, first in the noiseless case and then in the case of noisy data. The basic point is to derive an expression for the validation function in terms of the statistical properties of the population and the training and validation samples. We then examine the main results, which consist of an analysis of the landscape of the validation error as a function of training time. Simple simulation results are also presented, and several interesting phenomena are described. The results are discussed in the conclusion, and some possible extensions are briefly mentioned. Mathematical proofs are deferred to the Appendix.

We consider a simple feedforward network with n input units connected by a weight matrix A to n linear output units.¹ The network is trained to learn the identity function (autoassociation) from a set of centered training patterns x_1, ..., x_T. The connection weights are adjusted by gradient descent on the usual LMS error function

E(A) = ⟨||x - Ax||²⟩.

The gradient of E with respect to the weights A is

∇E = (A - I)Σ

where Σ = Σ_xx is the covariance matrix of the training set. Thus, the gradient descent learning rule can be expressed as

A(k + 1) = A(k) - η(A(k) - I)Σ

where η is the constant learning rate. Simple induction shows that

A(k) = (I - (I - ηΣ)^k) + A(0)(I - ηΣ)^k.

Hence, if u_i and λ_i (λ_1 ≥ ··· ≥ λ_n > 0) denote the eigenvectors and eigenvalues of Σ, then

A(k + 1)u_i = (1 - (1 - ηλ_i)^{k+1}) u_i + (1 - ηλ_i)^{k+1} A(0)u_i.    (12)

The behavior of (12) is clear: provided the learning rate is less than the inverse of the largest eigenvalue (η < 1/λ_1), A(k) approaches the identity exponentially fast. This holds for any starting matrix A(0). The eigenvectors of Σ tend to become eigenvectors of A(k), and the corresponding eigenvalues approach one at different rates depending on λ_i (larger eigenvalues are learned much faster). As a result, it is not very restrictive to assume, for ease of exposition, that the starting matrix A(0) is diagonal in the u_i basis, that is, A(0) = U diag(a_i(0)) U', where as usual, U = [u_1, ..., u_n].

¹Krogh and Hertz [28] have independently analyzed the evolution of generalization in the case of one single linear unit. It can be shown that the evolution curve can assume one of several possible shapes, depending on a number of parameters. Although in the absence of any hidden layer there is no coupling between the output units of an n-n network, it is still necessary to study the n-n case since the corresponding evolution function results from the summation of the evolution curves of each output unit, each such curve being capable of assuming a different shape with different characteristics.
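The exponential dynamics described by (12) are easy to reproduce. The sketch below (illustrative; the covariance spectrum, learning rate, and initial conditions are assumed) iterates A(k+1) = A(k) - η(A(k) - I)Σ, checks the closed-form expression for A(k), and prints how fast each eigendirection of Σ is learned: A(k)u_i approaches u_i at the rate (1 - ηλ_i)^k, so directions with large eigenvalues are learned first.

```python
import numpy as np

rng = np.random.default_rng(5)
n, eta, K = 5, 0.1, 60

# A training-set covariance with well-separated eigenvalues (assumed for illustration).
lam = np.array([8.0, 4.0, 2.0, 1.0, 0.5])
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
Sigma = Q @ np.diag(lam) @ Q.T                      # eigenvectors u_i are the columns of Q
assert eta < 1.0 / lam.max()                        # stability condition eta < 1/lambda_1

A = 0.01 * rng.standard_normal((n, n))              # starting matrix A(0)
A0 = A.copy()

for k in range(K):
    A = A - eta * (A - np.eye(n)) @ Sigma           # gradient-descent rule on E(A) = <||x - Ax||^2>

# Closed form: A(K) = (I - (I - eta*Sigma)^K) + A(0)(I - eta*Sigma)^K
M = np.linalg.matrix_power(np.eye(n) - eta * Sigma, K)
A_closed = (np.eye(n) - M) + A0 @ M
print("max |A(K) - closed form| :", np.abs(A - A_closed).max())

# Per-eigendirection progress: ||A(K)u_i - u_i|| shrinks like (1 - eta*lambda_i)^K, cf. eq. (12).
for i in range(n):
    u = Q[:, i]
    print(f"lambda = {lam[i]:4.1f}   ||A(K)u - u|| = {np.linalg.norm(A @ u - u):.2e}"
          f"   (1 - eta*lambda)^K = {(1 - eta*lam[i])**K:.2e}")
```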