Principe, J.C. “Artificial Neural Networks.” The Electrical Engineering Handbook. Ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
20 Artificial Neural Networks

Jose C. Principe
University of Florida

20.1 Definitions and Scope
Introduction • Definitions and Style of Computation • ANN Types and Applications
20.2 Multilayer Perceptrons
Function of Each PE • How to Train MLPs • Applying Back-Propagation in Practice • A Posteriori Probabilities
20.3 Radial Basis Function Networks
20.4 Time Lagged Networks
Memory Structures • Training-Focused TLN Architectures
20.5 Hebbian Learning and Principal Component Analysis Networks
Hebbian Learning • Principal Component Analysis • Associative Memories
20.6 Competitive Learning and Kohonen Networks

20.1 Definitions and Scope

Introduction

Artificial neural networks (ANN) are among the newest signal-processing technologies in the engineer’s toolbox. The field is highly interdisciplinary, but our approach will restrict the view to the engineering perspective. In engineering, neural networks serve two important functions: as pattern classifiers and as nonlinear adaptive filters. We will provide a brief overview of the theory, learning rules, and applications of the most important neural network models.

Definitions and Style of Computation

An ANN is an adaptive, most often nonlinear system that learns to perform a function (an input/output map) from data. Adaptive means that the system parameters are changed during operation, normally called the training phase. After the training phase the ANN parameters are fixed and the system is deployed to solve the problem at hand (the testing phase). The ANN is built with a systematic step-by-step procedure to optimize a performance criterion or to follow some implicit internal constraint, which is commonly referred to as the learning rule. The input/output training data are fundamental in neural network technology, because they convey the necessary information to “discover” the optimal operating point. The nonlinear nature of the neural network processing elements (PEs) provides the system with a great deal of flexibility to achieve practically any desired input/output map, i.e., some ANNs are universal mappers.

There is a style in neural computation that is worth describing (Fig. 20.1). An input is presented to the network and a corresponding desired or target response is set at the output (when this is the case the training is called supervised). An error is composed from the difference between the desired response and the system
output. This error information is fed back to the system and adjusts the system parameters in a systematic fashion (the learning rule). The process is repeated until the performance is acceptable.

FIGURE 20.1 The style of neural computation.

It is clear from this description that the performance hinges heavily on the data. If one does not have data that cover a significant portion of the operating conditions, or if the data are noisy, then neural network technology is probably not the right solution. On the other hand, if there is plenty of data but the problem is too poorly understood to derive an approximate model, then neural network technology is a good choice.

This operating procedure should be contrasted with traditional engineering design, made of exhaustive subsystem specifications and intercommunication protocols. In ANNs, the designer chooses the network topology, the performance function, the learning rule, and the criterion to stop the training phase, but the system automatically adjusts the parameters. So, it is difficult to bring a priori information into the design, and when the system does not work properly it is also hard to incrementally refine the solution. But ANN-based solutions are extremely efficient in terms of development time and resources, and in many difficult problems ANNs provide performance that is difficult to match with other technologies. Ten years ago Denker said that “ANNs are the second best way to implement a solution,” motivated by the simplicity of their design and their universality, shadowed only by the traditional design obtained by studying the physics of the problem. At present, ANNs are emerging as the technology of choice for many applications, such as pattern recognition, prediction, system identification, and control.

ANN Types and Applications

It is always risky to establish a taxonomy of a technology, but our motivation is one of providing a quick overview of the application areas and the most popular topologies and learning paradigms.

Application | Topology | Supervised Learning | Unsupervised Learning
Association | Hopfield [Zurada, 1992; Haykin, 1994] | — | Hebbian [Zurada, 1992; Haykin, 1994; Kung, 1993]
Association | Multilayer perceptron [Zurada, 1992; Haykin, 1994; Bishop, 1995] | Back-propagation [Zurada, 1992; Haykin, 1994; Bishop, 1995] | —
Association | Linear associative mem. [Zurada, 1992; Haykin, 1994] | — | Hebbian
Pattern recognition | Multilayer perceptron [Zurada, 1992; Haykin, 1994; Bishop, 1995] | Back-propagation | —
Pattern recognition | Radial basis functions [Zurada, 1992; Bishop, 1995] | Least mean square | k-means [Bishop, 1995]
Feature extraction | Competitive [Zurada, 1992; Haykin, 1994] | — | Competitive
Feature extraction | Kohonen [Zurada, 1992; Haykin, 1994] | — | Kohonen
Feature extraction | Multilayer perceptron [Kung, 1993] | Back-propagation | —
Feature extraction | Principal comp. anal. [Zurada, 1992; Kung, 1993] | — | Oja’s [Zurada, 1992; Kung, 1993]
Prediction, system ID | Time-lagged networks [Zurada, 1992; Kung, 1993; de Vries and Principe, 1992] | Back-propagation through time [Zurada, 1992] | —
Prediction, system ID | Fully recurrent nets [Zurada, 1992] | Back-propagation through time [Zurada, 1992] | —
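The supervised style of computation just described (present an input, compare the output with the desired response, and feed the error back to the learning rule until performance is acceptable) can be summarized in a short sketch. The following Python fragment only illustrates the control flow, not any particular network: forward and adjust are hypothetical placeholders for the network map and its learning rule, and the stopping threshold is an arbitrary assumption.

```python
import numpy as np

def train_supervised(forward, adjust, inputs, targets, epochs=100):
    """Minimal sketch of supervised training: present an input, compare the
    output with the desired response, and let the learning rule adjust the
    parameters. `forward` and `adjust` are placeholders for whatever network
    and learning rule is actually used."""
    total_error = 0.0
    for epoch in range(epochs):
        total_error = 0.0
        for x, d in zip(inputs, targets):
            y = forward(x)                     # system output for this input
            e = d - y                          # error: desired minus output
            adjust(x, e)                       # learning rule updates parameters
            total_error += float(np.sum(np.asarray(e) ** 2))
        if total_error < 1e-3:                 # stop criterion (assumed threshold)
            break
    return total_error
```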
It is clear that multilayer perceptrons (MLPs), the back-propagation algorithm and its extensions — time-lagged networks (TLN) and back-propagation through time (BPTT), respectively — hold a prominent position in ANN technology. It is therefore only natural to spend most of our overview presenting the theory and tools of back-propagation learning. It is also important to notice that Hebbian learning (and its extension, the Oja rule) is also a very useful (and biologically plausible) learning mechanism. It is an unsupervised learning method, since there is no need to specify the desired or target response to the ANN.

20.2 Multilayer Perceptrons

Multilayer perceptrons are a layered arrangement of nonlinear PEs as shown in Fig. 20.2. The layer that receives the input is called the input layer, and the layer that produces the output is the output layer. The layers that do not have direct access to the external world are called hidden layers. A layered network with just the input and output layers is called the perceptron. Each connection between PEs is weighted by a scalar, w_i, called a weight, which is adapted during learning.

FIGURE 20.2 MLP with one hidden layer (d-k-m).

The PEs in the MLP are composed of an adder followed by a smooth saturating nonlinearity of the sigmoid type (Fig. 20.3). The most common saturating nonlinearities are the logistic function and the hyperbolic tangent. The threshold is used in other nets. The importance of the MLP is that it is a universal mapper (it implements arbitrary input/output maps) when the topology has at least two hidden layers and a sufficient number of PEs [Haykin, 1994]. Even MLPs with a single hidden layer are able to approximate continuous input/output maps. This means that we will rarely need to choose topologies with more than two hidden layers. But these are existence proofs, so the issue that we must solve as engineers is to choose how many layers and how many PEs in each layer are required to produce good results.

FIGURE 20.3 A PE and the most common nonlinearities.

Many problems in engineering can be thought of in terms of a transformation of an input space, containing the input, to an output space where the desired response exists. For instance, dividing data into classes can be thought of as transforming the input into 0 and 1 responses that will code the classes [Bishop, 1995]. Likewise, identification of an unknown system can also be framed as a mapping (function approximation) from the input to the system output [Kung, 1993]. The MLP is highly recommended for these applications.
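To make the d-k-m structure of Fig. 20.2 and the PE of Fig. 20.3 concrete, here is a minimal sketch of the forward pass of a one-hidden-layer MLP in Python (NumPy). The weight shapes, the choice of tanh hidden PEs with logistic output PEs, and the example sizes are assumptions for illustration, not prescriptions from the chapter.

```python
import numpy as np

def logistic(net):
    """Logistic nonlinearity, one of the common sigmoid PEs (Fig. 20.3)."""
    return 1.0 / (1.0 + np.exp(-net))

def mlp_forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer MLP (d-k-m, as in Fig. 20.2).
    Each PE computes net = sum_i(w_i * x_i) + b and passes it through a
    smooth saturating nonlinearity (tanh here for the hidden layer)."""
    hidden = np.tanh(W1 @ x + b1)        # k hidden PEs
    output = logistic(W2 @ hidden + b2)  # m output PEs
    return output

# Illustrative usage with assumed sizes d=3 inputs, k=4 hidden PEs, m=2 outputs.
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(2, 4)), np.zeros(2)
print(mlp_forward(x, W1, b1, W2, b2))
```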
Function of Each PE

Let us study briefly the function of a single PE with two inputs [Zurada, 1992]. If the nonlinearity is the threshold nonlinearity, we can immediately see that the output is simply 1 or –1. The surface that divides these subspaces is called a separation surface, and in this case it is a line of equation

    y(w_1, w_2) = w_1 x_1 + w_2 x_2 + b = 0    (20.1)

i.e., the PE weights and the bias control the orientation and position of the separation line, respectively (Fig. 20.4). In many dimensions the separation surface becomes a hyperplane of dimension one less than the dimensionality of the input space. So, each PE creates a dichotomy in the input space. For smooth nonlinearities the separation surface is not crisp; it becomes fuzzy, but the same principles apply. In this case, the size of the weights controls the width of the fuzzy boundary (larger weights shrink the fuzzy boundary).

FIGURE 20.4 A two-input PE and its separation surface.

The perceptron input/output map is built from a juxtaposition of linear separation surfaces, so the perceptron gives zero classification error only for linearly separable classes (i.e., classes that can be exactly classified by hyperplanes). When one adds one layer to the perceptron, creating a one-hidden-layer MLP, the type of separation surfaces changes drastically. It can be shown that this learning machine is able to create “bumps” in the input space, i.e., an area of high response surrounded by low responses [Zurada, 1992]. The function of each PE is always the same, no matter if the PE is part of a perceptron or an MLP. However, notice that the output layer in the MLP works with the result of the hidden layer activations, creating an embedding of functions and producing more complex separation surfaces. The one-hidden-layer MLP is able to produce nonlinear separation surfaces.

If one adds an extra layer (i.e., two hidden layers), the learning machine can now combine bumps at will, which can be interpreted as a universal mapper, since there is evidence that any function can be approximated by localized bumps. One important aspect to remember is that changing a single weight in the MLP can drastically change the location of the separation surfaces; i.e., the MLP achieves the input/output map through the interplay of all its weights.

How to Train MLPs

One fundamental issue is how to adapt the weights w_i of the MLP to achieve a given input/output map. The core ideas have been around for many years in optimization, and they are extensions of well-known engineering principles, such as the least mean square (LMS) algorithm of adaptive filtering [Haykin, 1994]. Let us review the theory here. Assume that we have a linear PE (f(net) = net) and that one wants to adapt the weights so as to minimize the square difference between the desired signal and the PE response (Fig. 20.5).

This problem has an analytical solution known as the least-squares solution [Haykin, 1994]. The optimal weights are obtained as the product of the inverse of the input autocorrelation function (R^-1) and the cross-correlation vector (P) between the input and the desired response. The analytical solution is equivalent to a search for the minimum of the quadratic performance surface J(w_i) using gradient descent, where the weights at each iteration k are adjusted by

    w(k+1) = w(k) - η ∇J(w(k))

where η is a small positive constant (the step size).
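As a rough illustration of this search, the sketch below trains a single linear PE by gradient descent on the squared error, using the instantaneous (per-sample) gradient in place of the true gradient of J(w), which is the LMS flavor referred to above. The synthetic data, the step size eta, and the epoch count are illustrative assumptions, not values from the chapter.

```python
import numpy as np

def lms_train(X, d, eta=0.01, epochs=50):
    """Adapt the weights of a linear PE (f(net) = net) by gradient descent
    on the squared error, using the instantaneous gradient (LMS)."""
    n_samples, n_inputs = X.shape
    w = np.zeros(n_inputs)
    b = 0.0
    for _ in range(epochs):
        for x, target in zip(X, d):
            y = w @ x + b          # linear PE output
            e = target - y         # error for this sample
            w += eta * e * x       # -grad of e^2 w.r.t. w is proportional to e*x
            b += eta * e           # same step for the bias
    return w, b

# Illustrative data: a noisy linear map the PE should recover.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
d = X @ np.array([1.5, -0.7]) + 0.3 + 0.01 * rng.normal(size=200)
print(lms_train(X, d))
```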