Figure 2. A simple recurrent network in which activations are copied from hidden layer to context layer on a one-for-one basis, with fixed weight of 1.0. Dotted lines represent trainable connections.

input, and also the previous internal state of some desired output. Because the patterns on the hidden units are saved as context, the hidden units must accomplish this mapping and at the same time develop representations which are useful encodings of the temporal properties of the sequential input. Thus, the internal representations that develop are sensitive to temporal context; the effect of time is implicit in these internal states. Note, however, that these representations of temporal context need not be literal. They represent a memory which is highly task- and stimulus-dependent.
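The copy-back mechanism shown in Figure 2 can be made concrete with a short sketch of the forward pass. The listing below is an illustrative reconstruction rather than the original implementation: the class and variable names, the weight initialization, and the use of a logistic activation are assumptions, and training (done in the original work by backpropagation on the trainable connections only) is omitted. The essential points it captures are that the context units receive a one-for-one copy of the previous hidden activations with a fixed weight of 1.0, and that the hidden units see that copy alongside the current input.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class SimpleRecurrentNetwork:
    # Sketch of the Figure 2 architecture: the context units hold a copy of
    # the previous hidden-unit activations, copied one-for-one with a fixed
    # weight of 1.0; only the input->hidden, context->hidden, and
    # hidden->output connections (the dotted lines) carry trainable weights.
    def __init__(self, n_input, n_hidden, n_output, seed=0):
        rng = np.random.default_rng(seed)
        self.W_in = rng.uniform(-0.5, 0.5, (n_hidden, n_input))
        self.W_ctx = rng.uniform(-0.5, 0.5, (n_hidden, n_hidden))
        self.W_out = rng.uniform(-0.5, 0.5, (n_output, n_hidden))
        self.context = np.zeros(n_hidden)  # context units start at rest

    def step(self, x):
        # Hidden units see the current external input plus the saved context.
        hidden = sigmoid(self.W_in @ x + self.W_ctx @ self.context)
        output = sigmoid(self.W_out @ hidden)
        self.context = hidden.copy()  # one-for-one copy back, weight 1.0
        return output

# Example: the XOR configuration described below
# (1 input unit, 2 hidden units, 1 output unit, 2 context units).
net = SimpleRecurrentNetwork(n_input=1, n_hidden=2, n_output=1)
prediction = net.step(np.array([1.0]))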
Consider now the results of applying this architecture to a number of problems that involve processing of inputs which are naturally presented in sequence.

EXCLUSIVE-OR

The Exclusive-Or (XOR) function has been of interest because it cannot be learned by a simple two-layer network. Instead, it requires at least three layers. The XOR is usually presented as a problem involving 2-bit input vectors (00, 11, 01, 10) yielding 1-bit output vectors (0, 0, 1, 1, respectively).

This problem can be translated into a temporal domain in several ways. One version involves constructing a sequence of 1-bit inputs by presenting the 2-bit inputs one bit at a time (i.e., in 2 time steps), followed by the 1-bit output; then continuing with another input/output pair chosen at random. A sample input might be:

101000011110101...

Here, the first and second bits are XOR-ed to produce the third; the fourth and fifth are XOR-ed to give the sixth; and so on. The inputs are concatenated and presented as an unbroken sequence.

In the current version of the XOR problem, the input consisted of a sequence of 3,000 bits constructed in this manner. This input stream was presented to the network shown in Figure 2 (with 1 input unit, 2 hidden units, 1 output unit, and 2 context units), one bit at a time. The task of the network was, at each point in time, to predict the next bit in the sequence. That is, given the input sequence shown, where one bit at a time is presented, the correct output at corresponding points in time is shown below.

input:  101000011110101...
output: 01000011110101?...

Recall that the actual input to the hidden layer consists of the input shown above, as well as a copy of the hidden unit activations from the previous cycle. The prediction is thus based not just on input from the world, but also on the network's previous state (which is continuously passed back to itself on each cycle).

Notice that, given the temporal structure of this sequence, it is only sometimes possible to predict the next item correctly. When the network has received the first bit (1 in the example above), there is a 50% chance that the next bit will be a 1 (or a 0). When the network receives the second bit (0), however, it should then be possible to predict that the third will be the XOR, 1. When the fourth bit is presented, the fifth is not predictable. But from the fifth bit, the sixth can be predicted, and so on.
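For concreteness, the bit stream just described can be generated as follows. This is a minimal sketch under the construction given above (random input pairs followed by their XOR, concatenated into one unbroken sequence); the function name and seed are illustrative and not taken from the original simulation.

import random

def xor_stream(n_pairs, seed=0):
    # Concatenate randomly chosen input pairs with their XOR:
    # a, b, a XOR b, a, b, a XOR b, ...
    rng = random.Random(seed)
    bits = []
    for _ in range(n_pairs):
        a, b = rng.randint(0, 1), rng.randint(0, 1)
        bits.extend([a, b, a ^ b])
    return bits

stream = xor_stream(1000)   # 3,000 bits, as in the simulation described above
inputs = stream[:-1]        # the bit presented at each time step
targets = stream[1:]        # the next bit, which the network must predict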
Figure 3. Graph of root mean squared error over 12 consecutive inputs in sequential XOR task. Data points are averaged over 1,200 trials.

In fact, after 600 passes through a 3,000-bit sequence constructed in this way, the network's ability to predict the sequential input closely follows the above schedule. This can be seen by looking at the sum squared error in the output prediction at successive points in the input. The error signal provides a useful guide as to when the network recognized a temporal sequence, because at such moments its outputs exhibit low error. Figure 3 contains a plot of the sum squared error over 12 time steps (averaged over 1,200 cycles). The error drops at those points in the sequence where a correct prediction is possible; at other points, the error is high. This is an indication that the network has learned something about the temporal structure of the input, and is able to use previous context and current input to make predictions about future input. The network, in fact, attempts to use the XOR rule at all points in time; this fact is obscured by the averaging of error, which is done for Figure 3. If one looks at the output activations, it is apparent from the nature of the errors that the network predicts successive inputs to be the XOR of the previous two. This is guaranteed to be successful every third bit, and will sometimes, fortuitously, also result in correct predictions at other times.
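The regularity visible in Figure 3 can also be checked directly by averaging the squared prediction error separately at each of the three positions in the repeating input cycle. The helper below is an illustrative analysis sketch, not part of the original study; it assumes a one-dimensional array of per-time-step squared errors.

import numpy as np

def error_by_cycle_position(squared_errors, period=3):
    # Fold the per-time-step squared errors into rows of length `period` and
    # average down the columns. One of the three positions corresponds to the
    # XOR bit, where the error should drop; the other two correspond to the
    # unpredictable random bits, where the error stays high.
    e = np.asarray(squared_errors, dtype=float)
    n = (len(e) // period) * period
    return e[:n].reshape(-1, period).mean(axis=0)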
It is interesting that the solution to the temporal version of XOR is somewhat different from the solution to the static version of the same problem. In a network with two hidden units, one unit is highly activated when the input sequence is a series of identical elements (all 1s or 0s), whereas the other unit is highly activated when the input elements alternate. Another way of viewing this is that the network develops units which are sensitive to high- and low-frequency inputs. This is a different solution than is found with feed-forward networks and simultaneously presented inputs. This suggests that problems may change their nature when cast in a temporal form. It is not clear that the solution will be easier or more difficult in this form, but it is an important lesson to realize that the solution may be different.

In this simulation, the prediction task has been used in a way that is somewhat analogous to auto-association. Auto-association is a useful technique for discovering the intrinsic structure possessed by a set of patterns. This occurs because the network must transform the patterns into more compact representations; it generally does so by exploiting redundancies in the patterns. Finding these redundancies can be of interest because of what they reveal about the similarity structure of the data set (cf. Cottrell et al., 1987; Elman & Zipser, 1988).

In this simulation, the goal is to find the temporal structure of the XOR sequence. Simple auto-association would not work, since the task of simply reproducing the input at all points in time is trivially solvable and does not require sensitivity to sequential patterns. The prediction task is useful because its solution requires that the network be sensitive to temporal structure.

STRUCTURE IN LETTER SEQUENCES

One question which might be asked is whether the memory capacity of the network architecture employed here is sufficient to detect more complex sequential patterns than the XOR. The XOR pattern is simple in several respects. It involves single-bit inputs, requires a memory which extends only one bit back in time, and has only four different input patterns. More challenging inputs would require multi-bit inputs of greater temporal extent, and a larger inventory of possible sequences. Variability in the duration of a pattern might also complicate the problem.

An input sequence was devised which was intended to provide just these sorts of complications. The sequence was composed of six different 6-bit binary vectors. Although the vectors were not derived from real speech, one might think of them as representing speech sounds, with the six dimensions of the vector corresponding to articulatory features. Table 1 shows the vector for each of the six letters.

The sequence was formed in two steps. First, the three consonants (b, d, g) were combined in random order to obtain a 1,000-letter sequence. Then, each consonant was replaced using the rules

b → ba
d → dii
g → guuu
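A sketch of this two-step construction, and of the next-letter prediction task described below, is given here for concreteness. It is illustrative only: the function names, the random seed, and the use of a circular shift to implement the wrap-around of the sequence are assumptions, and the 6-bit feature values are those listed in Table 1 (shown below).

import random
import numpy as np

RULES = {"b": "ba", "d": "dii", "g": "guuu"}

# Feature values (Consonant, Vowel, Interrupted, High, Back, Voiced) as given
# in Table 1 below; transcribed here only for illustration.
LETTERS = {
    "b": [1, 0, 1, 0, 0, 1],
    "d": [1, 0, 1, 1, 0, 1],
    "g": [1, 0, 1, 0, 1, 1],
    "a": [0, 1, 0, 0, 1, 1],
    "i": [0, 1, 0, 1, 0, 1],
    "u": [0, 1, 0, 1, 1, 1],
}

def letter_sequence(n_consonants=1000, seed=0):
    # Step 1: consonants in random order; step 2: expand each consonant by
    # its fixed vowel pattern (b -> ba, d -> dii, g -> guuu).
    rng = random.Random(seed)
    return "".join(RULES[c] for c in rng.choices("bdg", k=n_consonants))

seq = letter_sequence()
vectors = np.array([LETTERS[ch] for ch in seq], dtype=float)

# Prediction task: each 6-bit vector is the input at one time step, and the
# following vector is the target; the sequence wraps around, so the first
# pattern follows the last.
inputs = vectors
targets = np.roll(vectors, -1, axis=0)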
TABLE 1
Vector Definitions of Alphabet

      Consonant   Vowel   Interrupted   High   Back   Voiced
b         1         0          1          0      0       1
d         1         0          1          1      0       1
g         1         0          1          0      1       1
a         0         1          0          0      1       1
i         0         1          0          1      0       1
u         0         1          0          1      1       1

Thus, an initial sequence of the form dbgbddg... gave rise to the final sequence diibaguuubadiidiiguuu... (each letter being represented by one of the above 6-bit vectors). The sequence was semi-random; consonants occurred randomly, but following a given consonant, the identity and number of following vowels was regular.

The basic network used in the XOR simulation was expanded to provide for the 6-bit input vectors; there were 6 input units, 20 hidden units, 6 output units, and 20 context units.

The training regimen involved presenting each 6-bit input vector, one at a time, in sequence. The task for the network was to predict the next input. (The sequence wrapped around, so that the first pattern was presented after the last.) The network was trained on 200 passes through the sequence. It was then tested on another sequence that obeyed the same regularities, but was created from a different initial randomization.

The error signal for part of this testing phase is shown in Figure 4. Target outputs are shown in parentheses, and the graph plots the corresponding error for each prediction. It is obvious that the error oscillates markedly; at some points in time, the prediction is correct (and error is low), while at other points in time, the ability to predict correctly is quite poor. More precisely, error tends to be high when predicting consonants, and low when predicting vowels.

Given the nature of the sequence, this behavior is sensible. The consonants were ordered randomly, but the vowels were not. Once the network has received a consonant as input, it can predict the identity of the following vowel. Indeed, it can do more; it knows how many tokens of the vowel to expect. At the end of the vowel sequence it has no way to predict the next consonant; at these points in time, the error is high.

This global error pattern does not tell the whole story, however. Remember that the input patterns (which are also the patterns the network is trying to predict) are bit vectors. The error shown in Figure 4 is the sum squared error over all 6 bits. It is revealing to examine the error on a bit-by-bit basis; a graph of the error for bits [1] and [4] (over 20 time steps) is shown in Figure 5. There is a