FINDING STRUCTURE IN TIME 189

Figure 4. Graph of root mean squared error in letter prediction task. Labels indicate the correct output prediction at each point in time. Error is computed over the entire output vector.

striking difference in the error patterns. Error on predicting the first bit is consistently lower than error for the fourth bit, at all points in time. Why should this be so?

The first bit corresponds to the feature Consonant; the fourth bit corresponds to the feature High. It happens that while all consonants have the same value for the feature Consonant, they differ for High. The network has learned which vowels follow which consonants; this is why error on vowels is low. It has also learned how many vowels follow each consonant. An interesting corollary is that the network also knows how soon to expect the next consonant. The network cannot know which consonant, but it can predict correctly that a consonant follows. This is why the bit patterns for Consonant show low error, and the bit patterns for High show high error. (It is this behavior which requires the use of context units; a simple feedforward network could learn the transitional probabilities from one input to the next, but could not learn patterns that span more than two inputs.)
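The copy-back role of the context units, and the per-bit error measure plotted in the figures, can be sketched roughly as follows. This is an illustrative sketch only: the layer sizes, weight scales, and random sequence are hypothetical, the weights are untrained, and no claim is made that this matches the paper's exact architecture or training regime.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not from the paper): 6-bit letter vectors,
# a small hidden layer, and context units mirroring the hidden layer.
n_in, n_hidden, n_out = 6, 20, 6

W_ih = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_ch = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # context -> hidden
W_ho = rng.normal(scale=0.1, size=(n_out, n_hidden))     # hidden -> output

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(sequence):
    """Run the network over a sequence, returning a prediction per step.

    The context units hold a copy of the previous hidden state, so the
    hidden layer sees the current input plus a trace of the past. This
    copy-back is exactly what a pure feedforward network lacks.
    """
    context = np.zeros(n_hidden)
    predictions = []
    for x in sequence:
        hidden = sigmoid(W_ih @ x + W_ch @ context)
        predictions.append(sigmoid(W_ho @ hidden))
        context = hidden  # copy hidden state back into the context units
    return np.array(predictions)

# Per-bit RMS error against the next letter, as in the Figure 5 panels
# (bit 1 = Consonantal vs. bit 4 = High), on a made-up random sequence:
seq = rng.integers(0, 2, size=(10, n_in)).astype(float)
preds = forward(seq[:-1])          # predict letter t+1 from letter t
targets = seq[1:]
rms_per_bit = np.sqrt(((preds - targets) ** 2).mean(axis=0))
```

With trained weights and the paper's CV-structured input, `rms_per_bit` would show the asymmetry described above: low error on the Consonant bit, high error on the High bit.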
Figure 5(a). Graph of root mean squared error in letter prediction task. Error is computed on bit 1, representing the feature CONSONANTAL.

Figure 5(b). Graph of root mean squared error in letter prediction task. Error is computed on bit 4, representing the feature HIGH.

190