3.3 Bidirectional Recurrent Neural Networks

For many tasks it is useful to have access to future as well as past context. In handwriting recognition, for example, the identification of a given letter is helped by knowing the letters both to the right and left of it. Bidirectional Recurrent Neural Networks (BRNNs) [35] are able to access context in both directions along the input sequence. BRNNs contain two separate hidden layers, one of which processes the inputs forwards, while the other processes them backwards. Both hidden layers are connected to the output layer, which therefore has access to all past and future context of every point in the sequence. Combining BRNNs and LSTM gives bidirectional LSTM (BLSTM) [42].

3.4 Connectionist Temporal Classification (CTC)

Standard RNN objective functions require a presegmented input sequence with a separate target for every segment. This has limited the applicability of RNNs in domains such as cursive handwriting recognition, where segmentation is difficult to determine. Moreover, because the outputs of a standard RNN are a series of independent, local classifications, some form of post-processing is required to transform them into the desired label sequence. Connectionist Temporal Classification (CTC) [36, 34] is an RNN output layer specifically designed for sequence labeling tasks. It does not require the data to be presegmented, and it directly outputs a probability distribution over label sequences. CTC has been shown to outperform RNN-HMM hybrids in a speech recognition task [36].

A CTC output layer contains as many units as there are labels in the task, plus an additional 'blank' or 'no label' unit. The output activations are normalized with the softmax function, so that they sum to 1 and each lie in the range (0, 1):

    y_k^t = \frac{e^{a_k^t}}{\sum_{k'} e^{a_{k'}^t}},

where a_k^t is the unsquashed activation of output unit k at time t, and y_k^t is the activation of the same unit after the softmax function is applied. The above activations are used to estimate the conditional probability p(k, t | x) of observing the label (or blank) with index k at time t in the input sequence x:

    y_k^t = p(k, t | x).
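As a concrete illustration of this normalization step, the following minimal NumPy sketch applies the softmax independently at each time step of a CTC output layer. It is not taken from RNNLIB; the (T, K+1) array layout and the max-subtraction for numerical stability are our own choices.

```python
import numpy as np

def ctc_output_probabilities(a):
    """Softmax over a CTC output layer.

    a: array of shape (T, K+1) holding the unsquashed activations a_k^t
       for K labels plus the 'blank' unit at each of T time steps.
    Returns y of the same shape, where y[t, k] estimates p(k, t | x).
    """
    a = a - a.max(axis=1, keepdims=True)   # subtract the row maximum for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=1, keepdims=True)

# Example: 4 time steps, 2 labels ('a', 'b') plus the blank unit.
rng = np.random.default_rng(0)
y = ctc_output_probabilities(rng.normal(size=(4, 3)))
assert np.allclose(y.sum(axis=1), 1.0)     # the outputs at each time step sum to 1
```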
The conditional probability p(\pi | x) of observing a particular path \pi through the lattice of label observations is then found by multiplying together the label and blank probabilities at every time step:

    p(\pi | x) = \prod_{t=1}^{T} p(\pi_t, t | x) = \prod_{t=1}^{T} y_{\pi_t}^t,

where \pi_t is the label observed at time t along path \pi. Paths are mapped onto label sequences l ∈ L^{≤T}, where L^{≤T} denotes the set of all strings on the alphabet L of length ≤ T, by an operator B that removes first the repeated labels, then the blanks. For example, both B(a, −, a, b, −) and B(−, a, a, −, −, a, b, b) yield the labeling (a, a, b). Since the paths are mutually exclusive, the conditional probability of a given labeling l ∈ L^{≤T} is the sum of the probabilities of all the paths corresponding to it:

    p(l | x) = \sum_{\pi \in B^{-1}(l)} p(\pi | x).

The above step is what allows the network to be trained with unsegmented data. The intuition is that, because we do not know where the labels within a particular transcription will occur, we sum over all the places where they could occur.

In general, a large number of paths will correspond to the same label sequence, so a naïve calculation of the equation above is infeasible. However, it can be efficiently evaluated using a graph-based algorithm, similar to the forward-backward algorithm for HMMs. More details about the CTC forward-backward algorithm appear in [39].
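To make the collapsing operator B and the path sum above concrete, here is a small brute-force sketch. It is purely illustrative: it enumerates all paths, which is exactly the naïve calculation described as infeasible above, whereas a real implementation would use the CTC forward-backward recursion from [39]. The integer label encoding, the choice of index 0 for the blank, and the function names are our own assumptions.

```python
import itertools
import numpy as np

BLANK = 0  # index of the 'blank' unit (an arbitrary convention for this sketch)

def collapse(path):
    """The operator B: remove repeated labels first, then the blanks."""
    no_repeats = [k for i, k in enumerate(path) if i == 0 or k != path[i - 1]]
    return tuple(k for k in no_repeats if k != BLANK)

def label_probability(y, label_seq):
    """Brute-force p(l | x): sum p(pi | x) over every path pi with B(pi) = l.

    y: array of shape (T, K+1) of per-timestep output probabilities y_k^t.
    Feasible only for tiny T; the forward-backward algorithm computes the
    same quantity efficiently.
    """
    T, num_units = y.shape
    total = 0.0
    for path in itertools.product(range(num_units), repeat=T):
        if collapse(path) == tuple(label_seq):
            total += np.prod([y[t, k] for t, k in enumerate(path)])
    return total

# Example with T = 4 time steps and labels {1: 'a', 2: 'b'} plus blank 0.
y = np.full((4, 3), 1.0 / 3)            # uniform outputs, just for illustration
print(collapse((1, 0, 1, 2, 0)))        # -> (1, 1, 2), i.e. B(a,-,a,b,-) = (a,a,b)
print(label_probability(y, (1, 2)))     # probability of the labeling (a, b)
```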
3.5 Multidimensional Recurrent Neural Networks

Ordinary RNNs are designed for time series and other data with a single spatio-temporal dimension. However, the benefits of RNNs (such as robustness to input distortion, and flexible use of surrounding context) are also advantageous for multidimensional data, such as images and video sequences.

Multidimensional recurrent neural networks (MDRNNs) [43, 34], a special case of Directed Acyclic Graph RNNs [44], generalize the basic structure of RNNs to multidimensional data. Rather than having a single recurrent connection, MDRNNs have as many recurrent connections as there are spatio-temporal dimensions in the data. This allows them to access previous context information along all input directions.

Multidirectional MDRNNs are the generalization of bidirectional RNNs to multiple dimensions. For an n-dimensional data sequence, 2^n different hidden layers are used to scan through the data in all directions. As with bidirectional RNNs, all the layers are connected to a single output layer, which therefore has access to context information in both directions along all dimensions. Multidimensional LSTM (MDLSTM) is the generalization of bidirectional LSTM to multidimensional data.

3.6 Hierarchical Subsampling Recurrent Neural Networks

Hierarchical subsampling is a common technique in computer vision [45] and other domains with large input spaces. The basic principle is to iteratively re-represent the data at progressively lower resolutions, using a hierarchy of feature extractors. The features extracted at each level are subsampled and used as input to the next level. The number and complexity of the features typically increase as one climbs the hierarchy. This is much more efficient for high-resolution data than a single 'flat' feature extractor, since most of the computations are carried out on low-resolution feature maps, rather than, for example, raw pixels.

A well-known connectionist hierarchical subsampling architecture is Convolutional Neural Networks [46]. Hierarchical subsampling is also possible with RNNs, and hierarchies of MDLSTM layers have been applied to offline handwriting recognition [47]. Hierarchical subsampling with LSTM is equally useful for long 1D sequences, such as raw speech data or online handwriting trajectories with a high sampling rate.

From the point of view of handwriting recognition, the most interesting aspect of hierarchical subsampling RNNs is that they can be applied directly to the raw input data (offline images or online point-sequences) without any normalization or feature extraction.
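The sketch below illustrates the hierarchical subsampling principle only, not the actual MDLSTM hierarchy of [47]: each level applies a feature extractor (here a mere stand-in, a random 1x1 projection followed by tanh) and then averages non-overlapping blocks to halve the spatial resolution, while the number of feature channels grows from level to level. The pooling scheme, the projection, and the channel counts are arbitrary choices made for illustration.

```python
import numpy as np

def subsample_blocks(feature_map, block=(2, 2)):
    """Pool non-overlapping blocks of a 2-D feature map (average pooling),
    halving the resolution along both spatial dimensions for block=(2, 2)."""
    H, W, C = feature_map.shape
    bh, bw = block
    H2, W2 = H // bh, W // bw
    x = feature_map[:H2 * bh, :W2 * bw].reshape(H2, bh, W2, bw, C)
    return x.mean(axis=(1, 3))

def toy_feature_extractor(x, out_channels, rng):
    """Stand-in for the per-level feature extractor (MDLSTM layers in [47]):
    a random 1x1 projection followed by tanh."""
    W = rng.normal(scale=0.1, size=(x.shape[-1], out_channels))
    return np.tanh(x @ W)

rng = np.random.default_rng(0)
image = rng.random((64, 64, 1))                # raw line image, no handcrafted features
level1 = subsample_blocks(toy_feature_extractor(image, 6, rng))    # -> (32, 32, 6)
level2 = subsample_blocks(toy_feature_extractor(level1, 20, rng))  # -> (16, 16, 20)
print(level1.shape, level2.shape)
```

Most of the computation in such a hierarchy is spent on the small, low-resolution maps at the upper levels, which is what makes the approach efficient for high-resolution inputs.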
4 Experiments

The experiments have been performed with the freely available RNNLIB tool by Alex Graves (http://sourceforge.net/projects/rnnl/). This tool implements the network architecture and furthermore provides examples for the recognition of several scripts.

4.1 Comparison with HMMs on the IAM Databases

The aim of the first experiments was to evaluate the performance of the complete RNN handwriting recognition system, illustrated in Figure 6, for both online and offline handwriting. In particular, we wanted to see how it compared to an HMM-based system. The online and offline databases used were the IAM-OnDB and the IAM-DB respectively (see above). Note that these do not correspond to the same handwriting samples: the IAM-OnDB was acquired from a whiteboard, while the IAM-DB consists of scanned images of handwritten forms.

Fig. 6 Complete RNN handwriting recognition system (here applied to offline Arabic data)

To make the comparisons fair, the same online and offline preprocessing was used for both the HMM and RNN systems. In addition, the same dictionaries and language models were used for the two systems.

For all the experiments, the task was to transcribe the text lines in the test set, using the words in the dictionary. The basic performance measure was the word accuracy:

    100 \cdot \left( 1 - \frac{\text{insertions} + \text{substitutions} + \text{deletions}}{\text{number of words in transcription}} \right),

where the number of word insertions, substitutions, and deletions is summed over the whole test set. For the RNN system, we also recorded the character accuracy, defined as above except with characters instead of words.
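To spell out how this measure is computed, the sketch below aligns each recognized word sequence with its reference transcription using a standard edit-distance (Levenshtein) dynamic program, accumulates the insertion, substitution, and deletion counts over the test set, and applies the formula above. The function names and the whitespace tokenization are our own choices; the original evaluation code is not shown in this chapter.

```python
def edit_operations(ref, hyp):
    """Count (substitutions, insertions, deletions) needed to turn the
    reference word list into the hypothesis, via dynamic programming."""
    # d[i][j] = (total cost, subs, ins, dels) for ref[:i] vs hyp[:j]
    d = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        d[i][0] = (i, 0, 0, i)                 # delete all remaining reference words
    for j in range(1, len(hyp) + 1):
        d[0][j] = (j, 0, j, 0)                 # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                d[i][j] = d[i - 1][j - 1]      # words match: no edit needed
            else:
                diag, left, up = d[i - 1][j - 1], d[i][j - 1], d[i - 1][j]
                sub = (diag[0] + 1, diag[1] + 1, diag[2], diag[3])
                ins = (left[0] + 1, left[1], left[2] + 1, left[3])
                dele = (up[0] + 1, up[1], up[2], up[3] + 1)
                d[i][j] = min(sub, ins, dele)  # keep the cheapest alignment
    return d[len(ref)][len(hyp)][1:]           # (subs, ins, dels)

def word_accuracy(references, hypotheses):
    """100 * (1 - (ins + subs + dels) / number_of_words_in_transcription),
    with the error counts summed over the whole test set."""
    total_errors, total_words = 0, 0
    for ref, hyp in zip(references, hypotheses):
        subs, ins, dels = edit_operations(ref.split(), hyp.split())
        total_errors += subs + ins + dels
        total_words += len(ref.split())
    return 100.0 * (1.0 - total_errors / total_words)

# One substitution and one insertion against a 4-word reference -> 50.0
print(word_accuracy(["the quick brown fox"], ["the quik brown fox jumps"]))
```

Replacing words with characters in the same computation gives the character accuracy mentioned above.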
Table 1 Main results for online data

    System         Word Accuracy   Character Accuracy
    HMM            65.0%           -
    CTC (BLSTM)    79.7%           88.5%

Table 2 Main results for offline data

    System         Word Accuracy   Character Accuracy
    HMM            64.5%           -
    CTC (BLSTM)    74.1%           81.8%

As can be seen from Tables 1 and 2, the RNN substantially outperformed the HMM on both databases. To put these results in perspective, the Microsoft tablet PC handwriting recognizer [37] gave a word accuracy score of 71.32% on the online test set. This result is not directly comparable to our own, since the Microsoft system was trained on a different training set, and uses considerably more sophisticated language modeling than the HMM and RNN systems we implemented. However, it indicates that the RNN-based recognizer is competitive with the best commercial systems for unconstrained handwriting.

4.2 Recognition Performance of MDLSTM on Contest Data

The MDLSTM system participated in three handwriting recognition contests at ICDAR 2009 (see the proceedings in [38]). The recognition tasks were based on different scripts. In all cases, the systems had to recognize handwriting from unknown writers.

Table 3 Summarized results from the online Arabic handwriting recognition competition

    System            Word Accuracy   Time/Image
    REGIM HMM         52.67%          6402.24 ms
    Vision Objects    98.99%          69.41 ms
    CTC (BLSTM)       95.70%          1377.22 ms

Table 4 Summarized results from the offline Arabic handwriting recognition competition

    System              Word Accuracy   Time/Image
    Arab-Reader HMM     76.66%          2583.64 ms
    Multi-Stream HMM    74.51%          143269.81 ms
    CTC (MDLSTM)        81.06%          371.61 ms