the movement of the pen-tip, is captured, while in the off-line case only an image of the text is available. Because of the greater ease of extracting relevant features, online recognition generally yields better results [1]. Another crucial division is that between the recognition of isolated characters or words, and the recognition of whole lines of text. Unsurprisingly, the latter is substantially harder, and the excellent results that have been obtained for e.g. digit and character recognition [2], [3] have never been matched for complete lines. Lastly, handwriting recognition can be split into cases where the writing style is constrained in some way (for example, only hand-printed characters are allowed) and the more challenging scenario where it is unconstrained. Despite more than 40 years of handwriting recognition research [2], [3], [4], [5], developing a reliable, general-purpose system for unconstrained text line recognition remains an open problem.

1.1 State-of-the-Art

A well known testbed for isolated handwritten character recognition is the UNIPEN database [6]. Systems that have been found to perform well on UNIPEN include: a writer-independent approach based on hidden Markov models [7]; a hybrid technique called cluster generative statistical dynamic time warping (CSDTW) [8], which combines dynamic time warping with HMMs and embeds clustering and statistical sequence modeling in a single feature space; and a support vector machine with a novel Gaussian dynamic time warping kernel [9]. Typical error rates on UNIPEN range from 3% for digit recognition to about 10% for lower-case character recognition.

Similar techniques can be used to classify isolated words, and this has given good results for small vocabularies (e.g., a writer-dependent word error rate of about 4.5% for 32 words [10]). However, an obvious drawback of whole-word classification is that it does not scale to large vocabularies.

For large vocabulary recognition tasks, such as those considered in this chapter, the usual approach is to recognize individual characters and map them onto complete words using a dictionary. Naively, we could do this by presegmenting words into characters and classifying each segment. However, segmentation is difficult for cursive or unconstrained text, unless the words have already been recognized. This creates a circular dependency between segmentation and recognition that is often referred to as Sayre's paradox [11]. Nonetheless, approaches have been proposed where segmentation is carried out before recognition. Some techniques for character segmentation, based on unsupervised learning and data-driven methods, are given in [3]. Other strategies first segment the text into basic strokes, rather than characters. The stroke boundaries may be defined in various ways, such as the minima of the velocity, the minima of the y-coordinates, or the points of maximum curvature. For example, one online approach first segments the data at the minima of the y-coordinates and then applies self-organizing maps [12]. Another, offline, approach [13] uses the minima of the vertical histogram for an initial estimation of the character boundaries and then applies various heuristics to improve the segmentation.
A more satisfactory solution to Sayre's paradox would be to segment and recognize at the same time. Hidden Markov models (HMMs) are able to do this, which is one reason for their popularity for unconstrained handwriting [14], [15], [16], [17], [18], [19]. The idea of applying HMMs to handwriting recognition was originally motivated by their success in speech recognition [20], where a similar conflict exists between recognition and segmentation. Over the years, numerous refinements of the basic HMM approach have been proposed, such as the writer-independent system considered in [7], which combines point-oriented and stroke-oriented input features.

However, HMMs have several well-known drawbacks. One of these is that they assume the probability of each observation depends only on the current state, which makes contextual effects difficult to model. Another is that HMMs are generative, whereas discriminative models generally give better performance on labeling and classification tasks.

Recurrent neural networks (RNNs) do not suffer from these limitations, and would therefore seem a promising alternative to HMMs. However, the application of RNNs alone to handwriting recognition has so far been limited to isolated character recognition (e.g. [21]). One reason for this is that the traditional neural network objective functions require a separate training signal for every point in the input sequence, which in turn requires presegmented data.

A more successful use of neural networks for handwriting recognition has been to combine them with HMMs in the so-called hybrid approach [22], [23]. A variety of network architectures have been tried for hybrid handwriting recognition, including multilayer perceptrons [24], [25], time delay neural networks (TDNNs) [18], [26], [27], and RNNs [28], [29], [30]. However, although hybrid models alleviate the difficulty of introducing context to HMMs, they still suffer from many of the drawbacks of HMMs, and they do not realize the full potential of RNNs for sequence modeling.

1.2 Contribution

This chapter describes a recently introduced alternative approach, in which a single RNN is trained directly for sequence labeling. The network uses connectionist temporal classification (CTC) combined with the bidirectional Long Short-Term Memory (BLSTM) architecture, which provides access to long-range input context in both directions. A further enhancement which allows the network to work in multiple dimensions will also be presented in this chapter. The so-called Multidimensional LSTM (MDLSTM) is very successful even on raw pixel data.

The rest of this chapter is organized as follows. Section 2 presents the handwritten data and the feature extraction techniques. Subsequently, Section 3 describes the novel neural network classifier. Experimental results are presented in Section 4. Finally, Section 5 concludes this chapter.
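To make the CTC/BLSTM combination of Section 1.2 more concrete, the following is a minimal sketch in PyTorch. It is not the authors' implementation, and the dimensions (25 input features, 100 hidden units, 80 character classes) are illustrative assumptions; the key point is that the CTC objective only needs the target character sequence for each line, not a per-frame segmentation.

```python
# Minimal BLSTM + CTC sketch (illustrative, not the authors' code).
import torch
import torch.nn as nn

class BLSTMCTC(nn.Module):
    def __init__(self, n_features=25, n_hidden=100, n_classes=80):
        super().__init__()
        # Bidirectional LSTM gives access to context from both directions.
        self.blstm = nn.LSTM(n_features, n_hidden,
                             bidirectional=True, batch_first=True)
        # Map concatenated forward/backward states to the character classes
        # plus one extra CTC "blank" label.
        self.fc = nn.Linear(2 * n_hidden, n_classes + 1)

    def forward(self, x):                     # x: (batch, time, n_features)
        h, _ = self.blstm(x)
        return self.fc(h).log_softmax(dim=2)  # per-frame log-probabilities

model = BLSTMCTC()
ctc = nn.CTCLoss(blank=80)                    # blank index = last class
x = torch.randn(4, 200, 25)                   # 4 lines, 200 time steps each
targets = torch.randint(0, 80, (4, 30))       # 4 transcriptions, 30 chars each
logp = model(x).permute(1, 0, 2)              # CTCLoss expects (time, batch, classes)
loss = ctc(logp, targets,
           input_lengths=torch.full((4,), 200),
           target_lengths=torch.full((4,), 30))
loss.backward()                               # gradients for unsegmented training
```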
Fig. 1 Processing steps of the handwriting recognition system

2 Data Processing

As stated above, handwritten data can be acquired in two formats, online and offline. In this section, typical preprocessing and feature extraction techniques are presented. These techniques have been applied in our experiments.
The online and offline databases used are the IAM-OnDB [31] and the IAM-DB [32], respectively. Note that these do not correspond to the same handwriting samples: the IAM-OnDB was acquired from a whiteboard, while the IAM-DB consists of scanned images of handwritten forms.¹

¹ The databases and benchmark tasks are available at http://www.iam.unibe.ch/fki/databases

2.1 General Processing Steps

A recognition system for unconstrained Roman script is usually divided into consecutive units which iteratively process the handwritten input data to finally obtain the transcription. The main units are illustrated in Fig. 1 and summarized in this section. Certainly, there are differences between offline and online processing, but the principles are the same; only the methodology for performing the individual steps differs.

First, preprocessing steps are applied to reduce noise in the raw data. The input is raw handwritten data and the output usually consists of extracted text lines. The amount of effort that needs to be invested into the preprocessing depends on the given data. If the data have been acquired from a system that does not produce any noise and only single words have been recorded, there is nothing to do in this step. But usually the data contain noise which needs to be removed to improve the quality of the handwriting, e.g., by means of image enhancement. The offline images are furthermore binarized, and the online data, which usually contain noisy points and gaps within strokes, are processed with some heuristics to recover from these artifacts. These operations are described in Ref. [33]. The cleaned text data is then automatically divided into lines using some simple heuristics.

Next, the data is normalized, i.e., it is attempted to remove writer-specific characteristics of the handwriting to make writings from different authors look more similar to each other. This is a very important step in any handwriting recognition system, because the writing styles of the writers differ with respect to skew, slant, height, and width of the characters. In the literature there is no standard way of normalizing the data, but many systems use similar techniques. First, the text line is corrected with regard to its skew, i.e., it is rotated so that the baseline is parallel to the x-axis. Then, slant correction is performed so that the slanted writing becomes upright. The next important step is the computation of the baseline and the corpus line. These two lines divide the text into three areas: the upper area, which mainly contains the ascenders of the letters; the middle area, where the corpus of the letters is present; and the lower area, with the descenders of some letters. These three areas are normalized to predefined heights. Often, some additional normalization steps are performed, depending on the domain. In offline recognition, thinning and binarization may be applied. In online recognition the delayed strokes, e.g., the crossing of a "t" or the dot of an "i", are usually removed, and equidistant resampling is applied.
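As an illustration of the skew-correction step, the following is a minimal sketch for online data. It assumes the skew angle is estimated with a simple least-squares line fit through the sampled points; the method actually used in a given system (e.g., fitting only local minima of the trajectory) may differ.

```python
# Illustrative skew correction for an online text line (a common approach,
# not necessarily the exact method used in the system described here).
import numpy as np

def deskew(points):
    """points: array of shape (n, 2) holding (x, y) pen coordinates."""
    x, y = points[:, 0], points[:, 1]
    slope, _ = np.polyfit(x, y, 1)       # least-squares estimate of the skew
    angle = -np.arctan(slope)            # rotate by the negative skew angle
    c, s = np.cos(angle), np.sin(angle)
    rot = np.array([[c, -s], [s, c]])
    return points @ rot.T                # baseline now roughly parallel to x-axis

line = np.array([[0.0, 0.0], [10.0, 1.0], [20.0, 2.1], [30.0, 2.9]])
print(deskew(line))                      # y-values become approximately constant
```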
Subsequently, features are extracted from the normalized data. This particular step is needed because the recognizers need numerical data as their input. However, no standard method for computing the features exists in the literature. One common method in offline recognition of handwritten text lines is the use of a sliding window moving in the writing direction over the text. Features are extracted at every window position, resulting in a sequence of feature vectors. In the case of online recognition the points are already available in a time-ordered sequence, which makes it easier to obtain a sequence of feature vectors in writing order. If there is a fixed size of the input pattern, such as in character or word recognition, one feature vector of a constant size can be extracted for each pattern.

Fig. 2 Features of the vicinity

2.2 Our Online System

In the system described in this chapter, state-of-the-art feature extraction methods are applied to extract the features from the preprocessed data. The feature set input to the online recognizer consists of 25 features which utilize information from both the real online data, stored in XML format, and pseudo-offline information automatically generated from the online data. For each (x, y)-coordinate recorded by the acquisition device a set of 25 features is extracted, resulting in a sequence of 25-dimensional vectors for each given text line. These features can be divided into two classes. The first class consists of features extracted for each point by considering the neighbors with respect to time. The second class takes the offline matrix representation into account, i.e., it is based on spatial information. The features of the first class are the following:

• pen-up/pen-down: a boolean variable indicating whether the pen-tip touches the board or not. Consecutive strokes are connected with straight lines, for which this feature has the value false.