• hat-feature: this binary feature indicates whether a delayed stroke has been removed at the same horizontal position as the considered point.
• speed: the velocity is computed before resampling and then interpolated.
• x-coordinate: the x-position is taken after high-pass filtering, i.e., after subtracting a moving average from the real horizontal position.
• y-coordinate: this feature represents the vertical position of the point after normalization.
• writing direction: here we have a pair of features, given by the cosine and sine of the angle between the line segment starting at the point and the x-axis.
• curvature: similarly to the writing direction, this is a pair of features, given by the cosine and sine of the angle between the lines to the previous and the next point.
• vicinity aspect: this feature is equal to the aspect of the trajectory in the vicinity of the point (see Fig. 2).
• vicinity slope: this pair of features is given by the cosine and sine of the angle of the straight line from the first to the last vicinity point (see Fig. 2).
• vicinity curliness: this feature is defined as the length of the trajectory in the vicinity divided by max(x(t); y(t)) (see Fig. 2).
• vicinity linearity: here we use the average squared distance d² of each point in the vicinity to the straight line from the first to the last vicinity point (see Fig. 2).

A sketch of how some of these point-based features can be computed is given below.
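As a concrete illustration of the point-based features above, the following sketch shows how the writing direction, curvature, vicinity slope, vicinity curliness, and vicinity linearity could be computed for one point of the resampled trajectory. It is a minimal reading of the definitions, not the original implementation; the vicinity size and the interpretation of max(x(t); y(t)) as the larger side of the vicinity's bounding box are assumptions.

```python
import numpy as np

def point_features(points, i, vicinity=5):
    """Illustrative computation of some point-based on-line features for an
    interior point i of a resampled trajectory (points: N x 2 array of x, y)."""
    p_prev, p, p_next = points[i - 1], points[i], points[i + 1]

    # writing direction: cosine/sine of the angle between the segment
    # leaving the current point and the x-axis
    dx, dy = p_next - p
    seg = np.hypot(dx, dy) + 1e-8
    cos_dir, sin_dir = dx / seg, dy / seg

    # curvature: cosine/sine of the angle between the segments to the
    # previous and to the next point
    v1, v2 = p - p_prev, p_next - p
    n1, n2 = np.hypot(*v1) + 1e-8, np.hypot(*v2) + 1e-8
    cos_curv = (v1 @ v2) / (n1 * n2)
    sin_curv = (v1[0] * v2[1] - v1[1] * v2[0]) / (n1 * n2)

    # vicinity of the current point (window size is an assumption)
    vic = points[max(0, i - vicinity): i + vicinity + 1]
    chord = vic[-1] - vic[0]                    # first to last vicinity point
    c_norm = np.hypot(*chord) + 1e-8
    cos_slope, sin_slope = chord / c_norm       # vicinity slope

    # vicinity curliness: trajectory length divided by the larger side
    # of the vicinity's bounding box
    traj_len = np.sum(np.hypot(*np.diff(vic, axis=0).T))
    curliness = traj_len / (max(np.ptp(vic[:, 0]), np.ptp(vic[:, 1])) + 1e-8)

    # vicinity linearity: mean squared distance of the vicinity points
    # to the straight line from the first to the last vicinity point
    normal = np.array([-chord[1], chord[0]]) / c_norm
    linearity = np.mean(((vic - vic[0]) @ normal) ** 2)

    return [cos_dir, sin_dir, cos_curv, sin_curv,
            cos_slope, sin_slope, curliness, linearity]
```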
Fig. 3 Pseudo offline features

The features of the second class are all computed using a two-dimensional matrix B representing the offline version of the data. For each position the number of points on the trajectory of the strokes is stored. This can be seen as a low-resolution image of the handwritten data. The following features are used:

• ascenders/descenders: these two features count the number of points above the corpus line (ascenders) and below the baseline (descenders). Only points which have an x-coordinate in the vicinity of the current point are considered. Additionally, the points must have a minimal distance to the lines to be considered as part of an ascender or descender. The corresponding distances are set to a predefined fraction of the corpus height.
• context map: the two-dimensional vicinity of the current point is transformed to a 3×3 map. The number of black points in each region is taken as a feature value. So we obtain altogether nine features of this type.

2.3 Our Offline System

To extract the feature vectors from the offline images, a sliding window approach is used. The width of the window is one pixel, and nine geometrical features are computed at each window position. Each text line image is therefore converted to a sequence of 9-dimensional vectors. The nine features are as follows:

• The mean gray value of the pixels
• The center of gravity of the pixels
• The second order vertical moment of the center of gravity
• The positions of the uppermost and lowermost black pixels
• The rate of change of these positions (with respect to the neighboring windows)
• The number of black-white transitions between the uppermost and lowermost pixels
• The proportion of black pixels between the uppermost and lowermost pixels

For a more detailed description of the offline features, see [17]. A simplified sketch of this sliding-window extraction is given below.
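The following is a minimal sketch of how such a one-pixel-wide sliding window could be turned into a per-column feature vector. It computes a representative subset of the nine features (mean gray value, center of gravity and its second-order vertical moment, uppermost and lowermost black pixels, black-white transitions, and black-pixel proportion); the binarization threshold and all names are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def sliding_window_features(image, black_threshold=128):
    """Illustrative per-column features for a grayscale text-line image
    (2-D array, 0 = black ink, 255 = white background)."""
    height, width = image.shape
    rows = np.arange(height)
    features = []
    for x in range(width):
        col = image[:, x].astype(float)
        black = col < black_threshold            # binary mask of 'ink' pixels

        mean_gray = col.mean()
        if black.any():
            # center of gravity of the ink and its second-order vertical moment
            weights = black.astype(float)
            cog = (rows * weights).sum() / weights.sum()
            moment2 = ((rows - cog) ** 2 * weights).sum() / weights.sum()
            # uppermost/lowermost black pixels and the profile between them
            top, bottom = rows[black][0], rows[black][-1]
            span = col[top:bottom + 1] < black_threshold
            transitions = np.count_nonzero(np.diff(span.astype(int)) != 0)
            proportion = span.mean()
        else:
            cog = moment2 = transitions = proportion = 0.0
            top, bottom = 0, 0

        features.append([mean_gray, cog, moment2, top, bottom,
                         transitions, proportion])
    return np.array(features)
```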
In the next phase, indicated in Fig. 1, a classification system is applied which generates a list of candidates or even a recognition lattice. This step and the last step, the postprocessing, are described in the next section.

3 Neural Network Based Recognition

The main focus of this chapter is the recently introduced neural network classifier based on CTC combined with Bidirectional or Multidimensional LSTM. This section describes the different aspects of the architecture and gives brief insights into the algorithms behind it.

3.1 Recurrent Neural Networks (RNNs)

Recurrent neural networks (RNNs) are a connectionist model containing a self-connected hidden layer. One benefit of the recurrent connection is that a 'memory' of previous inputs remains in the network's internal state, allowing it to make use of past context. Context plays an important role in handwriting recognition, as illustrated in Figure 4. Another important advantage of recurrency is that the rate of change of the internal state can be finely modulated by the recurrent weights, which builds in robustness to localized distortions of the input data.

Fig. 4 Importance of context. The characters "ur" would be hard to recognize without the context of the word "entourage".

3.2 Long Short-Term Memory (LSTM)

Unfortunately, the range of contextual information that standard RNNs can access is quite limited. The problem is that the influence of a given input on the hidden layer, and therefore on the network output, either decays or blows up exponentially as it cycles around the network's recurrent connections and is repeatedly scaled by the connection weights. In practice this shortcoming (referred to in the literature as the vanishing gradient problem) makes it hard for an RNN to bridge gaps of more than about 10 time steps between relevant input and target events.
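To make the recurrence concrete, here is a minimal sketch (an illustration only; the layer sizes, names, and the choice of tanh as squashing function are assumptions) of a single self-connected hidden layer. At every time step the activation depends on the current input and, through the recurrent weights, on all earlier inputs; this repeated scaling by the recurrent weights is exactly what gives rise to the vanishing gradient problem discussed above.

```python
import numpy as np

def rnn_forward(X, W_i, W_h, f=np.tanh):
    """Minimal recurrent hidden layer.
    X:   (T, n_in) input sequence
    W_i: (n_hidden, n_in) input weights
    W_h: (n_hidden, n_hidden) recurrent weights"""
    T = X.shape[0]
    n_hidden = W_h.shape[0]
    B = np.zeros((T, n_hidden))          # hidden-layer outputs b^t
    b_prev = np.zeros(n_hidden)
    for t in range(T):
        a = W_i @ X[t] + W_h @ b_prev    # a^t = W_i x^t + W_h b^{t-1}
        B[t] = f(a)                      # b^t = f(a^t)
        b_prev = B[t]
    return B
```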
Long Short-Term Memory (LSTM) is an RNN architecture specifically designed to address the vanishing gradient problem. An LSTM hidden layer consists of multiple recurrently connected subnets, known as memory blocks. Each block contains a set of internal units, or cells, whose activation is controlled by three multiplicative units: the input gate, the forget gate, and the output gate. Figure 5 provides a detailed illustration of an LSTM memory block with a single cell.

The effect of the gates is to allow the cells to store and access information over long periods of time. For example, as long as the input gate remains closed (i.e., has an activation close to 0), the activation of the cell will not be overwritten by the new inputs arriving in the network. Similarly, the cell activation is only available to the rest of the network when the output gate is open, and the cell's recurrent connection is switched on and off by the forget gate.

Fig. 5 LSTM memory block with one cell

The mathematical background of LSTM is described in depth in [40, 41, 34]. A short description follows. A conventional recurrent multi-layer perceptron (MLP) network contains a hidden layer in which all neurons are fully connected with all neurons of the same layer (the recurrent connections). The activation of a single cell at timestamp t is a weighted sum of the inputs x_i^t plus the weighted sum of the outputs of the previous timestamp, b_h^{t-1}. This can be expressed as follows (or, on the right, in matrix form):

a^t = \sum_i w_i x_i^t + \sum_h w_h b_h^{t-1} \qquad\Longleftrightarrow\qquad A^t = W_i \cdot X^t + W_h \cdot B^{t-1}

b^t = f(a^t) \qquad\Longleftrightarrow\qquad B^t = f(A^t)

Since the outputs of the previous timestamp are themselves obtained by applying the squashing function to the corresponding cell activations, the influence of the network input from the previous timestamp can be considered smaller, as it has already been weighted a second time. Thus the overall network activation can be roughly rewritten as:

A^t = g\left(X^t,\, W_h X^{t-1},\, W_h^2 X^{t-2},\, \ldots,\, W_h^n X^{t-n}\right)

where X^t is the overall net input at timestamp t and W_h is the weight matrix of the hidden layer. Note that for the sake of clarity we use this abbreviated form of the complete formula, in which the input weights do not appear explicitly (they are hidden in the function g(...)). This formula reveals that the influence of earlier timestamps t-n vanishes rapidly, as the time difference n appears in the exponent of the weight matrix. Since all values of the weight matrix W_h are smaller than 1 in absolute value, the n-th power of W_h is close to zero.
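The following small numeric sketch (an illustration, not taken from the chapter; the matrix size and the range of the random weights are assumptions) shows this effect directly: for a recurrent weight matrix with small entries, the factor W_h^n that scales the contribution of an input lying n steps in the past quickly approaches zero.

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.uniform(-0.5, 0.5, size=(4, 4))   # recurrent weights with |w| < 1

# The norm of W_h^n is (roughly) the factor scaling the contribution of the
# input from n time steps in the past, cf. A^t = g(..., W_h^n X^{t-n}).
for n in (1, 5, 10, 20):
    print(n, np.linalg.norm(np.linalg.matrix_power(W_h, n)))
# For such a matrix the printed norms shrink roughly geometrically with n,
# so inputs more than a few steps in the past barely influence the
# current activation.
```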
Introducing the LSTM cell brings in three new gating units, which all receive the weighted sum of the outputs of the hidden layer at the previous timestamp as an input. For the input gate:

a_\iota^t = W_{i,\iota} \cdot X^t + W_{h,\iota} \cdot B^{t-1} + w_{c,\iota}\, s_c^{t-1}

where s_c^{t-1} is the cell state of the previous timestamp and W_{i,\iota} and W_{h,\iota} are the weights for the current net input and for the hidden-layer output of the previous timestamp, respectively. The activation of the forget gate is:

a_\theta^t = W_{i,\theta} \cdot X^t + W_{h,\theta} \cdot B^{t-1} + w_{c,\theta}\, s_c^{t-1}

which is the same formula, just with other weights (those trained for the forget gate). The cell activation is usually calculated by:

a_c^t = W_{i,c} \cdot X^t + W_{h,c} \cdot B^{t-1}

However, the cell state is then weighted with the outputs of the two gate cells:

s_c^t = \sigma(a_\iota^t)\, g(a_c^t) + \sigma(a_\theta^t)\, s_c^{t-1}

where \sigma indicates that the sigmoid function is used as a squashing function for the gates and g(\cdot) is the cell's activation function. As the sigmoid function often returns a value close to zero or one, the formula can be interpreted as:

s_c^t = [0\ \text{or}\ 1]\, g(a_c^t) + [0\ \text{or}\ 1]\, s_c^{t-1}

or, in words: the cell state depends either on the input activation (if the input gate opens, i.e., the first factor is close to 1) or on the previous cell state (if the forget gate opens, i.e., the second factor is close to one). This particular property enables the LSTM cell to bridge over long time periods. The value of the output gate is calculated similarly to the other gates, i.e.:

a_\omega^t = W_{i,\omega} \cdot X^t + W_{h,\omega} \cdot B^{t-1} + w_{c,\omega}\, s_c^t

and the final cell output is:

b_c^t = \sigma(a_\omega^t)\, h(s_c^t)

which again is either close to zero or the usual output of the cell h(\cdot).
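To tie the above formulas together, the following sketch implements one forward step of a single-cell LSTM memory block with peephole weights, reading the equations literally. It is an illustrative interpretation rather than the authors' implementation; the choice of tanh for both g(·) and h(·), the weight names, and the dimensions in the usage example are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, b_prev, s_prev, p):
    """One timestep of a single-cell LSTM memory block.
    x: net input X^t, b_prev: hidden-layer output B^{t-1},
    s_prev: previous cell state s_c^{t-1}, p: dictionary of weights."""
    # input gate:  a_iota^t  = W_{i,iota} X^t + W_{h,iota} B^{t-1} + w_{c,iota} s_c^{t-1}
    a_in = p['W_xi'] @ x + p['W_hi'] @ b_prev + p['w_ci'] * s_prev
    # forget gate: a_theta^t = same form, with its own trained weights
    a_fg = p['W_xf'] @ x + p['W_hf'] @ b_prev + p['w_cf'] * s_prev
    # cell input:  a_c^t = W_{i,c} X^t + W_{h,c} B^{t-1}
    a_c = p['W_xc'] @ x + p['W_hc'] @ b_prev
    # cell state:  s_c^t = sigma(a_iota^t) g(a_c^t) + sigma(a_theta^t) s_c^{t-1}
    s = sigmoid(a_in) * np.tanh(a_c) + sigmoid(a_fg) * s_prev
    # output gate: its peephole weight sees the *current* cell state s_c^t
    a_out = p['W_xo'] @ x + p['W_ho'] @ b_prev + p['w_co'] * s
    # final cell output: b_c^t = sigma(a_omega^t) h(s_c^t)
    b = sigmoid(a_out) * np.tanh(s)
    return b, s

# Purely illustrative usage with random weights (sizes are arbitrary):
rng = np.random.default_rng(1)
n_in, n_hidden = 3, 4
p = {k: rng.standard_normal(n_in) for k in ('W_xi', 'W_xf', 'W_xc', 'W_xo')}
p.update({k: rng.standard_normal(n_hidden) for k in ('W_hi', 'W_hf', 'W_hc', 'W_ho')})
p.update({k: rng.standard_normal() for k in ('w_ci', 'w_cf', 'w_co')})
b, s = lstm_cell_step(rng.standard_normal(n_in), rng.standard_normal(n_hidden), 0.0, p)
```

When the input gate activation sigmoid(a_in) is close to 0, the new cell input contributes almost nothing to s, so the stored value is preserved, exactly as described in the text above.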