© 2000 by CRC Press LLC

Unfolding

The unfolding transformation is similar to loop unrolling. In J-unfolding, each node is replaced by J nodes and each edge is replaced by J edges. The J-unfolded data flow graph executes J iterations of the original algorithm [Parhi, 1991]. The unfolding transformation can unravel the hidden concurrency in a data flow program. The achievable iteration period for a J-unfolded data flow graph is 1/J times the critical path length of the unfolded data flow graph. By exploiting interiteration concurrency, unfolding can lead to a lower iteration period in the context of a software programmable multiprocessor implementation.

The unfolding transformation can also be applied in the context of hardware design. If we apply an unfolding transformation to a (word-serial) nonrecursive algorithm, the resulting data flow graph represents a word-parallel (or simply parallel) algorithm that processes multiple samples or words in parallel every clock cycle. If we apply 2-unfolding to the 3-tap FIR filter in Fig. 18.1(a), we obtain the data flow graph of Fig. 18.2.

FIGURE 18.3 (a) A two-stage pipelinable time-invariant lattice digital filter. If multiplication and addition operations require 2 and 1 time units, respectively, then this data flow graph can achieve a sampling period of 10 time units (which corresponds to the critical path M1 → A2 → M2 → A1 → M3 → A3 → A4). (b) The pipelined/retimed lattice digital filter can achieve a sampling period of 2 time units.
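The word-parallel effect of 2-unfolding a 3-tap FIR filter can be sketched in software. The following Python model is a hypothetical illustration (function names and coefficients are mine, not from the text): each iteration of the unfolded version consumes and produces two samples, mirroring the two concurrent output streams of the unfolded data flow graph.

```python
def fir_reference(x, a, b, c):
    """Word-serial 3-tap FIR y(n) = a*x(n) + b*x(n-1) + c*x(n-2);
    one output per iteration, zero initial conditions."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0
        x2 = x[n - 2] if n >= 2 else 0
        y.append(a * x[n] + b * x1 + c * x2)
    return y


def fir_2_unfolded(x, a, b, c):
    """2-unfolded FIR: each iteration processes two samples in parallel."""
    def get(n):  # zero initial conditions for negative indices
        return x[n] if n >= 0 else 0

    y = [0] * len(x)
    for k in range(len(x) // 2):
        # The even- and odd-indexed outputs correspond to the two
        # concurrent paths of the 2-unfolded data flow graph.
        y[2*k]     = a * get(2*k)     + b * get(2*k - 1) + c * get(2*k - 2)
        y[2*k + 1] = a * get(2*k + 1) + b * get(2*k)     + c * get(2*k - 1)
    return y
```

Both functions compute the same samples; the unfolded version simply exposes two iterations' worth of work per loop pass.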
Because the unfolding algorithm is based on a graph-theoretic approach, it can also be applied at the bit level. Thus, unfolding of a bit-serial data flow program by a factor of J leads to a digit-serial program with digit size J. The digit size represents the number of bits processed per clock cycle. The digit-serial architecture is clocked at the same rate as the bit-serial architecture (assuming that the clock rate is limited by the communication I/O bound well before reaching the computation bound of the bit-serial program). Because the digit-serial program processes J bits per clock cycle, the effective bit rate of the digit-serial architecture is J times higher. A simple example of this unfolding is illustrated in Fig. 18.4, where the bit-serial adder in Fig. 18.4(a) is unfolded by a factor of 2 to obtain the digit-serial adder in Fig. 18.4(b), with digit size 2 for a word length of 4. In obvious ways, the unfolding transformation can be applied at both the word level and the bit level simultaneously to generate word-parallel digit-serial architectures. Such architectures process multiple words per clock cycle and process a digit of each word (not the entire word).

Folding Transformation

The folding transformation is the reverse of the unfolding transformation. While the unfolding transformation is simpler, the folding transformation is more difficult [Parhi et al., 1992]. The folding transformation can be applied to fold a bit-parallel architecture to a digit-serial or bit-serial one, or to fold a digit-serial architecture to a bit-serial one. It can also be applied to fold an algorithm data flow graph to a hardware data flow graph for a specified folding set. The folding set indicates the processor in which and the time partition at which a task is executed. A specified folding set may be infeasible, and this needs to be detected first.
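The bit-serial adder of Fig. 18.4(a) and its 2-unfolded digit-serial counterpart can be modeled behaviorally. The sketch below is my own illustration (the text gives only the data flow graphs, not code): the bit-serial adder processes one LSB-first bit per clock with the carry held in a register, while the digit-serial adder processes J = 2 bits per clock, with the carry rippling within each digit and registered between digits.

```python
def bit_serial_add(a_bits, b_bits):
    """LSB-first bit-serial addition: one bit per clock cycle,
    carry stored in a flip-flop between cycles (carry-out dropped)."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):  # one clock cycle per bit
        s = a ^ b ^ carry
        carry = (a & b) | (carry & (a ^ b))
        out.append(s)
    return out


def digit_serial_add(a_bits, b_bits, J=2):
    """2-unfolded (digit-serial) adder: J bit-slices per clock cycle;
    the carry ripples through the digit combinationally."""
    carry, out = 0, []
    for i in range(0, len(a_bits), J):  # one clock cycle per digit
        for a, b in zip(a_bits[i:i + J], b_bits[i:i + J]):
            s = a ^ b ^ carry
            carry = (a & b) | (carry & (a ^ b))
            out.append(s)
    return out
```

For a word length of 4, the digit-serial version produces the same sum bits in half as many clock cycles, which is the J-fold effective bit-rate gain described above.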
The folding transformation performs a preprocessing step to detect feasibility and in the feasible case transforms the algorithm data flow graph to an equivalent pipelined/retimed data flow graph that can be folded. For the special case of regular data flow graphs and for linear space–time mappings, the folding tranformation reduces to systolic array design. In the folded architecture, each edge in the algorithm data flow graph is mapped to a communicating edge in the hardware architecture data flow graph. Consider an edge U ÆV in the algorithm data flow graph with associated number of delays i(U ÆV). Let the tasks U and V be mapped to the hardware units HU and HV , respectively. Assume that N time partitions are available, i.e., the iteration period is N. A modulo operation determines the time partition. For example, the time unit 18 for N = 4 corresponds to time partition 18 modulo FIGURE 18.4 (a) A least-significant-bit first bit-serial adder for word length of 4; (b) a digit-serial adder with digit size 2 obtained by two-unfolding of the bit-serial adder. The bit position 0 stands for least significant bit
4, or 2. Let the tasks U and V be executed in time partitions u and v, i.e., the lth iterations of tasks U and V are executed in time units Nl + u and Nl + v, respectively. The i(U → V) delays in the edge U → V imply that the result of the lth iteration of U is used for the (l + i)th iteration of V. The (l + i)th iteration of V is executed in time unit N(l + i) + v. Thus, the number of storage units needed in the folded edge corresponding to the edge U → V is

DF(U → V) = N(l + i) + v – Nl – u – PU = Ni + v – u – PU

where PU is the level of pipelining of the hardware operator HU. The DF(U → V) delays should be connected to the edge between HU and HV, and this signal should be switched to the input of HV at time partition v. If the DF(U → V)'s as calculated here were always nonnegative for all edges U → V, then the problem would be solved. However, some DF(·)'s may be negative. The algorithm data flow graph needs to be pipelined and retimed such that all the DF(·)'s are nonnegative. This can be formulated by simple inequalities using the retiming variables. The retiming formulation can be solved as a path problem, and the retiming variables can be determined if a solution exists. The algorithm data flow graph can then be retimed for folding and the calculation of the DF(·)'s repeated. The folded hardware architecture data flow graph can now be completed.

The folding technique is illustrated in Fig. 18.5. The algorithm data flow graph of the two-stage pipelined lattice recursive digital filter of Fig. 18.3(a) is folded for the folding set shown in Fig. 18.5. Fig. 18.5(a) shows the pipelined/retimed data flow graph (preprocessed for folding) and Fig. 18.5(b) shows the hardware architecture data flow graph obtained after folding.

As indicated before, a special case of folding can address systolic array design for regular data flow graphs and for linear mappings.
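The folded-delay computation DF(U → V) = Ni + v – u – PU can be captured in a few lines. The helper below is my own sketch (the function names and the tuple encoding of an edge are assumptions, not from the text); it evaluates the formula for each edge and flags the negative counts that signal the need for pipelining/retiming.

```python
def folded_delays(N, i, u, v, P_U):
    """DF(U -> V) = N*i + v - u - P_U.
    N: iteration period (number of time partitions),
    i: delays on edge U -> V in the algorithm data flow graph,
    u, v: time partitions of tasks U and V,
    P_U: pipelining level of the hardware operator H_U."""
    return N * i + v - u - P_U


def needs_retiming(edges, N):
    """edges: list of (i, u, v, P_U) tuples.
    Returns the edges whose folded delay count is negative, i.e.,
    the edges that make the folding set infeasible without retiming."""
    return [e for e in edges if folded_delays(N, *e) < 0]
```

For example, with N = 2, an edge with i = 1, u = 0, v = 1 and a two-stage operator (P_U = 2) folds with one storage unit, while an edge with i = 0 and the same operator yields a negative count and so must be retimed first (the edge values here are illustrative, not taken from Fig. 18.5).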
The systolic architectures make use of extensive pipelining and local communication and operate in a synchronous manner [Kung, 1988]. The systolic processors can also be made to operate in an asynchronous manner, and such systems are often referred to as wavefront processors. Systolic architectures have been designed for a variety of applications, including convolution, matrix solvers, matrix decomposition, and filtering.

Look-Ahead Technique

The look-ahead technique is a powerful technique for pipelining recursive signal processing algorithms [Parhi and Messerschmitt, 1989]. It can transform a sequential recursive algorithm into an equivalent concurrent one, which can then be realized using pipelining, parallel processing, or both. This technique has been successfully applied to pipeline many signal processing algorithms, including recursive digital filters (in direct form and lattice form), adaptive lattice digital filters, two-dimensional recursive digital filters, Viterbi decoders, Huffman decoders, and finite state machines. This research demonstrated that recursive signal processing algorithms can be operated at high speed. This is an important result, since modern signal processing applications in radar and image processing, and particularly in high-definition and super-high-definition television video signal processing, require very high throughput. Traditional algorithms and topologies cannot be used for such high-speed applications because of the inherent speed bound of the algorithm created by the feedback loops. The look-ahead transformation creates additional concurrency in the signal processing algorithms, and the speed bound of the transformed algorithms is increased substantially. The look-ahead transformation is not free from drawbacks, however: it is accompanied by an increase in hardware overhead. This difficulty has encouraged us to develop inherently pipelinable topologies for recursive signal processing algorithms.
Fortunately, this is possible to achieve in adaptive digital filters using relaxations on the look-ahead or by the use of relaxed look-ahead [Shanbhag and Parhi, 1992]. To begin, consider a time-invariant one-pole recursive digital filter transfer function H z X z U z az ( ) ( ) ( ) = = - - 1 1 1
described by the difference equation

x(n) = ax(n – 1) + u(n)

and shown in Fig. 18.6(a). The maximum achievable speed in this system is limited by the operating speed of one multiply–add operation. To increase the speed of this system by a factor of 2, we can express x(n) in terms of x(n – 2) by substitution of one recursion within the other:

x(n) = a[ax(n – 2) + u(n – 1)] + u(n) = a²x(n – 2) + au(n – 1) + u(n)

FIGURE 18.5 (a) A pipelined/retimed data flow graph obtained from Fig. 18.3(a) by preprocessing for folding; (b) the folded hardware architecture data flow graph. In our folding notation, the tasks are ordered within a set and the ordering represents the time partition in which the task is executed. For example, SA1 = (A2, A1) implies that A2 and A1 are, respectively, executed in even and odd time partitions in the same processor. The notation φ represents a null operation.
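The equivalence of the original recursion and its 2-step look-ahead form can be checked numerically. The sketch below is my own (the coefficient and input values are arbitrary, chosen only for illustration): the look-ahead version depends on x(n – 2) rather than x(n – 1), so its loop can be pipelined by an extra stage, yet it produces the same output sequence.

```python
def filter_original(u, a):
    """Original one-pole recursion x(n) = a*x(n-1) + u(n),
    with zero initial state."""
    x = []
    for n in range(len(u)):
        prev = x[n - 1] if n >= 1 else 0.0
        x.append(a * prev + u[n])
    return x


def filter_look_ahead(u, a):
    """2-step look-ahead form x(n) = a^2*x(n-2) + a*u(n-1) + u(n).
    The loop now spans two delays, enabling pipelining by 2."""
    x = []
    for n in range(len(u)):
        x2 = x[n - 2] if n >= 2 else 0.0
        u1 = u[n - 1] if n >= 1 else 0.0
        x.append(a * a * x2 + a * u1 + u[n])
    return x
```

Substituting the initial conditions shows the two recursions agree for every n, which is exactly what the algebraic substitution above establishes; the price, as noted, is the extra multiply–add terms (the hardware overhead of look-ahead).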