© 2000 by CRC Press LLC

Unfolding

The unfolding transformation is similar to loop unrolling. In J-unfolding, each node is replaced by J nodes and each edge is replaced by J edges. The J-unfolded data flow graph executes J iterations of the original algorithm [Parhi, 1991]. The unfolding transformation can unravel the hidden concurrency in a data flow program. The achievable iteration period for a J-unfolded data flow graph is 1/J times the critical path length of the unfolded data flow graph. By exploiting interiteration concurrency, unfolding can lead to a lower iteration period in the context of a software programmable multiprocessor implementation.

The unfolding transformation can also be applied in the context of hardware design. If we apply an unfolding transformation to a (word-serial) nonrecursive algorithm, the resulting data flow graph represents a word-parallel (or simply parallel) algorithm that processes multiple samples or words in parallel every clock cycle. If we apply 2-unfolding to the 3-tap FIR filter in Fig. 18.1(a), we obtain the data flow graph of Fig. 18.2.

FIGURE 18.3 (a) A two-stage pipelinable time-invariant lattice digital filter. If multiplication and addition operations require 2 and 1 time units, respectively, then this data flow graph can achieve a sampling period of 10 time units (which corresponds to the critical path M1 → A2 → M2 → A1 → M3 → A3 → A4). (b) The pipelined/retimed lattice digital filter can achieve a sampling period of 2 time units.
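The word-parallel effect of 2-unfolding a 3-tap FIR filter can be sketched in software. The following Python model is a hypothetical illustration (function names and coefficients are mine, not from the text): each iteration of the unfolded version consumes and produces two samples, mirroring the two concurrent output streams of the unfolded data flow graph.

```python
def fir_reference(x, a, b, c):
    """Word-serial 3-tap FIR y(n) = a*x(n) + b*x(n-1) + c*x(n-2);
    one output per iteration, zero initial conditions."""
    y = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0
        x2 = x[n - 2] if n >= 2 else 0
        y.append(a * x[n] + b * x1 + c * x2)
    return y


def fir_2_unfolded(x, a, b, c):
    """2-unfolded FIR: each iteration processes two samples in parallel."""
    def get(n):  # zero initial conditions for negative indices
        return x[n] if n >= 0 else 0

    y = [0] * len(x)
    for k in range(len(x) // 2):
        # The even- and odd-indexed outputs correspond to the two
        # concurrent paths of the 2-unfolded data flow graph.
        y[2*k]     = a * get(2*k)     + b * get(2*k - 1) + c * get(2*k - 2)
        y[2*k + 1] = a * get(2*k + 1) + b * get(2*k)     + c * get(2*k - 1)
    return y
```

Both functions compute the same samples; the unfolded version simply exposes two iterations' worth of work per loop pass.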
Because the unfolding algorithm is based on a graph-theoretic approach, it can also be applied at the bit level. Thus, unfolding of a bit-serial data flow program by a factor of J leads to a digit-serial program with digit size J. The digit size represents the number of bits processed per clock cycle. The digit-serial architecture is clocked at the same rate as the bit-serial architecture (assuming that the clock rate is limited by the communication I/O bound well before reaching the computation bound of the bit-serial program). Because the digit-serial program processes J bits per clock cycle, the effective bit rate of the digit-serial architecture is J times higher. A simple example of this unfolding is illustrated in Fig. 18.4, where the bit-serial adder in Fig. 18.4(a) is unfolded by a factor of 2 to obtain the digit-serial adder in Fig. 18.4(b), with digit size 2 for a word length of 4. In obvious ways, the unfolding transformation can be applied at both the word level and the bit level simultaneously to generate word-parallel digit-serial architectures. Such architectures process multiple words per clock cycle and process a digit of each word (not the entire word).

Folding Transformation

The folding transformation is the reverse of the unfolding transformation. While the unfolding transformation is simpler, the folding transformation is more difficult [Parhi et al., 1992]. The folding transformation can be applied to fold a bit-parallel architecture to a digit-serial or bit-serial one, or to fold a digit-serial architecture to a bit-serial one. It can also be applied to fold an algorithm data flow graph to a hardware data flow graph for a specified folding set. The folding set indicates the processor in which and the time partition at which a task is executed. A specified folding set may be infeasible, and this needs to be detected first.
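The bit-serial adder of Fig. 18.4(a) and its 2-unfolded digit-serial counterpart can be modeled behaviorally. The sketch below is my own illustration (the text gives only the data flow graphs, not code): the bit-serial adder processes one LSB-first bit per clock with the carry held in a register, while the digit-serial adder processes J = 2 bits per clock, with the carry rippling within each digit and registered between digits.

```python
def bit_serial_add(a_bits, b_bits):
    """LSB-first bit-serial addition: one bit per clock cycle,
    carry stored in a flip-flop between cycles (carry-out dropped)."""
    carry, out = 0, []
    for a, b in zip(a_bits, b_bits):  # one clock cycle per bit
        s = a ^ b ^ carry
        carry = (a & b) | (carry & (a ^ b))
        out.append(s)
    return out


def digit_serial_add(a_bits, b_bits, J=2):
    """2-unfolded (digit-serial) adder: J bit-slices per clock cycle;
    the carry ripples through the digit combinationally."""
    carry, out = 0, []
    for i in range(0, len(a_bits), J):  # one clock cycle per digit
        for a, b in zip(a_bits[i:i + J], b_bits[i:i + J]):
            s = a ^ b ^ carry
            carry = (a & b) | (carry & (a ^ b))
            out.append(s)
    return out
```

For a word length of 4, the digit-serial version produces the same sum bits in half as many clock cycles, which is the J-fold effective bit-rate gain described above.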
The folding transformation performs a preprocessing step to detect feasibility and in the feasible case transforms the algorithm data flow graph to an equivalent pipelined/retimed data flow graph that can be folded. For the special case of regular data flow graphs and for linear space–time mappings, the folding tranformation reduces to systolic array design. In the folded architecture, each edge in the algorithm data flow graph is mapped to a communicating edge in the hardware architecture data flow graph. Consider an edge U ÆV in the algorithm data flow graph with associated number of delays i(U ÆV). Let the tasks U and V be mapped to the hardware units HU and HV , respectively. Assume that N time partitions are available, i.e., the iteration period is N. A modulo operation determines the time partition. For example, the time unit 18 for N = 4 corresponds to time partition 18 modulo FIGURE 18.4 (a) A least-significant-bit first bit-serial adder for word length of 4; (b) a digit-serial adder with digit size 2 obtained by two-unfolding of the bit-serial adder. The bit position 0 stands for least significant bit
4, or 2. Let the tasks U and V be executed in time partitions u and v, i.e., the lth iterations of tasks U and V are executed in time units Nl + u and Nl + v, respectively. The i(U → V) delays in the edge U → V imply that the result of the lth iteration of U is used for the (l + i)th iteration of V. The (l + i)th iteration of V is executed in time unit N(l + i) + v. Thus, the number of storage units needed in the folded edge corresponding to the edge U → V is

DF(U → V) = N(l + i) + v – Nl – u – PU = Ni + v – u – PU

where PU is the level of pipelining of the hardware operator HU. The DF(U → V) delays should be connected to the edge between HU and HV, and this signal should be switched to the input of HV at time partition v. If the DF(U → V)'s as calculated here were always nonnegative for all edges U → V, then the problem would be solved. However, some DF(·)'s may be negative. The algorithm data flow graph needs to be pipelined and retimed such that all the DF(·)'s are nonnegative. This can be formulated by simple inequalities using the retiming variables. The retiming formulation can be solved as a path problem, and the retiming variables can be determined if a solution exists. The algorithm data flow graph can then be retimed for folding and the calculation of the DF(·)'s repeated. The folded hardware architecture data flow graph can now be completed.

The folding technique is illustrated in Fig. 18.5. The algorithm data flow graph of the two-stage pipelined lattice recursive digital filter of Fig. 18.3(a) is folded for the folding set shown in Fig. 18.5. Fig. 18.5(a) shows the pipelined/retimed data flow graph (preprocessed for folding) and Fig. 18.5(b) shows the hardware architecture data flow graph obtained after folding.

As indicated before, a special case of folding can address systolic array design for regular data flow graphs and for linear mappings.
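The folded-delay computation DF(U → V) = Ni + v – u – PU can be captured in a few lines. The helper below is my own sketch (the function names and the tuple encoding of an edge are assumptions, not from the text); it evaluates the formula for each edge and flags the negative counts that signal the need for pipelining/retiming.

```python
def folded_delays(N, i, u, v, P_U):
    """DF(U -> V) = N*i + v - u - P_U.
    N: iteration period (number of time partitions),
    i: delays on edge U -> V in the algorithm data flow graph,
    u, v: time partitions of tasks U and V,
    P_U: pipelining level of the hardware operator H_U."""
    return N * i + v - u - P_U


def needs_retiming(edges, N):
    """edges: list of (i, u, v, P_U) tuples.
    Returns the edges whose folded delay count is negative, i.e.,
    the edges that make the folding set infeasible without retiming."""
    return [e for e in edges if folded_delays(N, *e) < 0]
```

For example, with N = 2, an edge with i = 1, u = 0, v = 1 and a two-stage operator (P_U = 2) folds with one storage unit, while an edge with i = 0 and the same operator yields a negative count and so must be retimed first (the edge values here are illustrative, not taken from Fig. 18.5).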
The systolic architectures make use of extensive pipelining and local communication and operate in a synchronous manner [Kung, 1988]. The systolic processors can also be made to operate in an asynchronous manner, and such systems are often referred to as wavefront processors. Systolic architectures have been designed for a variety of applications, including convolution, matrix solvers, matrix decomposition, and filtering.

Look-Ahead Technique

The look-ahead technique is a powerful technique for pipelining recursive signal processing algorithms [Parhi and Messerschmitt, 1989]. It can transform a sequential recursive algorithm into an equivalent concurrent one, which can then be realized using pipelining, parallel processing, or both. This technique has been successfully applied to pipeline many signal processing algorithms, including recursive digital filters (in direct form and lattice form), adaptive lattice digital filters, two-dimensional recursive digital filters, Viterbi decoders, Huffman decoders, and finite state machines. This research demonstrated that recursive signal processing algorithms can be operated at high speed. This is an important result, since modern signal processing applications in radar and image processing, and particularly in high-definition and super-high-definition television video signal processing, require very high throughput. Traditional algorithms and topologies cannot be used for such high-speed applications because of the inherent speed bound of the algorithm created by the feedback loops. The look-ahead transformation creates additional concurrency in the signal processing algorithms, and the speed bound of the transformed algorithms is increased substantially. The look-ahead transformation is not free from drawbacks, however: it is accompanied by an increase in hardware overhead. This difficulty has encouraged us to develop inherently pipelinable topologies for recursive signal processing algorithms.
Fortunately, this is possible to achieve in adaptive digital filters using relaxations on the look-ahead or by the use of relaxed look-ahead [Shanbhag and Parhi, 1992]. To begin, consider a time-invariant one-pole recursive digital filter transfer function H z X z U z az ( ) ( ) ( ) = = - - 1 1 1
described by the difference equation

x(n) = ax(n – 1) + u(n)

and shown in Fig. 18.6(a). The maximum achievable speed in this system is limited by the operating speed of one multiply–add operation. To increase the speed of this system by a factor of 2, we can express x(n) in terms of x(n – 2) by substitution of one recursion within the other:

x(n) = a[ax(n – 2) + u(n – 1)] + u(n) = a²x(n – 2) + au(n – 1) + u(n)

FIGURE 18.5 (a) A pipelined/retimed data flow graph obtained from Fig. 18.3(a) by preprocessing for folding; (b) the folded hardware architecture data flow graph. In our folding notation, the tasks are ordered within a set and the ordering represents the time partition in which the task is executed. For example, SA1 = (A2, A1) implies that A2 and A1 are, respectively, executed in even and odd time partitions in the same processor. The notation φ represents a null operation.
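The equivalence of the original recursion and its 2-step look-ahead form can be checked numerically. The sketch below is my own (the coefficient and input values are arbitrary, chosen only for illustration): the look-ahead version depends on x(n – 2) rather than x(n – 1), so its loop can be pipelined by an extra stage, yet it produces the same output sequence.

```python
def filter_original(u, a):
    """Original one-pole recursion x(n) = a*x(n-1) + u(n),
    with zero initial state."""
    x = []
    for n in range(len(u)):
        prev = x[n - 1] if n >= 1 else 0.0
        x.append(a * prev + u[n])
    return x


def filter_look_ahead(u, a):
    """2-step look-ahead form x(n) = a^2*x(n-2) + a*u(n-1) + u(n).
    The loop now spans two delays, enabling pipelining by 2."""
    x = []
    for n in range(len(u)):
        x2 = x[n - 2] if n >= 2 else 0.0
        u1 = u[n - 1] if n >= 1 else 0.0
        x.append(a * a * x2 + a * u1 + u[n])
    return x
```

Substituting the initial conditions shows the two recursions agree for every n, which is exactly what the algebraic substitution above establishes; the price, as noted, is the extra multiply–add terms (the hardware overhead of look-ahead).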