Parhi, K.K., Chassaing, R., Bitler, B. "VLSI for Signal Processing." The Electrical Engineering Handbook, ed. Richard C. Dorf. Boca Raton: CRC Press LLC, 2000.
18
VLSI for Signal Processing

Keshab K. Parhi, University of Minnesota
Rulph Chassaing, Roger Williams University
Bill Bitler, InfiMed

18.1 Special Architectures
Pipelining • Parallel Processing • Retiming • Unfolding • Folding Transformation • Look-Ahead Technique • Associativity Transformation • Distributivity • Arithmetic Processor Architectures • Computer-Aided Design • Future VLSI DSP Systems

18.2 Signal Processing Chips and Applications
DSP Processors • Fixed-Point TMS320C25-Based Development System • Implementation of a Finite Impulse Response Filter with the TMS320C25 • Floating-Point TMS320C30-Based Development System • EVM Tools • Implementation of a Finite Impulse Response Filter with the TMS320C30 • FIR and IIR Implementation Using C and Assembly Code • Real-Time Applications • Conclusions and Future Directions

18.1 Special Architectures

Keshab K. Parhi

Digital signal processing (DSP) is used in numerous applications, including telephony, mobile radio, satellite communications, speech processing, video and image processing, biomedical applications, radar, and sonar. Real-time implementations of DSP systems require the design of hardware that can match the application sample rate to the hardware processing rate (which is related to the clock rate and the implementation style). Thus, real-time does not always mean high speed. Real-time architectures are capable of processing samples as they are received from the signal source, as opposed to storing them in buffers for later processing as is done in batch processing. Furthermore, real-time architectures operate on an infinite time series (since the number of samples from the signal source is so large that it can be considered infinite). While speech and sonar applications require lower sample rates, radar and video image processing applications require much higher sample rates. The sample rate alone, however, cannot be used to choose the architecture; algorithm complexity is also an important consideration. For example, a very complex and computationally intensive algorithm for a low-sample-rate application and a computationally simple algorithm for a high-sample-rate application may require similar hardware speed and complexity. These ranges of algorithms and applications motivate us to study a wide variety of architecture styles.

Using very large scale integration (VLSI) technology, DSP algorithms can be prototyped in many ways. The options include (1) single or multiprocessor programmable digital signal processors, (2) the use of a core programmable digital signal processor with customized interface logic, (3) semicustom gate-array implementations, and (4) full-custom dedicated hardware implementations. DSP algorithms are implemented on programmable processors by translating the algorithm to the processor assembly code, which can require an extensive amount of time. On the other hand, high-level compilers for DSP can be used to generate the assembly code. Although this is currently feasible, the code generated by a compiler is not as efficient as hand-optimized code; the design of DSP compilers that generate efficient code is still an active research topic. In the case of
dedicated designs, the challenge lies in a thorough understanding of the DSP algorithms and the theory of architectures. For example, just minimizing the number of multipliers in an algorithm may not lead to a better dedicated design; the area saved by reducing the number of multipliers may be offset by increases in control, routing, and placement costs.

Off-the-shelf programmable digital signal processors can lead to faster prototyping. These prototyped systems can prove very effective for fast simulation of computation-intensive algorithms (such as those encountered in speech recognition, video compression, and seismic signal processing) or for benchmarking and standardization. After standards are determined, it is more useful to implement the algorithms using dedicated circuits.

Design of dedicated circuits is not a simple task. Dedicated circuits provide limited or no programming flexibility. They require less silicon area and consume less power. However, low production volume, high design cost, and long turnaround time are some of the difficulties associated with the design of dedicated systems. Another difficulty is the availability of appropriate computer-aided design (CAD) tools for DSP systems. As time progresses, however, architectural design techniques will be better understood and can be incorporated into CAD tools, thus making the design of dedicated circuits easier. Hierarchical CAD tools can integrate the design at various levels in an automatic and efficient manner. Implementation of standards for signal and image processing using dedicated circuits will lead to higher-volume production. As time progresses, dedicated designs will become more acceptable to DSP customers.

Successful design of dedicated circuits requires careful algorithm and architecture considerations. For example, for a filtering application, different equivalent realizations may possess different levels of concurrency. Thus, some of these realizations may be suitable for a particular application while others may not be able to meet its sample rate requirements. The lower-level architecture may be implemented in a word-serial or word-parallel manner. The arithmetic functional units may be implemented in a bit-serial, digit-serial, or bit-parallel manner. The synthesized architecture may be implemented with a dedicated data path or a shared data path. The architecture may be systolic or nonsystolic.

Algorithm transformations play an important role in the design of dedicated architectures [Parhi, 1989], because the transformed algorithms can be made to operate with better performance (where performance may be measured in terms of speed, area, or power). Examples of these transformations include pipelining, parallel processing, retiming, unfolding, folding, look-ahead, associativity, and distributivity. These transformations and other architectural concepts are described in detail in subsequent sections.

Pipelining

Pipelining can increase the amount of concurrency (the number of activities performed simultaneously) in an algorithm. Pipelining is accomplished by placing latches at appropriate intermediate points in a data flow graph that describes the algorithm; each latch refers to a storage unit, buffer, or register. The latches can be placed at feed-forward cutsets of the data flow graph. In synchronous hardware implementations, pipelining can increase the clock rate of the system (and therefore the sample rate).
The drawbacks associated with pipelining are the increase in system latency and the increase in the number of registers. To illustrate the speed increase obtained by pipelining, consider the second-order three-tap finite impulse response (FIR) filter shown in Fig. 18.1(a). The signal x(n) in this system can be sampled at a rate limited by the throughput of one multiplication and two additions. For simplicity, if we assume the multiplication time to be twice the addition time (Tadd), the effective sample or clock rate of this system is 1/(4Tadd). By placing latches at the cutset shown by the dashed line in Fig. 18.1(b), the sample rate can be improved to the rate of one multiplication or two additions, i.e., 1/(2Tadd). While pipelining can easily be applied to any algorithm without feedback loops by appropriate placement of latches, it cannot easily be applied to algorithms with feedback loops, because the cutsets in such algorithms contain both feed-forward and feedback data flow and therefore cannot be considered feed-forward cutsets.

FIGURE 18.1 (a) A three-tap second-order nonrecursive digital filter; (b) the equivalent pipelined digital filter obtained by placing storage units at the intersection of the signal wires and the feed-forward cutset. If the multiplication and addition operations require 2 and 1 unit of time, respectively, then the maximum achievable sampling rates for the original and the pipelined architectures are 1/4 and 1/2 units, respectively.
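To make the cutset argument concrete, the following minimal behavioral sketch in Python (ours, not from the handbook; the coefficient names a, b, c and the test input are arbitrary assumptions) models the filter of Fig. 18.1 clock by clock, with latches placed after the multipliers as in Fig. 18.1(b). It confirms that pipelining leaves the output sequence unchanged apart from one added cycle of latency.

def fir_direct(x, a, b, c):
    # Non-pipelined filter y(n) = a*x(n) + b*x(n-1) + c*x(n-2): each output
    # needs one multiply plus two adds, so the critical path is
    # Tmult + 2*Tadd = 4*Tadd when Tmult = 2*Tadd.
    out = []
    for n in range(len(x)):
        x1 = x[n - 1] if n >= 1 else 0
        x2 = x[n - 2] if n >= 2 else 0
        out.append(a * x[n] + b * x1 + c * x2)
    return out

def fir_pipelined(x, a, b, c):
    # Latches after the multipliers (a feed-forward cutset) cut the critical
    # path to max(Tmult, 2*Tadd) = 2*Tadd, at the cost of one cycle of latency.
    p0 = p1 = p2 = 0              # pipeline latches, one after each multiplier
    d1 = d2 = 0                   # input delay line holding x(n-1), x(n-2)
    out = []
    for xn in list(x) + [0]:      # one extra cycle to flush the latency
        out.append(p0 + p1 + p2)  # adder stage reads the latched products
        p0, p1, p2 = a * xn, b * d1, c * d2   # multiplier stage loads latches
        d2, d1 = d1, xn
    return out[1:]                # drop the startup cycle (latency = 1)

x = [1, 2, 3, 4, 5]
assert fir_direct(x, 2, -1, 3) == fir_pipelined(x, 2, -1, 3)

The assertion passes because the cutset latches only re-time when each product reaches the adders; they never change which products are combined, which is why cutset pipelining preserves the input-output behavior.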
Pipelining can also be used to improve performance in software-programmable multiprocessor systems. Most software-programmable DSP processors are programmed using assembly code, which is generated by high-level compilers that perform scheduling. Schedulers typically use the acyclic precedence graph to construct schedules; removing all edges of the signal (or data) flow graph that contain delay elements converts the signal flow graph to an acyclic precedence graph. By placing latches to pipeline a data flow graph, we can alter the acyclic precedence graph. In particular, the critical path of the acyclic precedence graph can be reduced. The new precedence graph can be used to construct schedules with lower iteration periods (although this may often require an increase in the number of processors).
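The following Python sketch (ours; the node names and unit compute times are assumptions mirroring the filter of Fig. 18.1, with multiplies taking 2 units and adds 1) drops every delayed edge from a small signal flow graph and computes the critical path of the resulting acyclic precedence graph, before and after latches are placed on a feed-forward cutset.

# Signal flow graph as (source, dest, delay-count) triples: three multipliers
# M0..M2 feeding an adder chain A1 -> A2, as in the three-tap FIR filter.
sfg_edges = [("M0", "A1", 0), ("M1", "A1", 0), ("A1", "A2", 0), ("M2", "A2", 0)]
node_time = {"M0": 2, "M1": 2, "M2": 2, "A1": 1, "A2": 1}

def critical_path(edges, node_time):
    # Keep only the delay-free edges (this is the acyclic precedence graph),
    # then take the longest compute-time path via memoized recursion.
    apg = [(u, v) for (u, v, d) in edges if d == 0]
    finish = {}
    def longest(v):
        if v not in finish:
            preds = [u for (u, w) in apg if w == v]
            finish[v] = node_time[v] + max((longest(u) for u in preds), default=0)
        return finish[v]
    return max(longest(v) for v in node_time)

print(critical_path(sfg_edges, node_time))   # 4 units (M0 -> A1 -> A2)

# One latch on each cutset edge breaks the long delay-free chain:
pipelined = [("M0", "A1", 1), ("M1", "A1", 1), ("A1", "A2", 0), ("M2", "A2", 1)]
print(critical_path(pipelined, node_time))   # 2 units (A1 -> A2)

The critical path drops from 4 units to 2, matching the sampling-rate improvement discussed above; a scheduler working from the second precedence graph can therefore construct schedules with a lower iteration period.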
Pipelining of algorithms can increase the sample rate of the system. Sometimes, for a constant sample rate, pipelining can also reduce the power consumed by the system, because the data paths in the pipelined system can be charged or discharged with a lower supply voltage. Since the capacitance remains almost constant, the power is reduced. Achieving low power can be important in many battery-powered applications [Chandrakasan et al., 1992].

Parallel Processing

Parallel processing is related to pipelining but requires replication of hardware units. Pipelining exploits concurrency by breaking a large task into multiple smaller tasks and by separating these smaller tasks with storage units. Parallelism, on the other hand, exploits concurrency by performing multiple larger tasks simultaneously in separate hardware units.

To illustrate the speed increase due to parallelism, consider the parallel implementation of the second-order three-tap FIR filter of Fig. 18.1(a) shown in Fig. 18.2. In this architecture, two input samples are processed and two output samples are generated in each clock cycle period of four addition times. Because each clock cycle processes two samples, the effective sample rate is 1/(2Tadd), the same as that of Fig. 18.1(b). The parallel architecture achieves this speed increase with significant hardware overhead: the entire data flow graph must be replicated as the level of parallelism increases. Thus, pipelining is generally preferable to parallelism. However, parallelism may be useful if pipelining alone cannot meet the speed demand of the application, or if technology constraints (such as limitations on the clock rate imposed by the I/O technology) limit the use of pipelining. Pipelining and parallelism can also be combined in obvious ways.

FIGURE 18.2 Twofold parallel realization of the three-tap filter of Fig. 18.1(a).

Parallelism, like pipelining, can also lead to power reduction, but with significant overhead in hardware requirements. Achieving pipelining and parallelism can be difficult for systems with feedback loops; concurrency may be created in these systems by using the look-ahead transformation.
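In behavioral terms, the two-parallel filter of Fig. 18.2 computes the even and odd output samples y(2k) = a*x(2k) + b*x(2k-1) + c*x(2k-2) and y(2k+1) = a*x(2k+1) + b*x(2k) + c*x(2k-1) together in one clock cycle. The following Python sketch (ours, not the handbook's figure; the coefficients and test input are arbitrary assumptions) models this block processing and checks it against the sample-by-sample filter.

def fir_two_parallel(x, a, b, c):
    if len(x) % 2:
        x = list(x) + [0]          # pad to an even number of samples
    xm1 = xm2 = 0                  # block delay line holding x(2k-1), x(2k-2)
    out = []
    for k in range(len(x) // 2):
        x0, x1 = x[2 * k], x[2 * k + 1]          # both samples arrive together
        # Two replicated copies of the data flow graph work side by side:
        out.append(a * x0 + b * xm1 + c * xm2)   # y(2k)
        out.append(a * x1 + b * x0 + c * xm1)    # y(2k+1)
        xm2, xm1 = x0, x1          # delays update once per (two-sample) clock
    return out

x = [1, 2, 3, 4, 5, 6]
ref = [2 * v - w + 3 * u for v, w, u in zip(x, [0] + x[:-1], [0, 0] + x[:-2])]
assert fir_two_parallel(x, 2, -1, 3) == ref

Each loop iteration corresponds to one hardware clock period of four addition times, but it delivers two output samples, so the effective sample rate doubles; the cost is the duplicated multiply-add structure, which is the hardware overhead noted above.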
Retiming

Retiming is similar to pipelining and yet different in some ways [Leiserson et al., 1983]. Retiming is the process of moving the delays around in the data flow graph. The simplest example of retiming is the removal of one delay from all input edges of a node and the insertion of one delay on each outgoing edge of that node. Unlike pipelining, retiming does not increase the latency of the system. However, retiming alters the number of delay elements in the system. Retiming can reduce the critical path of the data flow graph. As a result, it can reduce the clock period in hardware implementations, or the critical path of the acyclic precedence graph and hence the iteration period in programmable software system implementations.

The single-host formulation of the retiming transformation preserves the latency of the algorithm. The retiming formulation with no constraints on latency (i.e., with separate input and output hosts) can also achieve pipelining with no retiming or pipelining with retiming. Pipelining with retiming is the most desirable transformation in DSP architecture design. Pipelining with retiming can be interpreted as retiming of the original algorithm with a large number of delays at the input edges. Thus, we can increase the system latency arbitrarily and remove the appropriate number of delays from the inputs after the transformation.

The retiming formulation assigns a retiming variable r(·) to each node in the data flow graph. If i(U → V) is the number of delays associated with the edge U → V in the original data flow graph, and r(V) and r(U), respectively, represent the retiming variable values of the nodes V and U, then the number of delays associated with the edge U → V in the retimed data flow graph is given by

    ir(U → V) = i(U → V) + r(V) - r(U)

For the data flow graph to be realizable, ir(U → V) ≥ 0 must be satisfied for every edge. The retiming transformation formulates the problem by calculating path lengths and by imposing constraints on certain path lengths. These constraints are solved as a shortest-path problem.

To illustrate the usefulness of retiming, consider the data flow graph of a two-stage pipelined lattice digital filter shown in Fig. 18.3(a) and its equivalent pipelined-retimed data flow graph shown in Fig. 18.3(b). If the multiply time is two units and the add time is one unit, the architecture in Fig. 18.3(a) can be clocked with a period of 10 units, whereas the architecture in Fig. 18.3(b) can be clocked with a period of 2 units.
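The following Python sketch (ours; the three-node loop, its delay counts, and the single timing constraint are hypothetical stand-ins, not the lattice filter of Fig. 18.3) applies the retiming equation and shows the shortest-path view: every feasibility or timing constraint has the difference form r(U) - r(V) <= w, so Bellman-Ford over the constraint graph either yields legal retiming values or reports infeasibility.

# Edge -> delay count i(U -> V) for a small hypothetical three-node loop.
delays = {("A", "B"): 1, ("B", "C"): 0, ("C", "A"): 1}

def retime(delays, r):
    # Apply ir(U -> V) = i(U -> V) + r(V) - r(U) and check realizability.
    retimed = {(u, v): i + r[v] - r[u] for (u, v), i in delays.items()}
    assert all(i >= 0 for i in retimed.values()), "infeasible retiming"
    return retimed

def solve_constraints(constraints, nodes):
    # Bellman-Ford: each constraint r(u) - r(v) <= w becomes an edge v -> u
    # of weight w; shortest distances from an implicit zero-weight source
    # satisfy the whole system, and a relaxation on the |V|-th pass exposes
    # a negative cycle (an infeasible constraint set).
    r = {v: 0 for v in nodes}
    changed = True
    for _ in range(len(nodes)):
        changed = False
        for (u, v, w) in constraints:
            if r[v] + w < r[u]:
                r[u], changed = r[v] + w, True
    assert not changed, "negative cycle: constraints are infeasible"
    return r

# Realizability constraints r(U) - r(V) <= i(U -> V) from every edge, plus one
# timing constraint r(B) - r(C) <= -1 that forces a delay onto edge B -> C
# (standing in for the path-length constraints mentioned in the text):
cons = [(u, v, i) for (u, v), i in delays.items()] + [("B", "C", -1)]
r = solve_constraints(cons, {"A", "B", "C"})
print(r)                    # e.g., {'A': 0, 'B': -1, 'C': 0}
print(retime(delays, r))    # A->B: 0, B->C: 1, C->A: 1

Note that the total delay around the loop (two in this example) is preserved by any retiming, since the r terms cancel around a cycle: retiming only redistributes delays, which is why it can shorten the critical path without altering the input-output behavior.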