© 2000 by CRC Press LLC

1990]. VSELP is a forward-adaptive form of CELP where two excitation codebooks are used to reduce the complexity of encoding.
• Other approaches to complexity reduction in CELP coders are related to “sparse” codebook entries which have few nonzero samples per vector and “algebraic” codebooks which are based on integer lattices [Adoul and Lamblin, 1987]. In this case, excitation code vectors can be constructed on an as-needed basis instead of being stored in a table. ITU-T standardization of a CELP algorithm which uses lattice-based excitations has resulted in the 8 kbps G.729 (ACELP) coder.
• U.S. Federal Standard 1016 [National Communications System, 1991] is a 4.8 kbps CELP coder. It has both long- and short-term linear predictors which are forward adaptive, and so the coder has a relatively large delay (100 msec). This coder produces highly intelligible, good-quality speech in a variety of environments and is robust to independent bit errors.

Below about 4 kbps, the subjective quality of CELP coders is inferior to other architectures. Much research in variable-rate CELP implementations has resulted in alternative coder architectures which adjust their coding rates based on a number of channel conditions or sophisticated, speech-specific cues such as phonetic segmentation [Wang and Gersho, 1989; Paksoy et al., 1993]. Notably, most variable-rate CELP coders are implementations of finite-state CELP wherein a vector of speech cues controls the evolution of a state-machine to prescribe mode-dependent bit allocations for coder parameters. With these architectures, excellent speech quality at average rates below 2 kbps has been reported.

MBE

The MBE coder [Hardwick and Lim, 1991] is an efficient frequency-domain architecture partially based on the concepts of sinusoidal transform coding (STC) [McAulay and Quatieri, 1986]. In MBE, the instantaneous spectral envelope is represented explicitly by harmonic estimates in several subbands.
The performance of MBE coders at rates below 4 kbps is generally “better” than that of CELP-based schemes.

An MBE coder decomposes the instantaneous speech spectrum into subbands centered at harmonics of the fundamental glottal excitation (pitch). The spectral envelope of the signal is approximated by samples taken at pitch harmonics, and these harmonic amplitudes are compared to adaptive thresholds (which may be determined via analysis-by-synthesis) to determine subbands of high spectral activity. Subbands that are determined to be “voiced” are labeled, and their energies and phases are encoded for transmission. Subbands having relatively low spectral activity are declared “unvoiced”. These segments are approximated by an appropriately filtered segment of white noise, or a locally dense collection of sinusoids with random phase. Careful tracking of the evolution of individual spectral peaks and phases in successive frames is critical in the implementation of MBE-style coders.

An efficient implementation of an MBE coder was adopted for the International Maritime Satellite (INMARSAT) voice processor, and is known as Improved-MBE, or IMBE [Hardwick and Lim, 1991]. This coder optimizes several components of the general MBE architecture, including grouping neighboring harmonics for subband voicing decisions, using non-integer pitch resolution for higher speaker fidelity, and differentially encoding the log-amplitudes of voiced harmonics using a DCT-based scheme. The IMBE coder requires high delay (about 80 msec), but produces very good quality encoded speech.

MELP

The MELP coder [McCree and Barnwell, 1995] is based on the traditional LPC vocoder model where an LPC synthesis filter is excited by an impulse train (voiced speech) or white noise (unvoiced speech). The MELP excitation, however, has characteristics that are more similar to natural human speech. In particular, the MELP excitation can be a mixture of (possibly aperiodic) pulses with bandlimited noise.
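The idea of mixing pulses with noise can be illustrated with a rough, single-band sketch. This is not the MELP algorithm itself: it omits the per-band voicing strengths, the Fourier-series pulse shaping, and the LPC synthesis filter, and the function name, the `voicing` weight, and the `jitter` parameter are assumptions made purely for illustration.

```python
import math
import random


def mixed_excitation(n_samples, pitch_period, voicing, jitter=0.0, seed=0):
    """Return a list of n_samples mixing a pulse train with white noise.

    voicing in [0, 1] weights the pulse train against the noise.
    jitter in [0, 1) randomly perturbs each pulse spacing (aperiodic pulses).
    All parameters are illustrative, not values from any standard.
    """
    rng = random.Random(seed)
    amplitude = math.sqrt(pitch_period)  # gives the pulse train roughly unit power
    pulses = [0.0] * n_samples
    pos = 0.0
    while pos < n_samples:
        pulses[int(pos)] = amplitude
        step = pitch_period * (1.0 + jitter * rng.uniform(-1.0, 1.0))
        pos += max(1.0, step)  # always advance by at least one sample
    noise = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    return [voicing * p + (1.0 - voicing) * w for p, w in zip(pulses, noise)]
```

With `voicing=1.0` and `jitter=0.0` the output reduces to the impulse train of the traditional LPC vocoder; with `voicing=0.0` it reduces to white noise. Intermediate values and nonzero jitter give the mixed, possibly aperiodic excitation the text describes.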
In MELP, the excitation spectrum is explicitly modeled using Fourier series coefficients and bandpass voicing strengths, and the time-domain excitation sequence is produced from the spectral model via an inverse transform. The synthetic excitation sequence is then used to drive an LPC synthesizer which introduces formant spectral shaping.

Common Threads

In addition to the use of analysis-by-synthesis techniques and/or LPC modeling, a common thread between low-rate, forward adaptive CELP, MBE, and MELP coders is the dependence on an estimate of the fundamental glottal frequency, or pitch period. CELP coders typically employ a pitch or long-term predictor to characterize
the glottal excitation. MBE coders estimate the fundamental frequency and use this estimate to focus subband decompositions on harmonics. MELP coders perform pitch-synchronous excitation modeling. Overall coder performance is enhanced in the CELP and MBE coders with the use of sub-integer lags [Kroon and Atal, 1991]. This is equivalent to performing pitch estimation using a signal sampled at a higher sampling rate to improve the precision of the spectral estimate. Highly precise glottal frequency estimation improves the “naturalness” of coded speech at the expense of increased computational complexity, and in some cases increased bit rate.

Accurate characterization of pitch and LPC parameters can also be used to good advantage in postfiltering to reduce apparent quantization noise. These filters, usually derived from forward-adapted filter coefficients transmitted to the receiver as side-information, perform post-processing on the reconstructed speech which reduces perceptually annoying noise components [Chen and Gersho, 1995].

Speech Quality and Intelligibility

To compare the performance of two speech coders, it is necessary to have some indicator of the intelligibility and quality of the speech produced by each coder. The term intelligibility usually refers to whether the output speech is easily understandable, while the term quality is an indicator of how natural the speech sounds. It is possible for a coder to produce highly intelligible speech that is low quality in that the speech may sound very machine-like and the speaker is not identifiable. On the other hand, it is unlikely that unintelligible speech would be called high quality, but there are situations in which perceptually pleasing speech does not have high intelligibility. We briefly discuss here the most common measures of intelligibility and quality used in formal tests of speech coders.

DRT

The diagnostic rhyme test (DRT) was devised by Voiers [1977] to test the intelligibility of coders known to produce speech of lower quality. Rhyme tests are so named because the listener must determine which consonant was spoken when presented with a pair of rhyming words; that is, the listener is asked to distinguish between word pairs such as meat-beat, pool-tool, saw-thaw, and caught-taught. Each pair of words differs on only one of six phonemic attributes: voicing, nasality, sustention, sibilation, graveness, and compactness. Specifically, the listener is presented with one spoken word from a pair and asked to decide which word was spoken. The final DRT score is the percentage of responses computed according to $P = \frac{R - W}{T} \times 100$, where R is the number correctly chosen, W is the number of incorrect choices, and T is the total of word pairs tested. Usually, $75 \le \mathrm{DRT} \le 95$, with a good system scoring about 90 [Papamichalis, 1987].

MOS

The Mean Opinion Score (MOS) is an often-used performance measure [Jayant and Noll, 1984]. To establish a MOS for a coder, listeners are asked to classify the quality of the encoded speech in one of five categories: excellent (5), good (4), fair (3), poor (2), or bad (1). Alternatively, the listeners may be asked to classify the coded speech according to the amount of perceptible distortion present, i.e., imperceptible (5), barely perceptible but not annoying (4), perceptible and annoying (3), annoying but not objectionable (2), or very annoying and objectionable (1). The numbers in parentheses are used to assign a numerical value to the subjective evaluations, and the numerical ratings of all listeners are averaged to produce a MOS for the coder. A MOS between 4.0 and 4.5 usually indicates high quality.

It is important to compute the variance of MOS values. A large variance, which indicates an unreliable test, can occur because participants do not know what categories such as good and bad mean.
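
As a minimal sketch of the two computations just described, the DRT score can be taken as $P = (R - W)/T \times 100$ and the MOS as the mean of the listener ratings, reported together with their variance. The function names are illustrative and not from the source, and the use of the sample variance (rather than the population variance) is an assumption.

```python
from statistics import mean, variance


def drt_score(correct: int, incorrect: int, total: int) -> float:
    """DRT score P = (R - W) / T * 100 for R correct and W incorrect out of T pairs."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * (correct - incorrect) / total


def mos_summary(ratings: list[int]) -> tuple[float, float]:
    """Return (MOS, sample variance) for listener ratings on the 1-to-5 scale."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to compute a variance")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return float(mean(ratings)), float(variance(ratings))
```

For example, `drt_score(95, 5, 100)` gives 90.0, and `mos_summary([4, 4, 5, 3, 4])` gives a MOS of 4.0 with a variance of 0.5.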
It is sometimes useful to present examples of good and bad speech to the listeners before the test to calibrate the 5-point scale [Papamichalis, 1987]. The MOS values for a variety of speech coders and noise conditions are given in [Daumer, 1982].

DAM

The diagnostic acceptability measure (DAM) developed by Dynastat [Voiers, 1977] is an attempt to make the measurement of speech quality more systematic. For the DAM, it is critical that the listener crews be highly trained and repeatedly calibrated in order to get meaningful results. The listeners are each presented with encoded sentences taken from the Harvard 1965 list of phonetically balanced sentences, such as “Cats and dogs
each hate the other” and “The pipe began to rust while new”. The listener is asked to assign a number between 0 and 100 to characteristics in three classifications: signal qualities, background qualities, and total effect. The ratings of each characteristic are weighted and used in a multiple nonlinear regression. Finally, adjustments are made to compensate for listener performance. A typical DAM score is 45 to 55%, with 50% corresponding to a good system [Papamichalis, 1987].

The perception of “good quality” speech is a highly individual and subjective area. As such, no single performance measure has gained wide acceptance as an indicator of the quality and intelligibility of speech produced by a coder. Further, there is no substitute for subjective listening tests under the actual environmental conditions expected in a particular application. As a rough guide to the performance of some of the coders discussed here, we present the DRT, DAM, and MOS values in Table 15.1, which is adapted from [Spanias, 1994; Jayant, 1990]. From the table, it is evident that at 8 kbits/s and above, performance is quite good and that the 4.8 kbits/s CELP has substantially better performance than LPC-10e.

Standardization

The presence of international, national, and regional speech coding standards ensures the interoperability of coders among various implementations. As noted previously, several standard algorithms exist among the classes of speech coders. The ITU-T (formerly CCITT) has historically been a dominant factor in international standardization of speech coders, such as G.711, G.721, G.728, G.729, etc.
Additionally, the emergence of digital cellular telephony, personal communications networks, and multimedia communications has driven the formulation of various national or regional standard algorithms such as the GSM full and half-rate standards for European digital cellular, the CTIA full-rate TDMA and CDMA algorithms and their half-rate counterparts for U.S. digital cellular, full and half-rate Pitch-Synchronous CELP [Miki et al., 1993] for Japanese cellular, as well as speech coders for particular applications [ITU-TS, 1991].

The standardization efforts of the U.S. Federal Government for secure voice channels and military applications have had a historically significant impact on the evolution of speech coder technology. In particular, the recent re-standardization of the DoD 2400 bits/s vocoder algorithm has produced some competing algorithms worthy of mention here. Of the classes of speech coders represented among the algorithms competing to replace LPC-10, several implementations utilized STC or MBE architectures, some used CELP architectures, and others were novel combinations of multiband-excitation with LPC modeling [McCree and Barnwell, 1995] or pitch-synchronous prototype waveform interpolation techniques [Kleijn, 1991].

The final results of the U.S. DoD standard competition are summarized in Table 15.2 for “quiet” and “office” environments. In the table, the column labeled “FOM” is the overall Figure of Merit used by the DoD Digital Voice Processing Consortium in selecting the coder. The FOM is a unitless combination of complexity and performance components, and is measured with respect to FS-1016. The complexity of a coder is a weighted combination of memory and processing power required. The performance of a coder is a weighted combination of four factors: quality (Q, measured via MOS), intelligibility (I, measured via DRT), speaker recognition (R), and communicability (C).

TABLE 15.1 Speech Coder Performance Comparisons

Algorithm    Standardization               Rate      Subjective
(acronym)    Body             Identifier   kbits/s   MOS     DRT     DAM
μ-law PCM    ITU-T            G.711        64        4.3     95      73
ADPCM        ITU-T            G.721        32        4.1     94      68
LD-CELP      ITU-T            G.728        16        4.0     94a     70a
RPE-LTP      GSM              GSM          13        3.5     -       -
VSELP        CTIA             IS-54        8         3.5     -       -
CELP         U.S. DoD         FS-1016      4.8       3.13b   90.7b   65.4b
IMBE         Inmarsat         IMBE         4.1       3.4     -       -
LPC-10e      U.S. DoD         FS-1015      2.4       2.24b   86.2b   50.3b

a Estimated.
b From results of 1996 U.S. DoD 2400 bits/s vocoder competition.

Recognizability and communicability for each coder were measured by tests
comparing processed vs. unprocessed data, and effectiveness of communication in application-specific cooperative tasks [Schmidt-Nielsen and Brock, 1996; Kreamer and Tardelli, 1996]. The MOS and DRT scores were measured in a variety of common DoD environments. Each of the four “finalist” coders ranked first in one of the four categories examined (Q, I, R, C), as noted in the table.

The results of the standardization process were announced in April 1996. As indicated in Table 15.2, the new 2400 bits/s Federal Standard vocoder replacing LPC-10e is a version of the Mixed Excitation Linear Prediction (MELP) coder which uses several specific enhancements to the basic MELP architecture. These enhancements include multi-stage VQ of the formant parameters based on frequency-weighted bark-scale spectral distortion, direct VQ of the first 10 Fourier coefficients of the excitation using bark-weighted distortion, and a gain coding technique which is robust to channel errors [McCree et al., 1996].

Variable Rate Coding

Previous standardization efforts and discussion here have centered on fixed-rate coding of speech where a fixed number of bits are used to represent speech in digital form per unit of time. However, with recent developments in transmission architectures (such as CDMA), the implementation of variable-rate speech coding algorithms has become feasible. In variable-rate coding, the average data rate for conversational speech can be reduced by a factor of at least 2.

A variable-rate speech coding algorithm has been standardized by the CTIA for wideband (CDMA) digital mobile cellular telephony under IS-95. The algorithm, QCELP [Gardner et al., 1993], is the first practical variable-rate speech coder to be incorporated in a digital cellular system. QCELP is a multi-mode, CELP-type analysis-by-synthesis coder which uses blockwise spectral energy measurements and a finite-state machine to switch among four configurations.
Each configuration has a fixed rate of 1, 2, 4, or 8 kbits/s with a predetermined allocation of bits among coder parameters (coefficients, gains, excitation, etc.). The subjective performance of QCELP in the presence of low background noise is quite good since the bit allocations per mode and the mode-switching logic are well-suited to high-quality speech. In fact, QCELP at an average rate of 4 kbits/s has been judged to be MOS-equivalent to VSELP, its 8 kbits/s, fixed-rate cellular counterpart. A time-averaged encoding rate of 4 to 5 kbits/s is not uncommon for QCELP; however, the average rate tends toward the 8 kbits/s maximum in the presence of moderate ambient noise. The topic of robust fixed-rate and variable-rate speech coding in the presence of significant background noise remains an open problem.

Much recent research in speech coding below 8 kbits/s has focused on multi-mode CELP architectures and efficient approaches to source-controlled mode selection [Das et al., 1995]. Multimode coders are able to quickly invoke a coding scheme and bit allocation specifically tailored to the local characteristics of the speech signal. This capability has proven useful in optimizing perceptual quality at low coding rates. In fact, the majority of algorithms currently proposed for half-rate European and U.S. digital cellular standards, as well as many algorithms considered for rates below 2.4 kbits/s, are multimode coders. The direct coupling between variable-rate (multimode) speech coding and the CDMA transmission architecture is an obvious benefit to both technologies.

TABLE 15.2 Speech Coder Performance Comparisons Taken from Results of 1996 U.S. DoD 2400 bits/s Vocoder Competition

Algorithm                             Quiet                   Office
(acronym)    FOM     Rank   Best   MOS    DRT    DAM     MOS    DRT    DAM
MELP         2.616   1      I      3.30   92.3   64.5    2.96   91.2   52.7
PWI          2.347   2      Q      3.28   90.5   70.0    2.88   88.4   55.5
STC          2.026   3      R      3.08   89.9   63.8    2.82   91.5   54.1
IMBE         2.991   *      C      2.89   91.4   62.3    2.71   91.1   52.4
CELP         0.0     N/A    -      3.13   90.7   65.4    2.89   89.0   56.1
LPC-10e      -9.19   N/A    -      2.24   86.2   50.3    2.09   85.2   48.4

* Ineligible due to failure to meet the minimum quality (MOS) requirements (better than CELP) in both quiet and office environments.
Summary and Conclusions

The availability of general-purpose and application-specific digital signal processing chips and the ever-widening interest in digital communications have led to an increasing demand for speech coders. The worldwide desire to establish standards in a host of applications is a primary driving force for speech coder research and development. The speech coders that are available today for operation at 16 kbits/s and below are conceptually quite exotic compared with products available less than 10 years ago. The re-standardization of U.S. Federal Standard 1015 (LPC-10) at 2.4 kbits/s with performance constraints similar to those of FS-1016 at 4.8 kbits/s is an indicator of the rapid evolution of speech coding paradigms and VLSI architectures.

Other standards to be established in the near term include the European (GSM) and U.S. (CTIA) half-rate speech coders for digital cellular mobile radio. For the longer term, the specification of standards for forthcoming mobile personal communications networks will be a primary focus in the next 5 to 10 years.

In the preface to their book, Jayant and Noll [1984] state that “our understanding of speech and image coding has now reached a very mature point …” As of 1997, this statement rings truer than ever. The field is a dynamic one, however, and the wide range of commercial applications demands continual progress.

Defining Terms

Analysis-by-synthesis: Constructing several versions of a waveform and choosing the best match.
Predictive coding: Coding of time-domain waveforms based on a (usually) linear prediction model.
Frequency domain coding: Coding of frequency-domain characteristics based on a discrete time-frequency transform.
Hybrid coders: Coders that fall between waveform coders and vocoders in how they select the excitation.
Standard: An encoding technique adopted by an industry to be used in a particular application.
Mean Opinion Score (MOS): A popular method for classifying the quality of encoded speech based on a five-point scale.
Variable-rate coders: Coders that output different numbers of bits based on the time-varying characteristics of the source.

Related Topics

17.1 Digital Image Processing • 21.4 Example 3: Multirate Signal Processing

References

A. S. Spanias, “Speech coding: A tutorial review,” Proc. IEEE, 82, 1541–1575, October 1994.
A. Gersho, “Advances in speech and audio compression,” Proc. IEEE, 82, June 1994.
W. B. Kleijn and K. K. Paliwal, Eds., Speech Coding and Synthesis, Amsterdam, Holland: Elsevier, 1995.
CCITT, “32-kbit/s adaptive differential pulse code modulation (ADPCM),” Red Book, III.3, 125–159, 1984.
National Communications System, Office of Technology and Standards, Federal Standard 1015: Analog to Digital Conversion of Voice by 2400 bit/second Linear Predictive Coding, 1984.
J.-H. Chen, “High-quality 16 kb/s speech coding with a one-way delay less than 2 ms,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Albuquerque, NM, pp. 453–456, April 1990.
National Communications System, Office of Technology and Standards, Federal Standard 1016: Telecommunications: Analog to Digital Conversion of Radio Voice by 4800 bit/second Code Excited Linear Prediction (CELP), 1991.
J. Gibson, “Adaptive prediction for speech encoding,” IEEE ASSP Magazine, 1, 12–26, July 1984.
J. D. Johnston, “A filter family designed for use in quadrature mirror filter banks,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Denver, CO, pp. 291–294, April 1980.
B. Atal and M. Schroeder, “Predictive coding of speech signals and subjective error criteria,” IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 247–254, June 1979.
I. Gerson and M. Jasiuk, “Vector sum excited linear prediction (VSELP) speech coding at 8 kb/s,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Albuquerque, NM, pp. 461–464, April 1990.