© 2000 by CRC Press LLC

J.-P. Adoul and C. Lamblin, “A comparison of some algebraic structures for CELP coding of speech,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Dallas, TX, pp. 1953–1956, April 1987.
S. Wang and A. Gersho, “Phonetically-based vector excitation coding of speech at 3.6 kbps,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Glasgow, Scotland, pp. 49–52, May 1989.
E. Paksoy, K. Srinivasan, and A. Gersho, “Variable rate speech coding with phonetic segmentation,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Minneapolis, MN, pp. II.155–II.158, April 1993.
J. Hardwick and J. Lim, “The application of the IMBE speech coder to mobile communications,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 249–252, May 1991.
R. McAulay and T. Quatieri, “Speech analysis/synthesis based on a sinusoidal representation,” IEEE Trans. Acoust., Speech, Signal Processing, 34, 744–754, August 1986.
A. McCree and T. Barnwell, “A mixed excitation LPC vocoder model for low bit rate speech coding,” IEEE Trans. Speech Audio Processing, 3, 242–250, July 1995.
P. Kroon and B. S. Atal, “On improving the performance of pitch predictors in speech coding systems,” in Advances in Speech Coding, B. S. Atal, V. Cuperman, and A. Gersho, Eds., Boston, Mass.: Kluwer, 1991, pp. 321–327.
J.-H. Chen and A. Gersho, “Adaptive postfiltering for quality enhancement of coded speech,” IEEE Trans. Speech and Audio Processing, 3, 59–71, January 1995.
W. Voiers, “Diagnostic evaluation of speech intelligibility,” in Speech Intelligibility and Recognition, M. Hawley, Ed., Stroudsburg, Pa.: Dowden, Hutchinson, and Ross, 1977.
P. Papamichalis, Practical Approaches to Speech Coding, Englewood Cliffs, N.J.: Prentice-Hall, 1987.
N. S. Jayant and P. Noll, Digital Coding of Waveforms, Englewood Cliffs, N.J.: Prentice-Hall, 1984.
W. Daumer, “Subjective evaluation of several different speech coders,” IEEE Trans. Commun., COM-30, 655–662, April 1982.
W. Voiers, “Diagnostic acceptability measure for speech communications systems,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 204–207, 1977.
N. Jayant, “High-quality coding of telephone speech and wideband audio,” IEEE Communications Magazine, 28, 10–20, January 1990.
S. Miki, K. Mano, H. Ohmuro, and T. Moriya, “Pitch synchronous innovation CELP (PSI-CELP),” Proc. European Conf. Speech Comm. Technol., Berlin, Germany, pp. 261–264, September 1993.
ITU-TS Study Group XV, Draft recommendation AV.25Y—Dual Rate Speech Coder for Multimedia Telecommunication Transmitting at 5.3 & 6.4 kbit/s, December 1991.
W. Kleijn, “Continuous representations in linear predictive coding,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 201–204, 1991.
A. Schmidt-Nielsen and D. Brock, “Speaker recognizability testing for voice coders,” Proc. IEEE Int. Conf. Acoust., Speech, and Signal Processing, pp. 1149–1152, April 1996.
E. Kreamer and J. Tardelli, “Communicability testing for voice coders,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1153–1156, April 1996.
A. McCree, K. Truong, E. George, T. Barnwell, and V. Viswanathan, “A 2.4 kbit/s MELP coder candidate for the new U.S. Federal Standard,” Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 200–203, April 1996.
W. Gardner, P. Jacobs, and C. Lee, “QCELP: A variable rate speech coder for CDMA digital cellular,” in Speech and Audio Coding for Wireless Networks, B. S. Atal, V. Cuperman, and A. Gersho, Eds., Boston, Mass.: Kluwer, 1993, pp. 85–92.
A. Das, E. Paksoy, and A. Gersho, “Multimode and variable-rate coding of speech,” in Speech Coding and Synthesis, W. B. Kleijn and K. K. Paliwal, Eds., Amsterdam: Elsevier, 1995, pp. 257–288.

Further Information

For further information on the state of the art in speech coding, see the articles by Spanias [1994] and Gersho [1994], and the book Speech Coding and Synthesis by Kleijn and Paliwal [1995].
15.2 Speech Enhancement and Noise Reduction

Yariv Ephraim

Voice communication systems are susceptible to interfering signals normally referred to as noise. The interfering signals may have harmful effects on the performance of any speech communication system. These effects depend on the specific system being used, on the nature of the noise and the way it interacts with the clean signal, and on the relative intensity of the noise compared to that of the signal. The latter is usually measured by the signal-to-noise ratio (SNR), which is the ratio of the power of the signal to the power of the noise.

The speech communication system may simply be a recording which was performed in a noisy environment, a standard digital or analog communication system, or a speech recognition system for human-machine communication. The noise may be present at the input of the communication system, in the channel, or at the receiving end. The noise may be correlated or uncorrelated with the signal. It may accompany the clean signal in an additive, multiplicative, or any other more general manner. Examples of noise sources include competitive speech; background sounds like music, a fan, machines, door slamming, wind, and traffic; room reverberation; and white Gaussian channel noise.

The ultimate goal of speech enhancement is to minimize the effects of the noise on the performance of speech communication systems. The performance measure is system dependent. For systems which comprise recordings of noisy speech, or standard analog communication systems, the goal of speech enhancement is to improve perceptual aspects of the noisy signal. For example, improving the quality and intelligibility of the noisy signal are common goals. Quality is a subjective measure which reflects on the pleasantness of the speech or on the amount of effort needed to understand the speech material.
Intelligibility, on the other hand, is an objective measure which signifies the amount of speech material correctly understood. For standard digital communication systems, the goal of speech enhancement is to improve perceptual aspects of the encoded speech signal. For human-machine speech communication systems, the goal of speech enhancement is to reduce the error rate in recognizing the noisy speech signals.

To demonstrate the above ideas, consider a “hands-free” cellular radio telephone communication system. In this system, the transmitted signal is composed of the original speech and the background noise in the car. The background noise is generated by an engine, fan, traffic, wind, etc. The transmitted signal is also affected by the radio channel noise. As a result, noisy speech with low quality and intelligibility is delivered by such systems. The background noise may have additional devastating effects on the performance of this system. Specifically, if the system encodes the signal prior to its transmission, then the performance of the speech coder may significantly deteriorate in the presence of the noise. The reason is that speech coders rely on some statistical model for the clean signal, and this model becomes invalid when the signal is noisy. For a similar reason, if the cellular radio system is equipped with a speech recognizer for automatic dialing, then the error rate of such a recognizer will be elevated in the presence of the background noise. The goals of speech enhancement in this example are to improve perceptual aspects of the transmitted noisy speech signals as well as to reduce the speech recognizer error rate.

Other important applications of speech enhancement include improving the performance of:

1. Pay phones located in noisy environments (e.g., airports)
2. Air-ground communication systems in which the cockpit noise corrupts the pilot’s speech
3. Teleconferencing systems where noise sources in one location may be broadcast to all other locations
4. Long distance communication over noisy radio channels

The problem of speech enhancement has been a challenge for many researchers for almost three decades. Different solutions with various degrees of success have been proposed over the years. An excellent introduction to the problem, and a review of the systems developed up until 1979, can be found in the landmark paper by Lim and Oppenheim [1979]. A panel of the National Academy of Sciences discussed the problem and various ways to evaluate speech enhancement systems in 1988. The panel’s findings were summarized in Makhoul et al. [1989]. Modern statistical approaches for speech enhancement were recently reviewed in Boll [1992] and Ephraim [1992].
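Before turning to the specific approaches, the signal-to-noise ratio defined earlier can be made concrete with a short sketch (the function name and the test-signal parameters are invented for illustration, not taken from the text):

```python
import numpy as np

def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels: 10 * log10(signal power / noise power)."""
    p_signal = np.mean(np.asarray(signal, dtype=float) ** 2)
    p_noise = np.mean(np.asarray(noise, dtype=float) ** 2)
    return 10.0 * np.log10(p_signal / p_noise)

# A unit-amplitude tone (average power 0.5) in white noise of variance 0.05,
# so the SNR should come out near 10 dB.
t = np.arange(8000) / 8000.0
clean = np.sin(2.0 * np.pi * 440.0 * t)
noise = np.sqrt(0.05) * np.random.default_rng(0).standard_normal(t.size)
print(f"{snr_db(clean, noise):.1f} dB")
```

The same ratio is often quoted per frame rather than over a whole recording, since both speech and noise powers vary over time.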
In this section the principles and performance of the major speech enhancement approaches are reviewed, and the advantages and disadvantages of each approach are discussed. The signal is assumed to be corrupted by additive statistically independent noise. Only a single noisy version of the clean signal is assumed available for enhancement. Furthermore, it is assumed that the clean signal cannot be preprocessed to increase its robustness prior to being affected by the noise. Speech enhancement systems which can either preprocess the clean speech signal or which have access to multiple versions of the noisy signal obtained from a number of microphones are discussed in Lim [1983].

This presentation is organized as follows. In the second section the speech enhancement problem is formulated and commonly used models and performance measures are presented. In the next section signal estimation for improving perceptual aspects of the noisy signal is discussed. In the fourth section source coding techniques for noisy signals are summarized, and the last section deals with recognition of noisy speech signals. Due to the limited number of references (10) allowed in this publication, tutorial papers are mainly referenced. Appropriate credit will be given by pointing to the tutorial papers which reference the original papers.

Models and Performance Measures

The goals of speech enhancement as stated in the first section are to improve perceptual aspects of the noisy signal, whether the signal is transmitted through analog or digital channels, and to reduce the error rate in recognizing noisy speech signals. Improving perceptual aspects of the noisy signal can be accomplished by estimating the clean signal from the noisy signal using perceptually meaningful estimation performance measures. If the signal has to be encoded for transmission over digital channels, then source coding techniques can be applied to the given noisy signal.
In this case, a perceptually meaningful fidelity measure between the clean signal and the encoded noisy signal must be used. Reducing the error rate in speech communication systems can be accomplished by applying optimal signal classification approaches to the given noisy signals. Thus the speech enhancement problem is essentially a set of signal estimation, source coding, and signal classification problems.

The probabilistic approach for solving these problems requires explicit knowledge of the performance measure as well as the probability laws of the clean signal and noise process. Such knowledge, however, is not explicitly available. Hence, mathematically tractable performance measures and statistical models which are believed to be meaningful are used. In this section we briefly review the most commonly used statistical models and performance measures.

The most fundamental model for speech signals is the Gaussian autoregressive (AR) model. This model assumes that each 20- to 40-msec segment of the signal is generated from an excitation signal which is applied to a linear time-invariant all-pole filter. The excitation signal comprises a mixture of white Gaussian noise and a periodic sequence of impulses. The period of that sequence is determined by the pitch period of the speech signal. This model is described in Fig. 15.2. Generally, the excitation signal represents the flow of air through the vocal cords and the all-pole filter represents the vocal tract.

FIGURE 15.2 Gaussian autoregressive speech model.

The model for a given sample function of speech
signals, which is composed of several consecutive 20- to 40-msec segments of that signal, is obtained from the sequence of AR models for the individual segments. Thus, a linear time-varying AR model is assumed for each sample function of the speech signal. This model, however, is slowly varying in accordance with the slow temporal variation of the articulatory system. It was found that a set of approximately 2048 prototype AR models can reliably represent all segments of speech signals. The AR models are useful in representing the short time spectrum of the signal, since the spectrum of the excitation signal is white. Thus, the set of AR models represents a set of 2048 spectral prototypes for the speech signal.

The time-varying AR model for speech signals lacks the “memory” which assigns preference to one AR model to follow another AR model. This memory could be incorporated, for example, by assuming that the individual AR models are chosen in a Markovian manner. That is, given an AR model for the current segment of speech, certain AR models for the following segment of speech will be more likely than others. This results in the so-called composite source model (CSM) for the speech signal.

A block diagram of a CSM is shown in Fig. 15.3. In general, this model is composed of a set of M vector subsources which are controlled by a switch. The position of the switch at each time instant is chosen randomly, and the output of one subsource is provided. The position of the switch defines the state of the source at each time instant. CSMs for speech signals assume that the subsources are Gaussian AR sources, and the switch is controlled by a Markov chain. Furthermore, the subsources are usually assumed statistically independent and the vectors generated from each subsource are also assumed statistically independent.
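As an illustrative sketch of such a composite source, the following simulates a Markov-chain switch selecting between two hypothetical Gaussian AR subsources; the AR coefficients, gains, and transition probabilities are invented for the example and are not taken from the text:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical Gaussian AR(2) subsources: all-pole filters driven by
# white Gaussian noise, each with a different (stable) resonance.
ar_coeffs = [np.array([1.6, -0.8]), np.array([0.4, -0.5])]
gains = [1.0, 0.3]

# Markov chain controlling the switch; states tend to persist, mimicking
# the slow variation of the articulatory system.
P = np.array([[0.95, 0.05],
              [0.05, 0.95]])

def generate_csm(n_frames, frame_len=160):
    """Generate a signal by switching between Gaussian AR subsources."""
    state = 0
    mem = np.zeros(2)                 # all-pole filter memory across frames
    out, states = [], []
    for _ in range(n_frames):
        a, g = ar_coeffs[state], gains[state]
        x = np.empty(frame_len)
        for n in range(frame_len):
            e = g * rng.standard_normal()             # white Gaussian excitation
            x[n] = e + a[0] * mem[0] + a[1] * mem[1]  # AR(2) recursion
            mem[1], mem[0] = mem[0], x[n]
        out.append(x)
        states.append(state)
        state = rng.choice(2, p=P[state])             # Markovian switch
    return np.concatenate(out), states

signal, states = generate_csm(50)
```

A speech-shaped CSM would use many more subsources (e.g., the 2048 spectral prototypes mentioned above) and add the periodic impulse component of the excitation for voiced segments.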
The resulting model is known as a hidden Markov model (HMM) [Rabiner, 1989] since the output of the model does not contain the states of the Markovian switch.

FIGURE 15.3 Composite source model.

The performance measure for speech enhancement is task dependent. For signal estimation and coding, this measure is given in terms of a distortion measure between the clean signal and the estimated or the encoded signals, respectively. For signal classification applications the performance measure is normally the probability of misclassification. Commonly used distortion measures are the mean-squared error (MSE) and the Itakura-Saito distortion measures. The Itakura-Saito distortion measure is a measure between two power spectral densities, of which one is usually that of the clean signal and the other of a model for that signal [Gersho and Gray, 1991]. This distortion measure is normally used in designing speech coding systems and it is believed to be perceptually meaningful. Both measures are mathematically tractable and lead to intuitive estimation and coding schemes. Systems designed using these two measures need not be optimal only in the MSE and the Itakura-Saito sense, but they may as well be optimal in other more meaningful senses (see a discussion in Ephraim [1992]).

Signal Estimation

In this section we review the major approaches for speech signal estimation given noisy signals.
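The two distortion measures discussed above can be sketched directly from their definitions; the discrete form of the Itakura-Saito distortion used here is the standard one between two sampled power spectral densities, and the toy spectra are invented for illustration:

```python
import numpy as np

def itakura_saito(p, q):
    """Itakura-Saito distortion between sampled power spectral densities p and q:
    mean of p/q - log(p/q) - 1. Zero iff p == q; note it is asymmetric."""
    r = np.asarray(p, dtype=float) / np.asarray(q, dtype=float)
    return float(np.mean(r - np.log(r) - 1.0))

def mse(x, y):
    """Mean-squared error between two signals."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.mean((x - y) ** 2))

# Toy power spectra: identical spectra give zero distortion, while a uniform
# 3 dB level mismatch is penalized even though the spectral shape matches.
p = np.array([1.0, 2.0, 4.0, 2.0])
print(itakura_saito(p, p))          # 0.0
print(itakura_saito(p, 2.0 * p))    # positive
```

The asymmetry is deliberate: underestimating the spectrum is penalized differently from overestimating it, which is part of why the measure is considered perceptually relevant for spectral matching.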
Spectral Subtraction

The spectral subtraction approach [Weiss, 1974] is the simplest, most intuitive, and most popular speech enhancement approach. It provides estimates of the clean signal as well as of the short-time spectrum of that signal. Estimation is performed on a frame-by-frame basis, where each frame consists of 20–40 msec of speech samples. In the spectral subtraction approach the signal is Fourier transformed, and spectral components whose variance is smaller than that of the noise are nulled. The surviving spectral components are modified by an appropriately chosen gain function. The resulting set of nulled and modified spectral components constitutes the spectral components of the enhanced signal. The signal estimate is obtained from the inverse Fourier transform of the enhanced spectral components. The short-time spectrum estimate of the signal is obtained by squaring the enhanced spectral components. A block diagram of the spectral subtraction approach is shown in Fig. 15.4.

Gain functions motivated by different perceptual aspects have been used. One of the most popular functions results from linear minimum MSE (MMSE) estimation of each spectral component of the clean signal given the corresponding spectral component of the noisy signal. In this case, the value of the gain function for a given spectral component constitutes the ratio of the variances of the clean and noisy spectral components. The variance of the clean spectral component is obtained by subtracting an assumed known variance of the noise spectral component from the variance of the noisy spectral component. The resulting variance is guaranteed to be positive by the nulling process mentioned above. The variances of the spectral components of the noise process are normally estimated from silence portions of the noisy signal.
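The per-frame procedure described above can be sketched as follows. This is an illustrative sketch only, assuming a single frame, a known noise power spectrum estimated from silence, and the MMSE (variance-ratio) gain; windowing/overlap-add details of a real enhancer are omitted, and the function name is ours:

```python
import numpy as np

def spectral_subtract_frame(noisy_frame, noise_psd, eps=1e-12):
    """Enhance one frame by spectral subtraction.

    noise_psd is an estimate of the noise spectral variances E{|V_n|^2},
    typically obtained from silence portions of the noisy signal.
    Returns the enhanced time-domain frame and the short-time
    spectrum estimate of the clean signal.
    """
    Z = np.fft.rfft(noisy_frame)                    # spectral components of noisy frame
    noisy_psd = np.abs(Z) ** 2
    clean_psd = np.maximum(noisy_psd - noise_psd, 0.0)  # subtract, nulling negatives
    gain = clean_psd / np.maximum(noisy_psd, eps)   # ratio of clean/noisy variances
    enhanced = gain * Z                             # modify surviving components
    return np.fft.irfft(enhanced, n=len(noisy_frame)), clean_psd
```

The nulling step (`np.maximum(..., 0.0)`) is exactly what guarantees the subtracted variance, and hence the gain, stays nonnegative.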
A family of spectral gain functions proposed in Lim and Oppenheim [1979] is given by

g_n = \left( \frac{|Z_n|^a - b\,E\{|V_n|^a\}}{|Z_n|^a} \right)^c, \qquad n = 1, \ldots, N    (15.2)

where Z_n and V_n denote the nth spectral components of the noisy signal and the noise process, respectively, and a > 0, b ≥ 0, c > 0. The MMSE gain function is obtained when a = 2, b = 1, and c = 1. Another commonly used gain function in the spectral subtraction approach is obtained by using a = 2, b = 1, and c = 1/2. This gain function results from estimating the spectral magnitude of the signal and combining the resulting estimate with the phase of the noisy signal. This choice of gain function is motivated by the relative importance of the spectral magnitude of the signal compared to its phase. Since both cannot be simultaneously optimally estimated [Ephraim, 1992], only the spectral magnitude is optimally estimated and combined with an estimate of the complex exponential of the phase, which does not affect the spectral magnitude estimate. The resulting estimate

FIGURE 15.4 Spectral subtraction signal estimator.
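The gain family of Eq. (15.2) can be sketched directly, with the nulling of negative arguments made explicit. A minimal illustration, assuming the noise moments E{|V_n|^a} are known, with a function name of our choosing:

```python
import numpy as np

def lim_oppenheim_gain(Z, noise_moment, a=2.0, b=1.0, c=1.0, eps=1e-12):
    """Gain of Eq. (15.2): g_n = ((|Z_n|^a - b E{|V_n|^a}) / |Z_n|^a)^c.

    Negative numerators are nulled. a=2, b=1, c=1 gives the MMSE
    (variance-ratio) gain; a=2, b=1, c=1/2 gives the spectral-magnitude
    subtraction gain combined with the noisy phase.
    """
    Za = np.abs(Z) ** a
    num = np.maximum(Za - b * np.asarray(noise_moment, float), 0.0)
    return (num / np.maximum(Za, eps)) ** c
```

Note that for fixed a and b the c = 1/2 gain is simply the square root of the c = 1 (MMSE) gain, so both choices null exactly the same set of components.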