of the phase can be shown to be the phase of the noisy signal within the HMM statistical framework. Normally, the spectral subtraction approach is used with b = 2, which corresponds to an artificially elevated noise level. The spectral subtraction approach has been very popular since it is relatively easy to implement; it makes minimal assumptions about the signal and noise; and, when carefully implemented, it results in reasonably clear enhanced signals.

A major drawback of the spectral subtraction enhancement approach, however, is that the residual noise has annoying tonal characteristics referred to as “musical noise.” This noise consists of narrowband signals with time-varying frequencies and amplitudes. Another major drawback of the spectral subtraction approach is that its optimality in any given sense has never been proven. Thus, no systematic methodology for improving the performance of this approach has been developed, and all attempts to achieve this goal have been based on purely heuristic arguments. As a result, a family of spectral subtraction speech enhancement approaches has been developed and experimentally optimized. In a recent work [Ephraim et al., 1995] a version of the spectral subtraction was shown to be a signal subspace estimation approach which is asymptotically optimal (as the frame length approaches infinity) in the linear MMSE sense.
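To make the mechanics concrete, the following is a minimal sketch of frame-based power spectral subtraction (the b = 2 case mentioned above), written in Python with NumPy. The windowing, the noise power spectrum estimate, and the spectral floor used to tame musical noise are illustrative assumptions, not a specification taken from this chapter.

```python
import numpy as np

def spectral_subtraction(noisy_frame, noise_psd, floor=1e-3):
    """Power spectral subtraction on a single windowed frame (b = 2 case).

    noisy_frame : 1-D array of time-domain samples (already windowed).
    noise_psd   : estimate of the noise power spectrum (length N//2 + 1),
                  e.g., averaged over frames declared to be noise only.
    floor       : illustrative spectral floor used to limit musical noise.
    """
    Z = np.fft.rfft(noisy_frame)
    noisy_psd = np.abs(Z) ** 2
    # Subtract the (possibly over-estimated) noise power spectrum and clamp
    # to a small positive floor; the clamping is one common heuristic remedy
    # for the "musical noise" discussed above.
    clean_psd = np.maximum(noisy_psd - noise_psd, floor * noisy_psd)
    # Combine the estimated magnitude with the noisy phase, since the phase
    # of the noisy signal is retained.
    Y = np.sqrt(clean_psd) * np.exp(1j * np.angle(Z))
    return np.fft.irfft(Y, n=len(noisy_frame))
```

In practice the frames would be overlapped and added back together, and the noise spectrum would be tracked during speech pauses; those details are omitted here.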
Empirical Averages

This approach attempts to estimate the clean signal from the noisy signal in the MMSE sense. The conditional mean estimator is implemented using the conditional sample average of the clean signal given the noisy signal. The sample average is obtained from appropriate training sequences of the clean and noisy signals. This is equivalent to using the sample distribution, or histogram estimate, of the probability density function (pdf) of the clean signal given the noisy signal. The sample average approach is applicable for estimating the signal as well as functionals of that signal, e.g., the spectrum, the logarithm of the spectrum, and the spectral magnitude.

Let {Y_t, t = 0, . . ., T} be training data from the clean signal, where Y_t is a K-dimensional vector in the Euclidean space R^K. Let {Z_t, t = 0, . . ., T} be training data from the noisy signal, where Z_t ∈ R^K. The sequence {Z_t} can be obtained by adding a noise training sequence {V_t, t = 0, . . ., T} to the sequence of clean signals {Y_t}. Let z ∈ R^K be a vector of the noisy signal from which the vector y of the clean signal is estimated. Let Y(z) = {Y_t : Z_t = z, t = 0, . . ., T} be the set of all clean vectors from the training data of the clean signal which could have resulted in the given noisy observation z. The cardinality of this set is denoted by |Y(z)|. Then, the sample average estimate of the conditional mean of the clean signal y given the noisy signal z is given by

\hat{y} = E\{y \mid z\} = \frac{\int y \, p(y, z) \, dy}{\int p(y, z) \, dy} = \frac{1}{|Y(z)|} \sum_{Y_t \in Y(z)} Y_t \qquad (15.3)

Obviously, this approach is only applicable for signals with a finite alphabet, since otherwise the set Y(z) is empty with probability one. For signals with continuous pdf's, the approach can be applied only if those signals are appropriately quantized.
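The following sketch illustrates Eq. (15.3) directly for quantized (finite-alphabet) training data; the function and variable names are hypothetical, and no smoothing of the empirical conditional pdf is attempted.

```python
import numpy as np

def sample_average_estimate(z, Y_train, Z_train):
    """Conditional sample-average estimate of Eq. (15.3).

    Y_train, Z_train : aligned training vectors of the clean and noisy
                       signals, shape (T+1, K), assumed quantized so that
                       exact matches Z_t == z can occur.
    z                : observed noisy vector, shape (K,).
    Returns the average of all clean training vectors whose noisy
    counterpart equals z, or None when the set Y(z) is empty.
    """
    matches = np.all(Z_train == z, axis=1)   # indices t with Z_t == z
    if not np.any(matches):
        return None                          # Y(z) is empty
    return Y_train[matches].mean(axis=0)     # (1/|Y(z)|) * sum of Y_t
```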
The sample average approach was first applied to enhancing speech signals by Porter and Boll in 1984 [Boll, 1992]. They, however, considered a simpler situation in which the true pdf of the noise was assumed known. In this case, enhanced signals with residual noise characterized as being a blend of wideband noise and musical noise were obtained. The balance between the two types of residual noise depended on the functional of the clean signal which was estimated.

The advantages of the sample average approach are that it is conceptually simple and it does not require a priori assumptions about the form of the pdf's of the signal and noise. Hence, it is a nonparametric estimation approach. This approach, however, has three major disadvantages. First, the estimator does not utilize any speech-specific information such as the periodicity of the signal and the signal's AR model.
Second, the training sequences from the signal and noise must be available at the speech enhancement unit. Furthermore, these training sequences must be applied for each newly observed vector of the noisy signal. Since the training sequences are normally very long, the speech enhancement unit must have extensive memory and computational resources. These problems are addressed in the model-based approach described next.

Model-Based Approach

The model-based approach [Ephraim, 1992] is a Bayesian approach for estimating the clean signal or any functional of that signal from the observed noisy signal. This approach assumes CSMs for the clean signal and the noise process. The models are estimated from training sequences of those processes using the maximum likelihood (ML) estimation approach. Under ideal conditions the ML model estimate is consistent and asymptotically efficient. The ML model estimation is performed using the expectation-maximization (EM), or Baum, iterative algorithm [Rabiner, 1989; Ephraim, 1992]. Given the CSMs for the signal and noise, the clean signal is estimated by minimizing the expected value of the chosen distortion measure. The model-based approach uses significantly more statistical knowledge about the signal and noise than either the spectral subtraction or the sample average approach.

The MMSE signal estimator is obtained from the conditional mean of the clean signal given the noisy signal. If y_t ∈ R^K denotes the vector of the speech signal at time t, and z_0^t denotes the sequence of K-dimensional vectors of noisy signals {z_0, . . ., z_t} from time 0 to time t, then the MMSE estimator of y_t is given by

\hat{y}_t = E\{y_t \mid z_0^t\} = \sum_{\bar{x}_t} P(\bar{x}_t \mid z_0^t) \, E\{y_t \mid z_t, \bar{x}_t\} \qquad (15.4)

where x̄_t denotes the composite state of the noisy signal at time t. This state is given by x̄_t ≜ (x_t, x̃_t), where x_t is the Markov state of the clean signal at time t and x̃_t denotes the Markov state of the noise process at the same time instant. The MMSE estimator, Eq. (15.4), comprises a weighted sum of conditional mean estimators for the composite states of the noisy signal, where the weights are the probabilities of those states given the noisy observed signal. A block diagram of this estimator is shown in Fig. 15.5.

FIGURE 15.5  HMM-based MMSE signal estimator.

The probability P(x̄_t | z_0^t) can be efficiently calculated using the forward recursion associated with HMMs. For CSMs with Gaussian subsources, the conditional mean E{y_t | z_t, x̄_t} is a linear function of the noisy vector z_t, given by

E\{y_t \mid z_t, \bar{x}_t\} = S_{x_t} \left( S_{x_t} + S_{\tilde{x}_t} \right)^{-1} z_t \triangleq H_{\bar{x}_t} z_t \qquad (15.5)

where S_{x_t} and S_{x̃_t} denote the covariance matrices of the Gaussian subsources associated with the Markov states x_t and x̃_t, respectively. Since, however, P(x̄_t | z_0^t) is a nonlinear function of the noisy signal z_0^t, the MMSE signal estimator ŷ_t is a nonlinear function of the noisy signal z_0^t.
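As a small illustration of Eq. (15.5), the sketch below precomputes the bank of filters H_x̄ from the per-state covariance matrices of the clean-speech and noise subsources. It assumes those covariances are already available from ML training of the CSMs; the data layout and names are assumptions made here for illustration.

```python
import numpy as np

def build_state_filters(signal_covs, noise_covs):
    """Precompute the filters of Eq. (15.5), one per composite state.

    signal_covs : list of K x K covariance matrices, one per clean-speech
                  Markov state x.
    noise_covs  : list of K x K covariance matrices, one per noise state x~.
    Returns a dict mapping the composite state (x, x~) to
    H = S_x (S_x + S_x~)^{-1}.
    """
    filters = {}
    for i, S_x in enumerate(signal_covs):
        for j, S_v in enumerate(noise_covs):
            # An explicit inverse is fine for a sketch; a linear solver
            # would normally be preferred for numerical robustness.
            filters[(i, j)] = S_x @ np.linalg.inv(S_x + S_v)
    return filters
```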
The MMSE estimator, Eq. (15.4), is intuitively appealing. It uses a predesigned set of filters {H_x̄} obtained from training data of speech and noise. Each filter is optimal for a pair of subsources of the CSMs for the clean signal and the noise process. Since each subsource represents a subset of signals from the corresponding source, each filter is optimal for a pair of signal subsets from the speech and noise. The set of predesigned filters covers all possible pairs of speech and noise signal subsets. Hence, for each noisy vector of speech there must exist an optimal filter in the set of predesigned filters. Since, however, a vector of the noisy signal could possibly be generated from any pair of subsources of the clean signal and noise, the most appropriate filter for a given noisy vector is not known. Consequently, in estimating the signal vector at each time instant, all filters are tried and their outputs are weighted by the probabilities of the filters being correct for the given noisy signal.

Other strategies for utilizing the predesigned set of filters are possible. For example, at each time instant only the most likely filter can be applied to the noisy signal. This approach is more intuitive than that of the MMSE estimation. It was first proposed in Drucker [1968] for a five-state model which comprises subsources for fricatives, stops, vowels, glides, and nasals. This approach was shown by Ephraim and Merhav [Ephraim, 1992] to be optimal only in an asymptotic MMSE sense.

The model-based MMSE approach provides reasonably good enhanced speech quality with significantly less structured residual noise than the spectral subtraction approach. This performance was achieved for white Gaussian input noise at 10 dB input SNR using 512–2048 filters. An improvement of 5–6 dB in SNR was achieved by this approach. The model-based approach, however, is more elaborate than the spectral subtraction approach, since it involves two steps of training and estimation, and training must be performed on sufficiently long data. The MMSE estimation approach is usually superior to the asymptotic MMSE enhancement approach. The reason is that the MMSE approach applies a “soft decision” rather than a “hard decision” in choosing the most appropriate filter for a given vector of the noisy signal.

A two-state version of the MMSE estimator was first applied to speech enhancement by McAulay and Malpass in 1980 [Ephraim, 1992]. The two states corresponded to speech presence and speech absence (silence) in the noisy observations. The estimator for the signal given that it is present in the noisy observations was implemented by the spectral subtraction approach. The estimator for the signal in the “silence state” is obviously equal to zero. This approach significantly improved the performance of the spectral subtraction approach.
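A sketch of the soft-decision combination of Eq. (15.4) is given below, together with the hard-decision (most likely filter) variant discussed above. The state posteriors P(x̄_t | z_0^t) are assumed to be supplied by the standard HMM forward recursion, which is not reproduced here; the names are illustrative.

```python
import numpy as np

def hmm_mmse_estimate(z_t, filters, state_posteriors):
    """Soft-decision MMSE estimate of Eq. (15.4) for one noisy vector.

    z_t              : noisy observation vector at time t, shape (K,).
    filters          : dict {composite_state: H} built as in Eq. (15.5).
    state_posteriors : dict {composite_state: P(state | z_0, ..., z_t)},
                       assumed computed by the HMM forward recursion.
    """
    y_hat = np.zeros_like(z_t, dtype=float)
    for state, H in filters.items():
        # Weight each filter output by the probability of that filter
        # being the correct one for the observed noisy signal.
        y_hat += state_posteriors[state] * (H @ z_t)
    return y_hat

def hmm_most_likely_filter_estimate(z_t, filters, state_posteriors):
    """Hard-decision variant: apply only the most likely filter."""
    best = max(state_posteriors, key=state_posteriors.get)
    return filters[best] @ z_t
```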
Source Coding

An encoder for the clean signal maps vectors of that signal onto a finite set of representative signal vectors referred to as codewords. The mapping is performed by assigning each signal vector to its nearest neighbor codeword. The index of the chosen codeword is transmitted to the receiver in a signal communication system, and the signal is reconstructed using a copy of the chosen codeword. The codewords are designed to minimize the average distortion resulting from the nearest neighbor mapping. The codewords may simply represent waveform vectors of the signal. In another important application of low bit-rate speech coding, the codewords represent a set of parameter vectors of the AR model for the speech signal. Such coding systems synthesize the signal using the speech model in Fig. 15.2. The synthesis is performed using the encoded vector of AR coefficients as well as the parameters of the excitation signal. Reasonably good speech quality can be obtained using this coding approach at rates as low as 2400–4800 bits/s [Gersho and Gray, 1991].

When only noisy signals are available for coding, the encoder operates on the noisy signal while representing the clean signal. In this case, the encoder is designed by minimizing the average distortion between the clean signal and the encoded signal. Specifically, let y denote the vector of the clean signal to be encoded. Let z denote the corresponding given vector of the noisy signal. Let q denote the encoder. Let d denote a distortion measure. Then, the optimal encoder is designed by
\min_{q} E\{d(y, q(z))\} \qquad (15.6)

When the clean signal is available for encoding, the design problem is similarly defined, and it is obtained from Eq. (15.6) using z = y. The design problem in Eq. (15.6) is not standard, since the encoder operates on one source (the noisy signal) while representing another (the clean signal). The problem can be transformed into a standard coding problem by appropriately modifying the distortion measure. This was shown by Berger in 1971 and by Ephraim and Gray in 1988 [Ephraim, 1992]. Specifically, define the modified distortion measure by

d'(z, q(z)) \triangleq E\{d(y, q(z)) \mid z\} \qquad (15.7)

Then, by using iterated expectation in Eq. (15.6), that is, E\{d(y, q(z))\} = E\{E\{d(y, q(z)) \mid z\}\}, the design problem becomes

\min_{q} E\{d'(z, q(z))\} \qquad (15.8)

A useful class of encoders for speech signals are those obtained from vector quantization. Vector quantizers are designed using the Lloyd algorithm [Gersho and Gray, 1991]. This is an iterative algorithm in which the codewords and the nearest neighbor regions are alternately optimized. This algorithm can be applied to design vector quantizers for clean and noisy signals using the modified distortion measure.

The problem of designing vector quantizers for noisy signals is related to the problem of estimating the clean signals from the noisy signals, as was shown by Wolf and Ziv in 1970 and by Ephraim and Gray in 1988 [Ephraim, 1992]. Specifically, optimal waveform vector quantizers in the MMSE sense can be designed by first estimating the clean signal and then quantizing the estimated signal. Both estimation and quantization are performed in the MMSE sense. Similarly, optimal quantization of the vector of parameters of the AR model for the speech signal in the Itakura-Saito sense can be performed in two steps of estimation and quantization. Specifically, the autocorrelation function of the clean signal, which approximately constitutes the sufficient statistics of that signal for estimating the AR model, is first estimated in the MMSE sense. Then, optimal vector quantization in the Itakura-Saito sense is applied to the estimated autocorrelation.

The estimation-quantization approach has been the most popular in designing encoders for speech signals given noisy signals. Since such a design requires explicit knowledge of the statistics of the clean signal and the noise process, and this knowledge is not available as argued in the second section, a variety of suboptimal encoders were proposed. Most of the research in this area focused on designing encoders for the AR model of the signal, due to the importance of such encoders in low bit-rate speech coding. The proposed encoders differ mainly in the estimators they use and in the functionals of the speech signal to which these estimators are applied. Important examples of functionals which have commonly been estimated include the signal waveform, the autocorrelation, and the spectral magnitude. The primary set of estimators used for this application were obtained from the spectral subtraction approach and its derivatives. A version of the sample average estimator was also developed and applied to AR modeling by Juang and Rabiner in 1987 [Ephraim, 1992]. Recently, the HMM-based estimator of the autocorrelation function of the clean signal was used in AR model vector quantization [Ephraim, 1992].

The design of AR model-based encoders from noisy signals has been a very successful application of speech enhancement. In this case both the quality and intelligibility of the encoded signal can be improved compared to the case where the encoder is designed for the clean signal and the input noise is simply ignored. The reason is that the input noise has devastating effects on the performance of AR model-based speech coders, and any “reasonable” estimation approach can significantly improve the performance of those coders in noisy environments.
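As an illustration of the estimation-quantization result above, the sketch below encodes a noisy vector under the modified distortion of Eq. (15.7) for the squared-error case, where minimizing d'(z, q(z)) reduces to nearest-neighbor encoding of the MMSE estimate E{y | z}. The codebook and the clean-signal estimator are placeholders; any of the estimators discussed earlier could be plugged in.

```python
import numpy as np

def encode_noisy_vector(z, codebook, estimate_clean):
    """Encode a noisy vector so as to represent the clean one.

    For the squared-error distortion, the modified distortion of Eq. (15.7)
    decomposes as
        d'(z, c) = E{||y - c||^2 | z}
                 = E{||y - y_hat||^2 | z} + ||y_hat - c||^2,
    with y_hat = E{y | z}, so nearest-neighbor encoding of the MMSE
    estimate y_hat minimizes d'(z, q(z)).

    codebook       : array of shape (M, K) of codewords, e.g., obtained by
                     running the Lloyd algorithm with the same distortion.
    estimate_clean : any estimator of the clean vector from z (spectral
                     subtraction, sample average, HMM-based, ...).
    Returns the index of the chosen codeword.
    """
    y_hat = estimate_clean(z)
    dists = np.sum((codebook - y_hat) ** 2, axis=1)  # squared error to each codeword
    return int(np.argmin(dists))
```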
Signal Classification

In recognition of clean speech signals, a sample function of the signal is associated with one of the words in the vocabulary. The association, or decision rule, is designed to minimize the probability of classification error. When only noisy speech signals are available for recognition, a very similar problem results. Specifically, a sample