We propose a novel phase-based method for single-channel speech enhancement to extract and enhance the desired signals in noisy environments by utilizing the phase information. In the method, a phase-dependent a priori signal-to-noise ratio (SNR) is estimated in the log-mel spectral domain to utilize both the magnitude and phase information of input speech signals. The phase-dependent estimator is incorporated into the conventional magnitude-based decision-directed approach that recursively computes the a priori SNR from noisy speech. Additionally, we reduce the performance degradation owing to the one-frame delay of the estimated phase-dependent a priori SNR by using a minimum mean square error (MMSE)-based and maximum a posteriori (MAP)-based estimator. In our speech enhancement experiments, the proposed phase-dependent a priori SNR estimator is shown to improve the output SNR by 2.6 dB for both the MMSE-based and MAP-based estimator cases as compared to a conventional magnitude-based estimator.
Along with recent developments in digital signal processing and multimedia communication technologies, a variety of speech communication services based on speech recognition systems have become popular. Although speech recognition systems generally show high accuracy in quiet environments, they suffer rapid performance degradation in noisy environments. In a realistic speech recognition scenario, however, speech signals are frequently contaminated by background noise sources. These noise sources have prevented the widespread use of automatic speech recognition systems in real environments. Automatic speech processing techniques still perform worse than the human ear in separating target speech signals from other mixed audio signals. A reduction of acoustical background noise or an enhancement of the speech signals is important to enhance the speech quality, reduce listener fatigue on speech communication terminals, and improve the speech recognition accuracy of smartphones [1]–[4].
Single-channel speech enhancement technologies enhance the speech signals or reduce noise from the noisy signals captured by a single microphone. In a unified view of single-microphone speech enhancement systems, the enhancement process depends on the estimation of the spectral gain, which is a function of the a priori signal-to-noise ratio (SNR) or the a posteriori SNR, to enhance the desired signal [1]. A decision-directed (DD) approach is widely used to determine the a priori SNR from noisy speech signals because it effectively reduces musical noise, which is the residual noise of estimated frames and is annoying to listeners [3]–[5]. However, this method has a serious drawback in that the estimated a priori SNR follows the shape of the a posteriori SNR with a delay of a single short-time frame [5]. This delay is due to the use of the speech spectrum estimated in the previous frame to compute the current a priori SNR.
In addition, in a conventional DD approach, only the spectral magnitude components are used to compute the a priori SNR, and the phase components are disregarded based on the assumption that the phase difference has zero mean. However, the phase components are known to carry some speech information and to be useful in human speech perception and automatic speech recognition [6]–[10].
In this paper, we estimate the phase-dependent a priori SNR in the log-mel spectral domain by applying a nonlinear transform. In the proposed a priori SNR estimator, we do not assume that the phase components have zero mean. After translating the power spectral vector of a noisy speech signal into the log-mel spectral vector, we estimate the phase-dependent a priori SNR by utilizing both the magnitude and phase information. The conventional DD approach is also combined with the estimator to recursively obtain the a priori SNR. Detailed descriptions of the phase-dependent a priori SNR estimator can be found in [11], which was previously published in conference proceedings. In addition, we refine the estimated phase-dependent a priori SNR using MMSE-based and MAP-based a priori SNR estimators to solve the delay problem while maintaining the advantages of the DD approach. Experimental results show that the proposed estimator improves the output SNR.
The remainder of this paper is organized as follows. Section II describes the signal modeling. Section III describes a conventional DD approach, the proposed phase-dependent a priori SNR estimator in the log-mel domain, the phase-based DD approach, and the minimum mean square error (MMSE)-based and maximum a posteriori (MAP)-based two-step a priori SNR estimators. Section IV describes the experimental results, and finally, Section V offers some concluding remarks.
II. Signal Modeling
Let x(t) and n(t) represent the original speech signal and the noise captured by a single microphone, respectively. The mixed speech signal y(t) is simply the sum of these two signals: $$y(t)=x(t)+n(t).$$ We assume that x(t) and n(t) are uncorrelated with each other. Let X and N represent the spectral magnitudes of the speech signal and the noise, respectively. Denoting the spectral magnitude of the noisy speech signal by Y, the relationship between the noisy speech, clean speech, and noise in the power spectral domain can be shown as follows [7]: $${Y}^{2}={X}^{2}+{N}^{2}+2\mathrm{cos}(\theta )XN,$$ where θ is the vector of phase differences between X and N.
Typically, the phase term 2cos(θ)XN is disregarded based on the assumption that it is zero on average: $${Y}^{2}={X}^{2}+{N}^{2}.$$ However, when (1) is nonlinearly transformed into the log-mel domain by taking the logarithm, the phase term might not be zero on average [7] because the mean of a nonlinearly transformed pdf is not necessarily equal to the transformed mean of the original pdf.
The mel-scale filter is a filter bank whose center frequencies are located on the mel-frequency scale and whose bandwidth increases as the center frequency increases. It resembles the human auditory system in that it is more sensitive in the low-frequency bands. Let mel represent the index on the mel scale at f Hz, which is given as $$mel=1127\mathrm{ln}\left(f/700+1\right).$$ Figure 1 shows the 23 normalized mel-scale filters used in this work. For each mel-scale filter, a single coefficient is obtained by weighting the power spectrum coefficients within the mel-scale filter with a filter bank matrix.
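The claim that the phase term averages to zero in the power domain but not after the logarithm can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes a single frequency bin, a phase difference spread uniformly over [0, 2π), and hypothetical magnitudes `X_MAG` and `N_MAG`; the mel filter bank weighting is left out. It also evaluates the mel-scale formula given above.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Mel index for a frequency in Hz: mel = 1127 ln(f/700 + 1)."""
    return 1127.0 * math.log(f_hz / 700.0 + 1.0)


def noisy_power(x_mag: float, n_mag: float, theta: float) -> float:
    """Noisy power Y^2 = X^2 + N^2 + 2 cos(theta) X N for one bin."""
    return x_mag ** 2 + n_mag ** 2 + 2.0 * math.cos(theta) * x_mag * n_mag


def mean_over_phase(fn, steps: int = 3600) -> float:
    """Average fn(theta) over an evenly spaced grid of theta in [0, 2*pi)."""
    return sum(fn(2.0 * math.pi * i / steps) for i in range(steps)) / steps


# Hypothetical magnitudes, chosen only for illustration.
X_MAG, N_MAG = 1.0, 0.5

# Power domain: the cross term averages out, leaving X^2 + N^2.
mean_power = mean_over_phase(lambda t: noisy_power(X_MAG, N_MAG, t))

# Log domain: the mean of log(Y^2) differs from log(X^2 + N^2).
mean_log_power = mean_over_phase(
    lambda t: math.log(noisy_power(X_MAG, N_MAG, t))
)
phase_free_log_power = math.log(X_MAG ** 2 + N_MAG ** 2)

print(f"mel(1000 Hz)            = {hz_to_mel(1000.0):.1f}")
print(f"mean of Y^2             = {mean_power:.4f}")
print(f"mean of log(Y^2)        = {mean_log_power:.4f}")
print(f"log(X^2 + N^2), no phase = {phase_free_log_power:.4f}")
```

Under these assumptions the gap between the last two printed values is the bias that a phase-free model would introduce in the log domain.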
Let Y_{p}, X_{p}, and N_{p} denote the products of Y^{2}, X^{2}, and N^{2} with the mel filter bank matrix W, respectively [12]–[13]. Their relationship in the mel spectral domain becomes $${Y}_{p}={X}_{p}+{N}_{p}+2\sqrt{{X}_{p}{N}_{p}}\mathrm{cos}\left({\theta}_{{X}_{p}}-{\theta}_{{N}_{p}}\right),$$ where θ_{Xp} and θ_{Np} are the phase spectra of X_{p} and N_{p}, respectively. Since (5) is a quadratic function of the square root of X_{p}, we can obtain the following two solutions: $${X}_{p}={\left(-{c}_{{X}_{p}{N}_{p}}\sqrt{{N}_{p}}\pm \sqrt{({c}_{{X}_{p}{N}_{p}}^{2}-1){N}_{p}+{Y}_{p}}\right)}^{2},$$ where c_{XpNp} is defined as follows: $${c}_{{X}_{p}{N}_{p}}=\mathrm{cos}({\theta}_{{X}_{p}}-{\theta}_{{N}_{p}}).$$
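The two-solution formula can be exercised with concrete numbers. This is an illustrative sketch with hypothetical values for X_{p}, N_{p}, and the phase cosine; the function names are not from the paper. It shows that the sign that recovers the true X_{p} changes from case to case.

```python
import math


def noisy_mel_power(x_p: float, n_p: float, c_xn: float) -> float:
    """Forward model: Y_p = X_p + N_p + 2 sqrt(X_p N_p) c_XN."""
    return x_p + n_p + 2.0 * math.sqrt(x_p * n_p) * c_xn


def clean_power_candidates(y_p: float, n_p: float, c_xn: float):
    """Return the two candidate values of X_p for the '+' and '-' signs."""
    discriminant = (c_xn ** 2 - 1.0) * n_p + y_p
    if discriminant < 0.0:
        raise ValueError("no real solution for these inputs")
    root = math.sqrt(discriminant)
    base = -c_xn * math.sqrt(n_p)
    return (base + root) ** 2, (base - root) ** 2


# Case A (hypothetical values): the '+' sign recovers X_p = 4.
y_a = noisy_mel_power(4.0, 1.0, 0.5)
plus_a, minus_a = clean_power_candidates(y_a, 1.0, 0.5)

# Case B (hypothetical values): the '-' sign recovers X_p = 1.
y_b = noisy_mel_power(1.0, 4.0, -0.9)
plus_b, minus_b = clean_power_candidates(y_b, 4.0, -0.9)

print(plus_a, minus_a)
print(plus_b, minus_b)
```

In case B both candidates come from a non-negative square root, so the inputs alone do not reveal which one is correct.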
III. Phase-Based Speech Enhancement Algorithm
- 1. DD Approach
The DD approach is a widely used method to determine the a priori SNR from a noisy signal [4], where the a priori SNR is recursively estimated based on its definition and its relationship with the a posteriori SNR. The a posteriori SNR, which is the parameter for noise suppression, is defined as the ratio of the power spectra of the noisy signal and the noise. The a posteriori SNR at the mth frame and kth frequency bin, γ(m,k), is given by $$\gamma (m,k)={\left|Y(m,k)\right|}^{2}/E\left({\left|N(m,k)\right|}^{2}\right).$$ The noise power spectrum is estimated during speech pauses by using the weighted noise estimation method [14]. The a priori SNR can be defined as $$\xi (m,k)=E\left({\left|X(m,k)\right|}^{2}\right)/E\left({\left|N(m,k)\right|}^{2}\right).$$ The instantaneous SNR can be defined as $$\begin{array}{rl}\upsilon (m,k)&={\left|X(m,k)\right|}^{2}/E\left({\left|N(m,k)\right|}^{2}\right)\\ &=\left({\left|Y(m,k)\right|}^{2}-{\left|N(m,k)\right|}^{2}\right)/E\left({\left|N(m,k)\right|}^{2}\right)\\ &=\left[{\left|Y(m,k)\right|}^{2}/E\left({\left|N(m,k)\right|}^{2}\right)\right]-1.\end{array}$$ From the linear combination of the two expressions in (9) and (10), we obtain the new a priori SNR as $$\xi (m,k)=E\left\{\alpha \frac{{\left|X(m,k)\right|}^{2}}{E\left({\left|N(m,k)\right|}^{2}\right)}+(1-\alpha )\,\upsilon (m,k)\right\},$$ with a weighting factor constrained to 0<α<1.
However, as the above expression is hard to implement in practice, approximations were made to determine the new a priori SNR recursively: $$\widehat{\xi}(m,k)=\alpha \frac{{\left|\widehat{X}(m-1,k)\right|}^{2}}{E\left({\left|N(m-1,k)\right|}^{2}\right)}+(1-\alpha )\mathrm{max}\left[\gamma (m,k)-1,0\right],$$ where |X̂(m−1,k)|^{2} and E(|N(m−1,k)|^{2}) are the speech and noise power spectra estimated in the previous analysis frame, respectively.
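The recursive DD estimate can be sketched for a single frequency bin as follows. This is an illustration under stated assumptions rather than the authors' implementation: the noise power is taken as known and constant, the previous clean power is obtained with a gain of the form ξ/(1+ξ), the default `alpha=0.98` is borrowed from the value reported later for the log-mel estimator, and the frame powers are hypothetical.

```python
def dd_a_priori_snr(prev_clean_power: float, prev_noise_power: float,
                    gamma: float, alpha: float = 0.98) -> float:
    """alpha * |X_hat(m-1)|^2 / E|N(m-1)|^2 + (1 - alpha) * max(gamma - 1, 0)."""
    previous_term = prev_clean_power / prev_noise_power
    instantaneous_term = max(gamma - 1.0, 0.0)
    return alpha * previous_term + (1.0 - alpha) * instantaneous_term


def track_a_priori_snr(noisy_powers, noise_power: float, alpha: float = 0.98):
    """Run the DD recursion over a sequence of noisy powers for one bin."""
    xi_track = []
    prev_clean_power = 0.0  # assumption: no speech before the first frame
    for noisy_power in noisy_powers:
        gamma = noisy_power / noise_power  # a posteriori SNR
        xi = dd_a_priori_snr(prev_clean_power, noise_power, gamma, alpha)
        gain = xi / (1.0 + xi)  # assumed gain used to update the clean estimate
        prev_clean_power = gain ** 2 * noisy_power
        xi_track.append(xi)
    return xi_track


# Hypothetical bin: speech starts at the third frame (power jumps from 1 to 10).
xi_track = track_a_priori_snr([1.0, 1.0, 10.0, 10.0, 10.0], noise_power=1.0)
for frame_index, xi in enumerate(xi_track):
    print(frame_index, round(xi, 3))
```

With these inputs the estimate rises only gradually after the onset and stays well below γ − 1 = 9, which illustrates the lag that the DD recursion introduces.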
- 2. Estimation of A Priori SNR in the Log-Mel Domain
Most speech enhancement methods use only the spectral magnitude and disregard the phase entirely [8]. However, the spectral phase component holds speech information and is used in human speech perception, and a few studies have utilized the phase information for speech recognition systems. In this paper, we propose a phase-dependent a priori SNR estimator to remove the background noise effectively and improve the performance of speech enhancement algorithms. We transform the power spectral vectors of noisy speech signals into log-mel spectral vectors so that the phase term becomes non-zero. Then, by estimating the a priori SNR in the log-mel domain and enhancing the desired speech signal, we utilize both the magnitude and phase components in the speech enhancement.
The subtractive rule given in (6) is simple to use. However, the ± sign in (6) is ambiguous in that we have no simple way of knowing which sign to use. To avoid this sign ambiguity, we algebraically derive an alternative subtractive rule by applying the cosine law to the vector diagram [4] shown in Fig. 2.
Fig. 2. Diagram illustrating the trigonometric relationship of the clean, noise, and noisy signals.
$$\begin{array}{l}{X}_{p}={Y}_{p}+{N}_{p}-2\sqrt{{Y}_{p}{N}_{p}}\,{c}_{{Y}_{p}{N}_{p}},\\ {c}_{{Y}_{p}{N}_{p}}=\mathrm{cos}({\theta}_{{Y}_{p}}-{\theta}_{{N}_{p}}).\end{array}$$ We derive the phase term c_{YpNp} at the mth frame and kth frequency bin by making the equation recursive with respect to c_{XpNp}, using the cosine law and the spectrum X̂_{p}(m−1,k) estimated in the previous frame. $$\begin{array}{l}\sqrt{{X}_{p}}=\left(\sqrt{{Y}_{p}}\,{\widehat{c}}_{{Y}_{p}{N}_{p}}-\sqrt{{N}_{p}}\right)/{c}_{{X}_{p}{N}_{p}},\\ {\widehat{c}}_{{Y}_{p}{N}_{p}}(m,k)=\left(\frac{\sqrt{{X}_{p}(m-1,k)}}{\sqrt{{Y}_{p}(m-1,k)}}{c}_{{X}_{p}{N}_{p}}(m-1,k)+\frac{\sqrt{{N}_{p}(m,k)}}{\sqrt{{Y}_{p}(m,k)}}\right).\end{array}$$ Let λ_{log(Np)}(m, k) be the noise power spectrum in the log-mel domain estimated during the speech pauses. The phase-dependent a priori SNR is defined as $$\begin{array}{l}{\xi}_{p}(m,k)=E\left[\mathrm{log}\left({X}_{p}(m,k)\right)\right]/{\lambda}_{\mathrm{log}({N}_{p})}(m,k),\\ {\lambda}_{\mathrm{log}({N}_{p})}(m,k)=E\left[\mathrm{log}\left({N}_{p}(m,k)\right)\right].\end{array}$$ The phase-dependent a posteriori SNR is given by $${\gamma}_{p}(m,k)=E\left[\mathrm{log}\left({Y}_{p}(m,k)\right)\right]/{\lambda}_{\mathrm{log}({N}_{p})}(m,k).$$ In addition, the phase-dependent instantaneous SNR can be defined as $${\upsilon}_{p}(m,k)=\frac{E\left\{\mathrm{log}\left[{Y}_{p}(m,k)+{N}_{p}(m,k)-2{\widehat{c}}_{{Y}_{p}{N}_{p}}\sqrt{{Y}_{p}(m,k){N}_{p}(m,k)}\right]\right\}}{{\lambda}_{\mathrm{log}({N}_{p})}(m,k)}.$$
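The cosine-law subtractive rule and the phase-dependent instantaneous SNR can be checked on a single bin. The sketch below is illustrative only: it treats one bin as a pair of complex vectors with hypothetical magnitudes and phases, uses the true cosine of the phase difference in place of the recursive estimate, and replaces every expectation by its single-frame value.

```python
import cmath
import math


def clean_power_cosine_law(y_p: float, n_p: float, c_yn: float) -> float:
    """X_p = Y_p + N_p - 2 sqrt(Y_p N_p) c_YN, with no sign ambiguity."""
    return y_p + n_p - 2.0 * math.sqrt(y_p * n_p) * c_yn


def phase_dependent_instantaneous_snr(y_p: float, n_p: float, c_yn: float,
                                      noise_log_power: float) -> float:
    """log(Y_p + N_p - 2 c_YN sqrt(Y_p N_p)) divided by the log-domain noise power."""
    return math.log(clean_power_cosine_law(y_p, n_p, c_yn)) / noise_log_power


# Hypothetical clean and noise vectors for one bin (magnitude, phase in radians).
clean = cmath.rect(20.0, 0.3)
noise = cmath.rect(5.0, 1.2)
noisy = clean + noise

y_p = abs(noisy) ** 2
n_p = abs(noise) ** 2
c_yn = math.cos(cmath.phase(noisy) - cmath.phase(noise))

x_p = clean_power_cosine_law(y_p, n_p, c_yn)
upsilon_p = phase_dependent_instantaneous_snr(y_p, n_p, c_yn, math.log(n_p))

print(f"recovered X_p = {x_p:.6f}")
print(f"upsilon_p     = {upsilon_p:.6f}")
```

Because the rule follows directly from the triangle formed by the three vectors, the recovered X_{p} matches the squared clean magnitude without choosing a sign.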
- 3. DD Approach in the Log-Mel Domain
To determine the a priori SNR from a noisy signal, we use the DD approach, where the a priori SNR is computed as a linear combination of (15) and (17) [4]. $$\begin{array}{l}{\xi}_{p}(m,k)=E\left[\frac{1}{2}\frac{\mathrm{log}\left({X}_{p}(m,k)\right)}{{\lambda}_{\mathrm{log}({N}_{p})}(m,k)}\right]+\frac{1}{2}\overline{\xi}(m,k),\\ \overline{\xi}(m,k)=\frac{\mathrm{log}\left[{Y}_{p}(m,k)+{N}_{p}(m,k)-2{\widehat{c}}_{{Y}_{p}{N}_{p}}\sqrt{{Y}_{p}(m,k){N}_{p}(m,k)}\right]}{{\lambda}_{\mathrm{log}({N}_{p})}(m,k)}.\end{array}$$ The final estimator is derived by making the preceding equation recursive: $${\widehat{\xi}}_{p}(m,k)=\alpha \frac{\mathrm{log}\left[{\widehat{X}}_{p}(m-1,k)\right]}{{\lambda}_{\mathrm{log}({N}_{p})}(m-1,k)}+(1-\alpha )\mathrm{max}\left[\overline{\xi}(m,k),0\right],$$ where 0<α<1 is the weighting factor and X̂_{p}(m−1,k) is the magnitude estimator obtained in the previous analysis frame. In this paper, we chose α = 0.98.
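The recursive log-mel estimator can be written as two small functions. This is a hedged sketch for one frame and one mel band: all input values are hypothetical, expectations are replaced by single-frame values, and the `floor` argument that keeps the logarithm defined is an added safeguard that the paper does not describe.

```python
import math


def xi_bar(y_p: float, n_p: float, c_yn_hat: float,
           noise_log_power: float, floor: float = 1e-12) -> float:
    """Instantaneous term: log(Y_p + N_p - 2 c sqrt(Y_p N_p)) / lambda."""
    argument = y_p + n_p - 2.0 * c_yn_hat * math.sqrt(y_p * n_p)
    return math.log(max(argument, floor)) / noise_log_power


def log_mel_dd_a_priori_snr(prev_clean_power: float, prev_noise_log_power: float,
                            xi_bar_now: float, alpha: float = 0.98) -> float:
    """alpha * log(X_hat_p(m-1)) / lambda(m-1) + (1 - alpha) * max(xi_bar, 0)."""
    previous_term = math.log(prev_clean_power) / prev_noise_log_power
    return alpha * previous_term + (1.0 - alpha) * max(xi_bar_now, 0.0)


# Hypothetical values for one mel band.
noise_log_power = math.log(25.0)
xi_bar_now = xi_bar(y_p=500.0, n_p=25.0, c_yn_hat=0.5,
                    noise_log_power=noise_log_power)
xi_hat_p = log_mel_dd_a_priori_snr(prev_clean_power=400.0,
                                   prev_noise_log_power=noise_log_power,
                                   xi_bar_now=xi_bar_now)
print(f"xi_bar   = {xi_bar_now:.4f}")
print(f"xi_hat_p = {xi_hat_p:.4f}")
```

With α = 0.98 the result stays close to the previous-frame term, so a sudden change in the instantaneous term is reflected only gradually.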
- 4. Two-Step A Priori SNR Estimation
In the conventional a priori SNR determination system with the DD approach, the estimated a priori SNR follows the a posteriori SNR with a one-frame delay. This delay is due to the use of the speech spectrum estimated in the previous frame to compute the current a priori SNR; therefore, it degrades the speech enhancement performance.
We propose a two-step phase-dependent a priori SNR estimator based on the MMSE and the MAP in the log-mel spectral domain to overcome the performance degradation caused by the one-frame delay.
A. MMSE-Based A Priori SNR Estimator
The MMSE estimator for the power spectral density X^{2} can be given by the conditional expectation $$\begin{array}{rl}{\widehat{X}}^{2}&=E({X}^{2}|Y)\\ &=\frac{{\int}_{-\infty}^{\infty}{X}^{2}P(Y|X)P(X)dX}{{\int}_{-\infty}^{\infty}P(Y|X)P(X)dX}\end{array}$$ and redefined from (20) as [6] $${\widehat{X}}^{2}={\left[\frac{E({\left|X\right|}^{2})}{E({\left|X\right|}^{2})+E({\left|N\right|}^{2})}\right]}^{2}{\left|Y\right|}^{2}+\frac{E({\left|X\right|}^{2})E({\left|N\right|}^{2})}{E({\left|X\right|}^{2})+E({\left|N\right|}^{2})}.$$ Using (8) and (9) in (21), the MMSE-based a priori SNR estimation is given by $$\begin{array}{rl}{\xi}_{\text{MMSE}}&={\widehat{X}}^{2}/E(|N{|}^{2})\\ &=\left\{\xi /(1+\xi )\right\}\left\{1+\left[\xi /(1+\xi )\right]\gamma \right\}.\end{array}$$ The first step of the phase-dependent a priori SNR estimation is the DD approach in the log-mel domain, whereas the second step, (22), is used to refine the estimated a priori SNR of the DD approach. Thus, the refined phase-dependent a priori SNR using the MMSE estimation is given by $$\begin{array}{l}{\tilde{\xi}}_{\text{MMSE}}=\left\{{\widehat{\xi}}_{p}/(1+{\widehat{\xi}}_{p})\right\}\left\{1+\left[{\widehat{\xi}}_{p}/(1+{\widehat{\xi}}_{p})\right]{\gamma}_{p}\right\},\\ {\gamma}_{p}=|{Y}_{p}{|}^{2}/E\left(|{N}_{p}{|}^{2}\right).\end{array}$$
B. MAP-Based A Priori SNR Estimator
The MAP-based estimator for the speech amplitude X is obtained by maximizing the posterior density of X given the noisy speech amplitude Y as follows: $$\begin{array}{rl}\widehat{X}&=\mathrm{arg}\,\underset{X}{\mathrm{max}}\,p(X|Y)\\ &=\mathrm{arg}\,\underset{X}{\mathrm{max}}\,\frac{p(Y|X)p(X)}{p(Y)}.\end{array}$$ The Rician pdf p(Y | X) is given by $$p\left(Y|X\right)=\frac{2Y}{E({\left|N\right|}^{2})}{\mathrm{e}}^{\left[-\frac{{X}^{2}+{Y}^{2}}{E\left({\left|N\right|}^{2}\right)}\right]}{\mathrm{I}}_{0}\left[\frac{2YX}{E({\left|N\right|}^{2})}\right],$$ where I_{0} denotes the modified Bessel function of order zero.
The pdf of the noisy spectrum Y conditioned on the speech amplitude and phase can be written as $$p(Y|X,\theta )=\frac{1}{\pi E({\left|N\right|}^{2})}{\mathrm{e}}^{\left[-\frac{{\left|Y-X{\mathrm{e}}^{j\theta}\right|}^{2}}{E({\left|N\right|}^{2})}\right]}.$$ The MAP-based a priori SNR estimation is obtained by maximizing p(Y | X)p(X) and is given by $$\begin{array}{rl}{\xi}_{\text{MAP}}&={\widehat{X}}^{2}/E(|N{|}^{2})\\ &=\left[\xi /(1+\xi )\right]\left[(1/4)+H\gamma \right].\end{array}$$ The refined phase-dependent a priori SNR using the MAP estimation is given by $${\tilde{\xi}}_{\text{MAP}}=\left[{\widehat{\xi}}_{p}/(1+{\widehat{\xi}}_{p})\right]\left[(1/4)+H{\gamma}_{p}\right],$$ where H is the noise suppression gain (de-noising filter).
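The two refinement formulas reduce to a few lines of arithmetic. The sketch below is illustrative: the function names are hypothetical, and when no gain is passed to the MAP variant it assumes H = ξ/(1+ξ), which is the form of the gain function the paper gives for speech reconstruction.

```python
from typing import Optional


def refine_mmse(xi_hat: float, gamma: float) -> float:
    """Second-step MMSE refinement: g * (1 + g * gamma), with g = xi / (1 + xi)."""
    g = xi_hat / (1.0 + xi_hat)
    return g * (1.0 + g * gamma)


def refine_map(xi_hat: float, gamma: float, gain: Optional[float] = None) -> float:
    """Second-step MAP refinement: [xi / (1 + xi)] * (1/4 + H * gamma)."""
    g = xi_hat / (1.0 + xi_hat)
    if gain is None:
        gain = g  # assumption: H takes the form xi / (1 + xi)
    return g * (0.25 + gain * gamma)


# Hypothetical first-step estimate and a posteriori SNR for one bin.
xi_first_step = 1.0
gamma_now = 4.0

xi_mmse = refine_mmse(xi_first_step, gamma_now)
xi_map = refine_map(xi_first_step, gamma_now)
print(xi_mmse, xi_map)
```

Both refinements depend on the a posteriori SNR of the current frame, which is how the second step counteracts the one-frame lag of the first-step estimate.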
- 5. Speech Reconstruction
The multiplicative gain function (de-noising filter) in the DD approach is a function of the a priori SNR given in [4].$$H(m,k)=\widehat{\xi}(m,k)/\left[1+\widehat{\xi}(m,k)\right].$$The enhanced speech spectrum is then obtained as follows:$$\widehat{X}(m,k)=H(m,k)\tilde{Y}(m,k).$$After translating the enhanced log-mel spectrum into the power spectral domain again, we finally obtain the entire enhanced speech signal by taking the inverse discrete Fourier transform (DFT) and applying the overlap-add method [15]. Figure 3 shows an example of the variations of the a priori SNR before and after refinement of the phase-dependent a priori SNR estimated using the MMSE-based approach. We can confirm that the proposed two-step phase-dependent a priori SNR estimation approach solves the delay problem.
Fig. 3. Variations of the a priori SNR before and after refinement of the phase-dependent a priori SNR.
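The reconstruction step described above, applying the gain per bin and then overlap-adding the time-domain frames, can be sketched as follows. This is a simplified illustration with hypothetical inputs: the inverse DFT, the window, and the conversion from the log-mel domain back to the power spectral domain are omitted, and the helper names are not from the paper.

```python
def apply_gain(noisy_spectrum, xi_hat):
    """Enhanced spectrum per bin: X_hat = H * Y with H = xi / (1 + xi)."""
    return [xi / (1.0 + xi) * y for y, xi in zip(noisy_spectrum, xi_hat)]


def overlap_add(frames, hop: int):
    """Sum equal-length time-domain frames whose starts are 'hop' samples apart."""
    frame_length = len(frames[0])
    output = [0.0] * (hop * (len(frames) - 1) + frame_length)
    for frame_index, frame in enumerate(frames):
        start = frame_index * hop
        for offset, sample in enumerate(frame):
            output[start + offset] += sample
    return output


# Hypothetical values: two bins, then two already-inverted frames of length 4.
enhanced_bins = apply_gain([2.0, 4.0], [1.0, 3.0])
signal = overlap_add([[1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 1.0, 1.0]], hop=2)
print(enhanced_bins)
print(signal)
```

In a full system the frames passed to `overlap_add` would be the inverse DFTs of the enhanced spectra, windowed so that the overlapping regions sum correctly.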
IV. Experimental Results and Discussion
- 1. Speech Database
A speech database was selected from the Interspeech 2006 Speech Separation Challenge [16], which was drawn from the GRID speech database consisting of six-word sentences with a vocabulary size of 52. The database was recorded by 34 different speakers and was sampled at 25 kHz. For the speech enhancement experiments, three types of noise were used: car (N1), babble (N2), and white Gaussian (N3). The speech signal and noise sources were mixed by digital scaling and addition to create three sets of noisy speech signal recordings. The mixed speech database was created at multiple SNRs: −10 dB, −5 dB, 0 dB, 5 dB, and 10 dB.
For all experiments reported in this paper, the sampling rate of the speech database was reduced from 25 kHz to 16 kHz. The sampling rate of the noise was originally 16 kHz. All speech signals were normalized to have zero mean and unit variance, and were divided into frames of 32 ms in length with an overlap of 16 ms between adjacent frames. For each Hamming-windowed frame, a power spectral (magnitude spectrum) vector of 257 components was derived from a 512-point DFT analysis, and the power spectral vector was then transformed into a log-mel spectral vector. The number of dimensions of the log-mel spectral vector was chosen to be 128.
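The data preparation above involves two pieces of arithmetic: scaling the noise to reach a target SNR, and converting the frame settings into sample counts. The sketch below is an assumption-laden illustration: the paper does not say which signal was scaled, so the noise is scaled here, and the function names and toy signals are hypothetical.

```python
import math


def mix_at_snr(speech, noise, snr_db: float):
    """Scale the noise so that speech power / noise power matches snr_db, then add."""
    speech_power = sum(s * s for s in speech) / len(speech)
    noise_power = sum(n * n for n in noise) / len(noise)
    scale = math.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [s + scale * n for s, n in zip(speech, noise)]


def frame_parameters(sample_rate_hz: int, frame_ms: int, overlap_ms: int, n_fft: int):
    """Return (frame length, hop length, number of one-sided DFT bins) in samples/bins."""
    frame_length = sample_rate_hz * frame_ms // 1000
    hop_length = sample_rate_hz * (frame_ms - overlap_ms) // 1000
    n_bins = n_fft // 2 + 1
    return frame_length, hop_length, n_bins


# Toy signals (hypothetical) mixed at 0 dB.
mixed = mix_at_snr([1.0, -1.0, 1.0, -1.0], [2.0, 2.0, -2.0, -2.0], snr_db=0.0)

# Settings reported in the text: 16 kHz, 32 ms frames, 16 ms overlap, 512-point DFT.
frame_length, hop_length, n_bins = frame_parameters(16000, 32, 16, 512)
print(mixed)
print(frame_length, hop_length, n_bins)
```

The computed bin count agrees with the 257-component power spectral vector stated in the text.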
- 2. Results of Speech Enhancement
To validate the performance of the speech enhancement from a noisy signal, we compared the waveforms and spectrograms of the original speech signal, the noisy signal, and the speech signal enhanced using the proposed method. The output SNR was computed to evaluate the performance quantitatively. The conventional magnitude-based estimator (baseline) [3] was used as a reference for the performance comparison.
A. Waveform and Spectrogram
Figures 4 and 5 show examples of the waveforms and spectrograms of the original signal, the noisy speech signal, and the enhanced speech signals, where the N3 noise was added to make the noisy signal at a 0 dB SNR. The MMSE-based and MAP-based a priori SNR estimators were used to refine the phase-dependent a priori SNR.
Fig. 4. Speech waveforms when N3 noise is added at 0 dB SNR: (a) original speech signal, (b) noisy signal, (c) enhanced signal using the conventional DD method, (d) enhanced signal using the proposed MMSE-based method, and (e) enhanced signal using the proposed MAP-based method.
Fig. 5. Spectrograms when N3 noise is added at 0 dB SNR: (a) original speech signal, (b) noisy signal, (c) enhanced signal using the conventional DD method, (d) enhanced signal using the proposed MMSE-based method, and (e) enhanced signal using the proposed MAP-based method.
The DD approach was used in both the proposed method and the conventional DD magnitude-based method (baseline method). Comparing the waveforms and spectrograms, we confirmed that the noise was suppressed remarkably to yield an enhanced speech signal. In addition, in the listening tests using re-synthesized speech, we could hardly hear any background noise.
B. Output SNR
We also calculated the output SNR of the enhanced speech signals, which is defined as the ratio of the power of the original clean speech signal to the power of the error signal between the original signal x(t) and the enhanced signal. It can be computed as follows: $$\mathrm{SNR}=10{\mathrm{log}}_{10}\left[\frac{|X{|}^{2}}{{\left(\left|X\right|-\left|\widehat{X}\right|\right)}^{2}}\right],$$ where |X| represents the magnitude spectrum of the clean speech signal and |X̂| represents the magnitude spectrum of the reconstructed signal.
Table 1 gives the average SNR with respect to the speech signals under the N1 noise condition using both the conventional DD magnitude-based method and the proposed phase-dependent a priori SNR estimator. Tables 2 and 3 show the average SNRs for the N2 and N3 noise conditions, respectively.
When the MMSE-based a priori estimator is used to refine the estimated a priori SNR, the proposed phase-dependent speech enhancement algorithm improves the output SNR by 3.1 dB, 2.0 dB, and 2.8 dB for the N1, N2, and N3 noise conditions, respectively, on average, compared to the conventional DD magnitude-based estimator (baseline method) [3], as shown in Tables 1 through 3. Overall, the proposed method improves the SNR by 2.6 dB on average for the N1, N2, and N3 noise conditions compared with the baseline.
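The output SNR measure defined above can be computed from two magnitude spectra. This is a minimal sketch: the paper does not state how bins and frames are pooled, so the code assumes that the signal and error energies are summed over all supplied bins before taking the ratio, and the example spectra are hypothetical.

```python
import math


def output_snr_db(clean_magnitudes, enhanced_magnitudes) -> float:
    """10 log10( sum |X|^2 / sum (|X| - |X_hat|)^2 ) over the supplied bins."""
    signal_energy = sum(x * x for x in clean_magnitudes)
    error_energy = sum(
        (x - x_hat) ** 2 for x, x_hat in zip(clean_magnitudes, enhanced_magnitudes)
    )
    if error_energy == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal_energy / error_energy)


# Hypothetical two-bin spectra.
snr_example = output_snr_db([3.0, 4.0], [3.0, 3.0])
print(f"{snr_example:.2f} dB")
```

Because only magnitudes enter the measure, phase errors in the reconstructed signal do not affect it.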
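Table 3. Comparison of output SNRs (dB) for the N3 (white Gaussian) noise.
| Input SNR (dB) | Baseline method | Proposed method: Step 1 (before refinement) | Proposed method: Step 2 (MMSE) | Proposed method: Step 2 (MAP) |
| --- | --- | --- | --- | --- |
| −10 | 1.2 | 3.5 | 3.9 | 3.8 |
| −5 | 1.7 | 4.1 | 4.5 | 4.3 |
| 0 | 2.7 | 5.3 | 5.7 | 5.6 |
| 5 | 4.0 | 6.8 | 7.0 | 7.1 |
| 10 | 5.7 | 8.0 | 8.2 | 8.1 |
| Average | 3.1 | 5.5 | 5.9 | 5.8 |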
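The averages and improvements quoted for the N3 condition can be reproduced from the table entries. The sketch below simply re-does that arithmetic with the values listed for the N3 noise; the variable names are hypothetical.

```python
# Output SNRs (dB) for the N3 noise, keyed by input SNR (dB):
# (baseline, step 1 before refinement, step 2 MMSE, step 2 MAP)
n3_output_snr = {
    -10: (1.2, 3.5, 3.9, 3.8),
    -5: (1.7, 4.1, 4.5, 4.3),
    0: (2.7, 5.3, 5.7, 5.6),
    5: (4.0, 6.8, 7.0, 7.1),
    10: (5.7, 8.0, 8.2, 8.1),
}

rows = list(n3_output_snr.values())
column_means = [sum(row[i] for row in rows) / len(rows) for i in range(4)]
rounded_means = [round(mean, 1) for mean in column_means]

baseline_mean = column_means[0]
mmse_improvement = round(column_means[2] - baseline_mean, 1)
map_improvement = round(column_means[3] - baseline_mean, 1)

print("column averages:", rounded_means)
print("MMSE improvement over baseline (dB):", mmse_improvement)
print("MAP improvement over baseline (dB):", map_improvement)
```

The rounded column averages match the Average row, and the two differences match the 2.8 dB and 2.7 dB improvements stated for N3.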
When the MAP-based a priori SNR estimator is used, the proposed algorithm improves the output SNR by 3.1 dB, 1.9 dB, and 2.7 dB for the N1, N2, and N3 noise conditions, respectively, on average, compared to the baseline. The MMSE-based a priori SNR estimator yields a slightly better output SNR improvement than the MAP-based estimator, and it could be confirmed that both estimators solve the delay problem.
Overall, the proposed method improves the SNR by 2.6 dB on average for all noise conditions. In all cases, the proposed algorithm was better than the baseline in the output SNR measurements, and it performed well under stationary noise conditions such as N3. These results show that the proposed method significantly improves the objective quality measures by utilizing the phase information.
Table 4 provides the average SNR and the Perceptual Evaluation of Speech Quality (PESQ) score for enhanced speech using mel spectrum dimensions of 23, 32, 64, and 128 at 0 dB SNR, where the MMSE-based method is used under the white Gaussian (N3) noise condition. Table 5 shows the average SNR and PESQ for an enhanced speech signal using the MAP-based method to refine the estimated a priori SNR. The enhanced speech signal is obtained by extracting the speech feature in the log-mel domain, enhancing the speech, transforming the enhanced log-mel spectrum into the power spectral domain again, and reconstructing the speech signal. When relatively low mel spectrum dimensions are applied, as in the 23 and 32 cases, the signals are lumped together while being resynthesized, which degrades the performance of the proposed speech enhancement algorithm. In this work, we chose 128 mel spectrum dimensions, which improves the performance significantly and yields the waveform most similar to the original speech signal after speech enhancement.
Table 5. Average output SNR (dB) and PESQ of enhanced speech signal according to the mel-spectrum dimension (MAP, N3, 0 dB).
| Mel-spectrum dimension | Proposed method: Output SNR (dB) | Proposed method: PESQ |
| --- | --- | --- |
| 23 | 2.3 | 1.61 |
| 32 | 4.1 | 2.51 |
| 64 | 7.6 | 2.75 |
| 128 | 7.8 | 2.89 |
Figure 6 shows the average PESQ with respect to the speech signals under the N3 noise condition using the conventional DD method and the proposed phase-dependent a priori SNR estimators. Overall, the proposed methods improve the PESQ on average. In addition, the proposed method with the MMSE-based a priori SNR estimator improves the PESQ by 0.3 for –10 dB and –5 dB, 0.4 for 0 dB and 5 dB, and 0.1 for 10 dB, on average, compared with the conventional DD approach.
Using the MAP-based estimator, the proposed method improves the PESQ by 0.2 for –10 dB, 0.3 for –5 dB and 0 dB, 0.4 for 5 dB, and 0.1 for 10 dB, on average, compared with the baseline. As human ears are relatively less sensitive to the phase component of a speech signal than to its magnitude component, the speech enhancement performance does not increase as much.
V. Conclusion
We proposed a new single-channel speech enhancement method that estimates the phase-dependent a priori SNR in the log-mel spectral domain by considering both the magnitude and phase components of the speech signal. In the proposed method, the phase term is no longer assumed to be zero because the power spectral vectors of noisy signals are nonlinearly transformed into log-mel spectral vectors. The new phase-dependent a priori SNR is recursively updated by adopting the DD approach. The estimated phase-dependent a priori SNR is then refined to solve the delay problem while maintaining the advantages of the DD approach. In this paper, MMSE-based and MAP-based a priori SNR estimators are used to refine the estimated a priori SNR of the DD approach.
By providing the waveforms and spectrograms of the enhanced speech signal, we showed that the enhanced signal was close to the original signal. In the listening tests, we could hardly hear any residual noise from the enhanced signal. In the quantitative evaluation tests, the proposed method with the MMSE-based a priori SNR estimator was shown to improve the output SNR by 3.1 dB, 2.0 dB, and 2.8 dB for car, babble, and white Gaussian noise, respectively. In addition, when the MAP-based a priori SNR estimator is used, the proposed method improves the output SNR by 3.1 dB, 1.9 dB, and 2.7 dB for the same noise conditions.
The experimental results confirmed that the phase information is useful and can be used together with the magnitude information for improved speech enhancement systems.
This work was supported by the ICT R&D program of MSIP/IITP (10035252, Development of dialog-based spontaneous speech interface technology on mobile platform) and the research grant of the Chungbuk National University in 2011.
BIO
Yun-Kyung Lee (yunklee@etri.re.kr) received her BS degree in electronics engineering and MS degree in control and instrumentation engineering from Chungbuk National University (CBNU), Cheongju, Rep. of Korea, in 2007 and 2009, respectively. She received her PhD degree in control and robot engineering at CBNU, in 2013. Currently, she is in charge of the Spoken Language Processing Research Section, Electronics and Telecommunications Research Institute, Daejeon, Rep. of Korea. Her research interests are speech processing and automatic speech recognition technology.
Jeon Gue Park (jgp@etri.re.kr) received his PhD degree in information and communication engineering from Pai Chai University, Daejeon, Rep. of Korea, in 2010. Currently, he is in charge of the Spoken Language Processing Research Section, Electronics and Telecommunications Research Institute, Daejeon, Rep. of Korea. His current research interests include automatic speech recognition technology and dialogue systems.
Yun Keun Lee (yklee@etri.re.kr) received his BS and MS degrees in electronic engineering from Seoul National University, Seoul, Rep. of Korea and Korea Advanced Institute of Science and Technology (KAIST), Seoul, Rep. of Korea, in 1986 and 1988, respectively. He received his PhD degree in information and communication engineering from KAIST, in 1998. Currently, he is in charge of the Automatic Translation & Artificial Intelligence Research Center, ETRI, Daejeon, Rep. of Korea.
Oh-Wook Kwon (corresponding author, owkwon@cbnu.ac.kr) received his BS degree in electronics engineering from Seoul National University, Seoul, Rep. of Korea, in 1986 and his MS and PhD degrees in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Rep. of Korea, in 1988 and 1997, respectively. From 1988, he was with the Electronics and Telecommunications Research Institute, Daejeon, Rep. of Korea. In 2000, he joined the Brain Science Research Center, KAIST, as a research professor. From 2001 to 2003, he worked for the Institute for Neural Computation, University of California, San Diego, CA, USA, as a postgraduate researcher. Since 2003, he has been a professor at Chungbuk National University, Cheongju, Rep. of Korea. His research interests include speech recognition; speech and audio signal processing; and pattern recognition.
References
Song H.J., Lee Y.K., Kim H.S., “Probabilistic Bilinear Transformation Space-Based Joint Maximum A Posteriori Adaptation,” ETRI J., vol. 34, no. 5, 2012, pp. 783–786. DOI: 10.4218/etrij.12.0212.0054
Ephraim Y., Malah D., “Speech Enhancement Using a Minimum-Mean Square Error Short-Time Spectral Amplitude Estimator,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 6, 1984, pp. 1109–1121. DOI: 10.1109/TASSP.1984.1164453
Alam M.J., O’Shaughnessy D., Selouani S.-A., “Speech Enhancement Based on Novel Two-Step A Priori SNR Estimators,” Proc. INTERSPEECH, Brisbane, Australia, 2008, pp. 565–568.
Wang D.L., Lim J.S., “The Unimportance of Phase in Speech Enhancement,” IEEE Trans. Acoust., Speech, Signal Process., vol. 30, no. 4, 1982, pp. 679–681. DOI: 10.1109/TASSP.1982.1163920
Faubel F., McDonough J., Klakow D., “A Phase-Averaged Model for the Relationship between Noisy Speech, Clean Speech, and Noise in the Log-Mel Domain,” Proc. INTERSPEECH, Brisbane, Australia, 2008, pp. 553–556.
Deng L., Droppo J., Acero A., “Enhancement of Log Mel Power Spectra of Speech Using a Phase-Sensitive Model of the Acoustic Environment and Sequential Estimation of the Corrupting Noise,” IEEE Trans. Speech Audio Process., vol. 12, no. 2, 2004, pp. 133–143. DOI: 10.1109/TSA.2003.820201
Lee Y.-K., Kwon O.-W., “A Phase-Dependent A Priori SNR Estimator in the Log-Mel Spectral Domain for Speech Enhancement,” IEEE Int. Conf. Consum. Electron., Las Vegas, NV, USA, Jan. 9–12, 2011, pp. 413–414. DOI: 10.1109/ICCE.2011.5722657
Andrassy B., Vlaj D., Beaugeant C., “Recognition Performance of the Siemens Front-End with and without Frame Dropping on the Aurora 2 Database,” Proc. European Conf. Speech Commun. Technol., vol. 1, 2001, pp. 193–196.
Sigurdsson S., Petersen K.B., Lehn-Schiøler T., “Mel Frequency Cepstral Coefficients: An Evaluation of Robustness of MP3 Encoded Music,” Proc. Int. Conf. Music Inf. Retrieval, Victoria, Canada, 2006.
Kato M., Sugiyama A., Serizawa M., “Noise Suppression with High Speech Quality Based on Weighted Noise Estimation and MMSE STSA,” IEICE Trans. Fundam., vol. E85-A, no. 7, 2002, pp. 1710–1718.
Cooke M.P., “An Audio-Visual Corpus for Speech Perception and Automatic Speech Recognition,” J. Acoust. Soc. America, vol. 120, no. 5, 2006, pp. 2421–2424. DOI: 10.1121/1.2229005
Citing 'Speech Enhancement Using Phase-Dependent A Priori SNR Estimator in Log-Mel Spectral Domain'
@article{HJTODO_2014_v36n5_721,
  title={Speech Enhancement Using Phase-Dependent A Priori SNR Estimator in Log-Mel Spectral Domain},
  author={Lee, Yun-Kyung and Park, Jeon Gue and Lee, Yun Keun and Kwon, Oh-Wook},
  journal={ETRI Journal},
  volume={36},
  number={5},
  year={2014},
  month={Aug},
  publisher={Electronics and Telecommunications Research Institute},
  DOI={10.4218/etrij.14.2214.0039},
  url={http://dx.doi.org/10.4218/etrij.14.2214.0039}
}