- Research
- Open Access
On pre-image iterations for speech enhancement
- Christina Leitner^{1} and
- Franz Pernkopf^{2}
- Received: 2 December 2014
- Accepted: 17 April 2015
- Published: 4 June 2015
Abstract
In this paper, we apply kernel PCA for speech enhancement and derive pre-image iterations from it. Both methods make use of a Gaussian kernel. The kernel variance serves as a tuning parameter that has to be adapted according to the SNR and the desired degree of de-noising. We develop a method to derive a suitable value for the kernel variance from a noise estimate, which adapts pre-image iterations to arbitrary SNRs. In experiments, we compare the performance of kernel PCA and pre-image iterations in terms of objective speech quality measures and automatic speech recognition. The speech data is corrupted by white and colored noise at 0, 5, 10, and 15 dB SNR. As a benchmark, we provide results of the generalized subspace method, of spectral subtraction, and of the minimum mean-square error log-spectral amplitude estimator. In terms of the scores of the PEASS (Perceptual Evaluation Methods for Audio Source Separation) toolbox, the proposed methods achieve performance similar to the reference methods. The speech recognition experiments show that utterances processed by pre-image iterations achieve consistently better word recognition accuracy than both the unprocessed noisy utterances and the utterances processed by the generalized subspace method.
Keywords
- Speech enhancement
- Speech de-noising
- Kernel PCA
- Automatic speech recognition
Introduction
Speech enhancement is important in the field of speech communications and speech recognition. Many methods have been proposed in the literature (Loizou 2007). Spectral subtractive algorithms were among the first and are probably the simplest (Berouti et al. 1979; Boll 1979). They are based on the assumption that speech and noise are additive and thus the noisy speech signal can be enhanced by subtracting a noise estimate. Usually, this is done in the frequency domain using the magnitude of the short-time Fourier transform (STFT). For the inverse transformation, the phase of the noisy signal is reused. Statistical model-based methods provide a framework to find estimates of, e.g., the spectrum or magnitude spectrum of clean speech given the noisy speech spectrum (Ephraim and Malah 1984, 1985; McAulay and Malpass 1980). Subspace methods are based on the assumption that the clean signal only covers a subspace of the Euclidean space in which the noisy speech signal exists (Ephraim and Van Trees 1995; Hu and Loizou 2003). Enhancement is performed by separating the noise subspace from the clean speech plus noise subspace and setting the components in the noise subspace to zero. Most speech enhancement algorithms make use of a noise estimate, and their performance therefore heavily depends on the quality of that estimate. Poor noise estimates may lead to artifacts such as isolated peaks in the spectrum, which are perceived as tones of varying pitch and are known as musical noise (Berouti et al. 1979).
Subspace methods make use of principal component analysis (PCA) (Ephraim and Van Trees 1995; Hu and Loizou 2003), which is a linear technique. We therefore investigate whether the quality of speech enhancement can be increased by applying a non-linear technique. This leads to the application of kernel methods, which provide a simple way to turn linear methods into non-linear ones. Kernel methods transform data samples by mapping them from the input space to the so-called feature space. The non-linear extension of PCA is kernel PCA, which has already been successfully applied in image de-noising (Mika et al. 1999). In (Leitner et al. 2011), we proposed the use of kernel PCA for speech enhancement. Similar to image processing, we apply kernel PCA on patches extracted from the time-frequency representation of speech utterances.
For subspace methods, the number of principal components used for projection is a key parameter for the degree of de-noising. In our framework based on kernel PCA, we empirically observed (see results in Section 6) that the number of used components has almost no influence. We therefore ignore the projection step and only perform the reconstruction step necessary to determine the sample in input space corresponding to the de-noised sample in feature space. We call this pre-image iterations (PI) for speech enhancement, as the reconstructed sample in input space is called pre-image.
Besides their relation to subspace methods, PI exhibit a similarity to non-local neighborhood filtering (NF) applied for image de-noising (Buades et al. 2005; Singer et al. 2009). While other de-noising algorithms often compute the value of the de-noised pixel solely based on the value of its surrounding pixels, non-local neighborhood filters average over pixels that are located all over the image but have a similar neighborhood. This approach is favorable if images contain repetitive patterns such as textures. Although quite popular for image de-noising, NF has only recently gained attention in the field of speech enhancement. In (Talmon et al. 2011), NF is applied to suppress transient noise bursts. In contrast to our application of PI, NF is not directly applied for de-noising but to gain a noise estimate of the transients that is subsequently used for noise suppression.
In this paper, we compare the performance of kernel PCA and PI for speech enhancement. The variance of the kernel used for the pre-image computation is a tuning parameter that influences the degree of de-noising. Therefore, it has to be adapted according to the SNR. We develop a heuristic method to derive the kernel variance from a noise estimate. This way, PI adapt to different SNRs. Furthermore, an approach for colored noise is developed where the kernel variance is frequency-dependent. The performance of the proposed methods is evaluated in terms of objective speech quality measures and automatic speech recognition results. As objective measures, we employ the perceptual evaluation of speech quality (PESQ) measure (ITU-T 2001) and the scores of the perceptual evaluation of audio source separation (PEASS) toolbox (Emiya et al. 2011). Furthermore we use an automatic speech recognition (ASR) system to measure the performance of noise contaminated and subsequently enhanced data. Note, that the focus here is on evaluating the effects of the enhancement methods and not on optimizing the recognition results per se. Therefore, the speech recognizer is not adapted to the enhanced data.
Experiments are performed on noise-corrupted speech from two databases, the airbone database and the Noizeus database. The utterances are contaminated by additive white Gaussian noise (AWGN) and car noise, respectively, at 0, 5, 10, and 15 dB SNR. As reference, performance results of the generalized subspace method (Hu and Loizou 2003), of spectral subtraction (Berouti et al. 1979), and of the minimum mean-square error (MMSE) log-spectral amplitude estimator (Ephraim and Malah 1985) are provided. In terms of PEASS scores, the proposed methods achieve performance similar to the reference methods. In terms of word accuracy (WAcc), the utterances enhanced by PI show a significantly higher WAcc than the noisy utterances and the utterances processed by the generalized subspace method.
The paper is organized as follows: In Section 2, we summarize kernel PCA. In Section 3, we describe the application of kernel PCA for speech enhancement. In Section 4, we derive and analyze pre-image iterations and show commonalities to related methods in image and speech processing. In Section 5, we provide implementation details, introduce the used databases, evaluation measures, and the applied speech recognition system. In Section 6, the results are discussed. Section 7 concludes the paper and gives a perspective on future work.
Kernel PCA
An important property of kernel methods is that the mapping Φ(x) is usually not computed explicitly but only kernels between input samples are evaluated.
Kernel PCA is derived from PCA, which is a widely used technique for dimensionality reduction, lossy data compression, feature extraction, and data visualization. PCA is an orthogonal transformation of the coordinate system of the input data, i.e., the data is projected onto so-called principal axes. The new coordinates are called principal components. Often the structure in data can be described with sufficient accuracy while using only a small number of principal components. For de-noising, components with low variance are dropped as they are assumed to originate from noise (Mika et al. 1999; Schölkopf and Smola 2002; Schölkopf et al. 1996).
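The de-noising step of linear PCA can be sketched in a few lines: the principal axes are obtained here from an SVD of the centered data, and only the `n_comp` leading components are kept while the low-variance directions are discarded as noise. Function and variable names are ours.

```python
import numpy as np

def pca_denoise(X, n_comp):
    """Project the rows of X onto the leading principal axes and
    reconstruct; low-variance directions are treated as noise."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # right singular vectors of the centered data = principal axes
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_comp]                    # leading principal axes
    return (Xc @ P.T) @ P + mu         # project, reconstruct, undo centering
```

For data concentrated near a low-dimensional subspace plus isotropic noise, keeping the leading components removes the noise energy in the discarded directions.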
The scalar \({\mathbf {x}_{i}^{T}} \mathbf {u}_{l}\) is the projection of the sample x _{ i } onto the eigenvector u _{ l }, so the product \(\left ({\mathbf {x}_{i}^{T}} \mathbf {u}_{l}\right) \mathbf {x}_{i}\) is x _{ i } scaled by this projection. Therefore, following from Equation (5), all eigenvectors u _{ l } with λ _{ l }≠0 lie in the span of x _{1},…,x _{ M }, i.e., all u _{ l } are linear combinations of the x _{ i } and can be written as expansions of the x _{ i } (Schölkopf and Smola 2002). As PCA is linear, its ability to retrieve the structure within a given data set is limited. If the principal components are non-linearly related to the input variables, a non-linear feature extractor is more suitable. This is realized by kernel PCA (Mika et al. 1999; Schölkopf and Smola 2002).
In summary, to project x onto the eigenvectors v _{ k }, the following steps are required: (i) compute the kernel matrix K, (ii) compute its eigenvectors α _{ k } and normalize them using (13) and (14), (iii) project the data sample x using (15).
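Steps (i)-(iii) can be sketched as follows, assuming a Gaussian kernel and the centering of Section 2.1. The eigenvectors α _{ k } of the centered kernel matrix are rescaled by \(1/\sqrt{\lambda_k}\) so that the feature-space eigenvectors v _{ k } have unit norm. Function names and the numerical details are ours, not the paper's.

```python
import numpy as np

def gaussian_kernel(A, B, c):
    """k(a, b) = exp(-||a - b||^2 / c) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / c)

def kpca_fit(X, c, n_comp):
    # (i) kernel matrix
    M = len(X)
    K = gaussian_kernel(X, X, c)
    # centering in feature space (Section 2.1)
    J = np.full((M, M), 1.0 / M)
    Kc = K - J @ K - K @ J + J @ K @ J
    # (ii) eigenvectors, ordered by decreasing eigenvalue, then normalized
    lam, A = np.linalg.eigh(Kc)
    lam, A = lam[::-1][:n_comp], A[:, ::-1][:, :n_comp]
    A = A / np.sqrt(np.maximum(lam, 1e-12))   # so that each v_k has unit norm
    return K, A

def kpca_project(x, X, K, A, c):
    # (iii) projections beta_k of a sample x onto the eigenvectors v_k
    k = gaussian_kernel(x[None, :], X, c)[0]
    kc = k - k.mean() - K.mean(axis=0) + K.mean()   # center the test kernel vector
    return kc @ A
```

Projecting the training samples themselves recovers the eigendecomposition: the variance of the projections decreases with the eigenvalue index.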
2.1 Centering
where 1 _{ M } is an M×M matrix with all entries equal to 1/M. The eigenvectors α _{ k } can then be computed by diagonalizing \(\tilde {\mathbf {K}}\) instead of K.
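In numpy, this centering is a one-liner; a quick sanity check is that the centered kernel matrix has zero row sums, since the feature-space mean has been removed (the helper name is ours):

```python
import numpy as np

def center_kernel(K):
    """K~ = K - 1_M K - K 1_M + 1_M K 1_M, with 1_M an MxM matrix
    whose entries all equal 1/M."""
    M = len(K)
    J = np.full((M, M), 1.0 / M)
    return K - J @ K - K @ J + J @ K @ J
```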
2.2 Kernel PCA for de-noising
where the eigenvectors are assumed to be ordered by decreasing eigenvalue size. Consequently, P _{ n } Φ(x) is a linear combination of the first n eigenvectors v _{ k } using the projections β _{ k } of (15) as weights. When all v _{ k } are used, the projected sample equals the original one, P _{ n } Φ(x)=Φ(x).
The drawback of de-noising in feature space is that in common applications the de-noised data is required in input space. The samples in input space that map to the projected samples in feature space, i.e., the pre-images, are determined by solving the pre-image problem.
with β _{ k } from (15) and α _{ ki }∈α _{ k } in (13). Note that the resulting pre-image z is always a linear combination of the input data x _{ i }, weighted by the similarity between the pre-image z and the data samples x _{ i } and by the coefficients γ _{ i }. This algorithm is sensitive to initialization; this can, however, be mitigated by reinitializing with different values.
where η is a non-negative regularization parameter and x _{ j } is the noisy sample corresponding to the de-noised sample z _{ j }. They show that the method is more stable than the method in (Mika et al. 1999).
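The fixed-point iteration of Mika et al. (1999) for the Gaussian kernel can be sketched as follows: z is repeatedly replaced by a similarity-weighted average of the input samples. The function name, parameter values, and the bail-out on a degenerate denominator (where one would reinitialize in practice) are ours.

```python
import numpy as np

def pre_image(X, gamma, c, z0, n_iter=50):
    """Fixed-point pre-image iteration for the Gaussian kernel:
    z <- sum_i gamma_i k(z, x_i) x_i / sum_i gamma_i k(z, x_i)."""
    z = z0.copy()
    for _ in range(n_iter):
        w = gamma * np.exp(-((X - z) ** 2).sum(axis=1) / c)
        denom = w.sum()
        if abs(denom) < 1e-12:      # degenerate step: reinitialize in practice
            break
        z = (w[:, None] * X).sum(axis=0) / denom
    return z
```

With uniform coefficients and a tight cluster of samples, the iteration converges near the cluster mean, which illustrates the "linear combination of the input data" property noted above.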
Kernel PCA for speech enhancement
where \(\mathbf {z}_{j}^{t+1}\) is the j ^{th } enhanced sample within a frequency band at iteration t+1, x _{ i } are the noisy samples with i=1,⋯,M, \(\tilde {\gamma }_{i}\) is given by (22) and M is the number of samples in the frequency band. We initialize \({\mathbf {z}_{j}^{0}}\) with the noisy sample x _{ j } and iterate (24) until convergence. Finally, the sample vectors are rearranged to patches and the audio signal is synthesized as described in Section 5.1.
Pre-image iterations for speech enhancement
When subspace methods are applied for speech enhancement, the number of components used for the projection step of PCA is a key parameter. In our framework, we empirically observed that the number of components used for projection has only a minor effect on the outcome of the de-noising process. The de-noising quality is essentially the same whether projection is performed on one component or on several. De-noising is primarily influenced by the kernel weights and by the value of the kernel variance. Therefore, we completely neglect the projection coefficients \(\tilde {\gamma }_{i}\) in (24) by setting them to one.
The weights of the linear combination are determined by the kernel k(·,·), which serves as similarity measure between two samples. The kernel variance c is used as parameter to scale the degree to which samples are treated as similar.
where x _{ j } is the noisy sample, for which the pre-image should be found and η≥0 is the regularization parameter that determines the influence of the noisy sample x _{ j } in PI.
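The resulting update can be sketched as follows, under our reading of the iteration with all projection coefficients set to one: each de-noised sample z _{ j } is a kernel-weighted average of the complex-valued noisy samples of a frequency band, and the regularization term pulls it back toward the noisy sample x _{ j }. This is a sketch of the assumed form, not the paper's exact equation (25); parameter values are invented.

```python
import numpy as np

def pi_denoise(X, c, eta=0.1, n_iter=30):
    """Pre-image iterations, projection coefficients set to one (assumed form):
    z_j <- (sum_i k(z_j, x_i) x_i + eta x_j) / (sum_i k(z_j, x_i) + eta),
    with complex-valued samples as rows of X."""
    Z = X.copy()
    for _ in range(n_iter):
        Znew = np.empty_like(Z)
        for j in range(len(X)):
            d2 = (np.abs(X - Z[j]) ** 2).sum(axis=1)  # squared distance, complex-safe
            k = np.exp(-d2 / c)
            Znew[j] = ((k[:, None] * X).sum(axis=0) + eta * X[j]) / (k.sum() + eta)
        Z = Znew
    return Z
```

Averaging complex-valued vectors with random phases is destructive, so noise components cancel; this is the mechanism discussed in Section 4.2.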
4.1 Analysis of pre-image iterations
computed between a feature vector x _{ j } and all vectors x _{ i } with i=1,…,M from one frequency band. This kernel vector always contains one large element equal to one because of self-similarity. The values of the other elements depend on the signal content.
4.2 Relation to non-local neighborhood filtering and to the non-local means algorithm
Performing de-noising on the time-frequency representation of speech incorporates some similarities to methods popular for image de-noising, namely, non-local neighborhood filtering and related methods. In many approaches for image and signal de-noising, the de-noised value of the signal is based on neighboring signal values. Gaussian or Gabor filters and anisotropic diffusion are examples of such de-noising approaches.
Most of these methods, however, do not take into consideration one property of many signals and images, namely their repetitive behavior, which means that in most signals, patterns of the original noise-free signal occur at different time instances or spatial locations (Singer et al. 2009). For time-domain signals this is the case for every periodic or nearly periodic signal, for instance neuronal spikes or heart beats. In images, there may as well be patches that occur at different spatial locations, e.g., in textures. For de-noising, it is preferable to exploit the occurrence of similar patterns in distant regions of the signal. Instead of using the values in the neighborhood, de-noising is performed over pixels belonging to similar patterns found anywhere in the image. This is realized by NF and bilateral filtering (Barash 2002; Singer et al. 2009). NF is often executed iteratively, as a single iteration is not sufficient to achieve de-noising; the resulting iteration scheme is similar to that of PI (Singer et al. 2009).
The non-local means (NL) algorithm proposed by Buades et al. (2005) is derived from NF. The NL algorithm formulated in vector notation is equivalent to the first iteration of the pre-image iteration equation (25), if the neighborhoods of one pixel are chosen equivalently to patches. A substantial difference, however, is that in the case of speech enhancement the frequency bins – which correspond to the pixels – are complex-valued.
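A minimal 1-D version of the NL algorithm illustrates the idea: each sample is replaced by a weighted average over all samples whose neighborhoods (patches) are similar, regardless of their position in the signal. The patch size and bandwidth `h` are invented values for illustration.

```python
import numpy as np

def nl_means_1d(x, patch=5, h=1.0):
    """Non-local means on a 1-D signal: average over samples with
    similar neighborhoods, wherever they occur."""
    half = patch // 2
    xp = np.pad(x, half, mode='reflect')
    P = np.array([xp[i:i + patch] for i in range(len(x))])  # one patch per sample
    out = np.empty_like(x)
    for j in range(len(x)):
        w = np.exp(-((P - P[j]) ** 2).sum(axis=1) / h ** 2)  # patch similarity
        out[j] = (w * x).sum() / w.sum()
    return out
```

For a periodic signal, every period supplies similar patches, so the average is taken over many far-apart samples and the noise is strongly reduced.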
Besides image de-noising, NF has recently been applied in speech enhancement. In (Talmon 2011; Talmon et al. 2011), NF is employed to suppress transient noise. Transient noise consists of short bursts that most speech enhancement algorithms fail to suppress, as they are restricted to stationary noise. The repetitive structure of transient noise that renders other enhancement algorithms unsuitable can be exploited by application of non-local filtering. Talmon et al. (2011) noted that the non-local neighborhood filter is equivalent to non-local diffusion filters (NLDF). Although NLDF and pre-image iterations are related, their purpose is considerably different. NLDF make use of a kernel to get reliable estimates of noise transients by constructive averaging. These noise estimates are subsequently used in a speech enhancement algorithm. PI, on the other hand, use the kernel directly as weight in a linear combination to attenuate noise by destructive averaging of complex-valued feature vectors.
4.3 Determination of the kernel variance in PI
As the performance of PI strongly depends on the kernel variance c, we adapt c for varying noise conditions and levels. Two heuristic approaches are used for the determination of the kernel variance, one for AWGN and one for colored noise (Leitner and Pernkopf 2013). Both make use of a mapping function to derive a suitable value for c from a noise estimate.
Additionally, the IPS score has to be greater than 10 to avoid the situation where S is large due to good TPS and APS scores but no de-noising is achieved. The noise power is estimated from the beginning of the recording, assuming stationary noise and no speech within this region. The values for c that lead to the highest score S for the individual utterances and the corresponding noise estimates are fitted by a polynomial of second order. This function is used to obtain values of c from noise estimates in the test signals.
For colored noise, a single value for c for all frequency bands is insufficient for substantial de-noising as the noise power is not equally distributed over the frequency range. For this reason we derive the averaged noise power estimate for each frequency band individually. These estimates are used in the mapping function derived for white noise to obtain values of c for each frequency band. In addition, we derive another mapping function by employing the measured global SNR after enhancement as optimization criterion instead of the score S. A comparison showed that the mapping function based on the global SNR results in better de-noising performance.
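The mapping function itself is a second-order polynomial fit from noise power estimates to the best-performing kernel variances on the development set. The sketch below uses invented development-set values; the paper's actual fitted coefficients are not reproduced here.

```python
import numpy as np

# Hypothetical development data: per-utterance noise power estimates and the
# kernel variances that scored best on the development set (values invented).
noise_est = np.array([0.01, 0.05, 0.1, 0.2, 0.4, 0.8])
best_c    = np.array([0.30, 0.50, 0.8, 1.4, 2.1, 3.5])

coeff = np.polyfit(noise_est, best_c, deg=2)   # second-order mapping function

def kernel_variance(noise_power):
    """Map a noise power estimate to a kernel variance c."""
    return np.polyval(coeff, noise_power)
```

For colored noise, the same mapping is evaluated once per frequency band with that band's noise power estimate, yielding a frequency-dependent c.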
Experimental setup and evaluation
To evaluate the proposed speech enhancement algorithms, we performed four different experiments. For all experiments, the speech data was corrupted by noise at 0, 5, 10, and 15 dB SNR. In the first two experiments, we evaluate the results in terms of objective speech quality measures, namely, the PESQ measure and the scores of the PEASS toolbox. In the other two experiments, we compare the performance of a speech recognition system before and after enhancement by PI.
In experiment 1, we compare kernel PCA with the normalized iterative pre-image method (kPCA) as given in (24) and two variants of PI. For the variant denoted by PI_{cSNR}, a suitable value for the kernel variance c is derived from the performance on a development set for each SNR. For PI with heuristic determination of the kernel variance (PID) the kernel variance is derived from a mapping function as explained in Section 4.3. Enhancement is performed on data of the airbone database corrupted by AWGN.
In experiment 2, we perform enhancement on data of the Noizeus database corrupted by car noise. We evaluate two variants of PI with frequency-dependent determination of the kernel variance (PIDF) for colored noise. Both variants, PIDF_{SNR} and PIDF_{SNR-Var}, employ the SNR to derive the mapping function. Furthermore, the parameter settings of the feature extraction are varied for PIDF_{SNR-Var}.
Experiments 3 and 4 use a speech recognition system. In both experiments, data of the airbone database is tested. To train the automatic speech recognizer, we use data of the BAS PhonDat 1 database (Schiel and Baumann 2006). In experiment 3, the data is corrupted by AWGN and enhanced by PI_{cSNR} and PID. In experiment 4, the speech data is corrupted by car noise and enhanced by the PIDF method based on the PEASS scores (PIDF_{PEASS}).
5.1 Feature extraction and synthesis
5.2 Databases
5.2.1 Noizeus database
The Noizeus database was proposed to enable the comparison of speech enhancement methods (Hu and Loizou 2007). The database contains recordings of 30 IEEE sentences (in English) (IEEE Subcommittee 1969), spoken by three female and three male speakers (five sentences each). The sentences were recorded with 25 kHz sampling frequency and downsampled to 8 kHz. Furthermore, the speech signals were filtered by the modified Intermediate Reference System filters used in ITU-T P.862 (ITU-T 2001) to simulate the frequency characteristics of a telephone handset. The recordings are corrupted by eight types of real-world noise. The SNR computation is based on the active speech level (ASL) (ITU-T 2011). We use the data corrupted by car noise and additionally contaminated the clean recordings by AWGN for the derivation of the mapping functions. The development set contains one sentence per speaker and SNR condition.
5.2.2 Airbone database
The airbone database consists of 120 utterances read by six speakers – three male and three female – of the Austrian variety of German (Domes 2009). The utterances are recorded by the close-talk microphone of a headset with a sampling frequency of 16 kHz. The headset is further supplied with a bone conduction microphone, hence the name airbone database. The signal of the bone microphone, however, is not used in this work. The data is corrupted by AWGN and by car noise from the NOISEX-92 database (Varga and Steeneken 1993) with consideration of the ASL. A subset of two utterances per speaker and SNR condition is used for development, i.e., for setting the kernel variance or for deriving the mapping function for estimating the kernel variance.
5.2.3 BAS PhonDat 1 database
The BAS PhonDat 1 (BAS PD1) database belongs to the Bavarian Archive for Speech Signals Corpora (Schiel and Baumann 2006). The BAS PD1 corpus contains read speech uttered by 201 different speakers of German. In total, 21587 utterances were recorded with a sampling frequency of 48 kHz. The data was downsampled to 16 kHz.
We use 4999 clean utterances of the BAS database to train the speech recognizer. These utterances correspond to 50 different speakers, resulting in around 100 utterances per speaker and 1504 different words in total. The main reason to use the data of the BAS database is that the airbone database initially used for speech enhancement does not provide a sufficient amount of data for training. This setup also allows us to study the effect of presenting unseen data to the speech recognizer.
5.3 Objective quality measures
For objective evaluation we use two measures:
5.3.1 PESQ
The PESQ measure is recommended by the ITU-T for quality assessment of narrow-band telephone speech and narrow-band speech codecs (ITU-T 2001; Rix et al. 2001). The PESQ measure returns a mean opinion score (MOS) between 0.5 and 4.5. In (Hu and Loizou 2008), PESQ was reported to show high correlation with the outcome of subjective listening tests on speech enhancement algorithms.
5.3.2 PEASS
The objective measures of the PEASS toolbox are developed for audio source separation (Emiya et al. 2011). The design of these measures is based on the outcome of subjective listening tests, and the measures strongly agree with subjective scores. With the PEASS toolbox, four aspects of the signal can be tested: the global quality (OPS - overall perceptual score), the preservation of the target signal (TPS - target perceptual score), the suppression of the interfering signals (IPS - interference perceptual score), and the absence of additional artificial noise (APS - artifact perceptual score). The scores range from 0 to 100; larger values denote better performance.
5.4 Automatic speech recognition
The automatic speech recognizer is based on the Hidden Markov Toolkit (HTK) (Young et al. 2006). The front-end (FE) and the back-end (BE) are both derived from the standard recognizer of the Aurora-4 database (Hirsch 2002). The FE computes Mel frequency cepstral coefficients (MFCCs) by using a sampling frequency of 16 kHz, a frame shift of 10 ms, a window length of 32 ms, 1024 frequency bins, 26 Mel channels, and 13 cepstral coefficients. Cepstral mean normalization is employed on the MFCCs. Furthermore, delta and delta-delta features are computed with a window length of 5 (half length 2). This finally leads to a feature vector of 39 components.
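The Mel filterbank and DCT stages of such a front-end can be sketched for a single frame as follows. This is our own simplified implementation using the stated sizes (16 kHz, 1024 FFT bins, 26 Mel channels, 13 cepstral coefficients); it omits pre-emphasis, liftering, and other HTK-specific details.

```python
import numpy as np

def mfcc(frame_mag, sr=16000, n_fft=1024, n_mels=26, n_ceps=13):
    """MFCCs for one frame's magnitude spectrum (length n_fft//2 + 1)."""
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # triangular filters spaced evenly on the Mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft // 2 + 1) * mel_to_hz(mels) / (sr / 2)).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, cen, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:cen] = (np.arange(l, cen) - l) / max(cen - l, 1)
        fb[m - 1, cen:r] = (r - np.arange(cen, r)) / max(r - cen, 1)
    logmel = np.log(fb @ frame_mag + 1e-10)     # log Mel-channel energies
    # DCT-II of the log energies gives the cepstral coefficients
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_mels))
    return dct @ logmel
```

Delta and delta-delta features would then be appended per frame to obtain the 39-dimensional vector described above.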
For training, the BE uses a dictionary based on 34 SAMPA-monophones. For each triphone, a hidden Markov model (HMM) is trained, which consists of 6 states and Gaussian mixture models of 8 components per state. To reduce the complexity and to overcome the lack of training data for some triphones, a tree-based clustering based on monophone-classification is applied. The grammar used for training is probabilistically modeled. In contrast to that, a rule-based grammar is applied for testing as the utterances of the airbone database obey very strict grammar rules.
The recognition performance is measured by the word accuracy \(\text{WAcc} = \frac{N - S - D - I}{N} \cdot 100\,\%\), where N is the number of words, S is the number of substitutions, D is the number of deletions, and I is the number of insertions.
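As a worked example of the standard word accuracy definition: a hypothesized run with N=100 words, 10 substitutions, 5 deletions, and 5 insertions yields a WAcc of 80 %.

```python
def word_accuracy(n_words, n_sub, n_del, n_ins):
    """WAcc = (N - S - D - I) / N * 100 (in percent)."""
    return 100.0 * (n_words - n_sub - n_del - n_ins) / n_words

print(word_accuracy(100, 10, 5, 5))  # 80.0
```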
In addition to the WAcc, we evaluated if the performance difference between the pre-image iteration methods and the reference methods is statistically significant. We use a matched pairs test as recommended in (Gillick and Cox 1989). The matched pairs test is based on the pair-wise comparison of the recognition rates on the same utterance processed by two different algorithms. This test is suitable to test the significance of ASR results on speech segments that are statistically independent, i.e., an error in one segment is not influenced by an error in a preceding segment. This is the case for the experiments on the airbone database, as we test utterances independent from each other. For all evaluations, we employ a significance level of 0.01.
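The matched pairs test can be sketched as follows: the per-utterance error-count differences between two algorithms are tested for zero mean with a normal approximation. This is a simplified reading of the test of Gillick and Cox (1989); for a two-sided test at the 0.01 significance level, the threshold on |z| is about 2.58.

```python
import math

def matched_pairs_z(errors_a, errors_b):
    """z statistic of the per-utterance error-count differences
    between two algorithms tested on the same utterances."""
    d = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    return mean / math.sqrt(var / n)
```

The pairing over identical utterances removes utterance difficulty as a source of variance, which is why the test needs statistically independent segments.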
Results and discussion
In this section, we present the evaluation results of the experiments described in Section 5. As a benchmark, results of the generalized subspace method (Hu and Loizou 2003), spectral subtraction (Berouti et al. 1979), the MMSE log-spectral amplitude estimator (Ephraim and Malah 1985), and of the noisy baseline are given.
6.1 Experiment 1: Kernel PCA, PI with SNR-dependent kernel variance, and PI with heuristic determination of the kernel variance
All methods gain an improvement of overall quality (OPS) in comparison to the noisy speech data. The performance of PI_{cSNR} and PID is superior to the performance of kPCA and the generalized subspace method. For low SNRs, the OPS of PI_{cSNR} is similar to spectral subtraction and the MMSE log-spectral amplitude estimator, while for high SNRs the other methods are superior. The performance of PID is better than the reference methods at low SNRs. It is worth noting that the APS of PI_{cSNR} and PID is better than that of the other methods in most SNR conditions, indicating that they introduce few artifacts, in contrast to, for instance, the musical noise produced by the generalized subspace method and spectral subtraction.
Figure 7 shows the PESQ. All methods improve the score in comparison to the noisy speech data, except for PID at 15 dB SNR. This indicates that the used mapping function is not optimally chosen at high SNRs. As for the OPS, the performance of PID is better than the performance of kPCA and PI_{cSNR}. At low SNRs, the score of PID is similar to the reference methods, while it is lower at high SNRs. This also suggests that the mapping function for high SNRs is not optimal. The presence of musical noise in the recordings enhanced by spectral subtraction and the generalized subspace method is not reflected by the PESQ measure.
Listening to the signals enhanced by the proposed methods reveals that noise is removed and no musical noise occurs^{a}. However, there is some background noise left around speech components, which is also reflected by the rather low IPS of the pre-image iteration methods. In the case of kPCA, a buzz-like artifact can be perceived. Note that this is well reflected by the low APS.
6.2 Experiment 2: PI with frequency-dependent determination of the kernel variance for colored noise
The overall quality of PIDF_{SNR} and PIDF_{SNR-Var} is better than the overall quality of the noisy signal and the generalized subspace method, but lower than the overall quality of the other reference methods. PIDF_{SNR-Var} achieves consistently higher scores than PIDF_{SNR}. In terms of PESQ, the reference methods show superior performance, but the difference is rather small.
Listening to the signals enhanced by the PIDF methods reveals that there is noise left around speech components. For PIDF_{SNR-Var} the noise components are smoother than for PIDF_{SNR}, however, a hum can be perceived in the background. This is similar to the buzz-like artifact and caused by the smaller number of feature vectors in one frequency band due to the changed configuration. In the signals processed by the MMSE log-spectral amplitude estimator there is some background noise left and minor musical noise-like artifacts can be perceived, while the signals enhanced by spectral subtraction and the generalized subspace method are strongly affected by musical noise.
6.3 Experiment 3: ASR of data corrupted by white noise and enhanced by PID
WAcc on data corrupted by AWGN before and after enhancement
| Condition | 0 dB | 5 dB | 10 dB | 15 dB | Average |
|---|---|---|---|---|---|
| Noisy | 0.00 | 15.56 | 38.89 | 65.56 | 30.00 |
| PI_{cSNR} | 27.22 | 53.89 | 68.33 | 72.59 | 57.15 |
| PID | 35.93 | 58.70 | 72.22 | 77.59 | 61.11 |
| Subspace | 2.59 | 4.63 | 16.30 | 42.96 | 16.62 |
| Subspace_{MNS} | 22.96 | 36.48 | 46.85 | 68.89 | 43.80 |
| SpecSub | 25.74 | 53.15 | 73.89 | 85.56 | 59.59 |
| LogMMSE | 37.78 | 58.15 | 74.63 | 89.07 | 64.91 |
| Clean | 97.78 | | | | |
Results of the statistical significance test between PID and the reference methods for the WAcc in Table 1
| PID | 0 dB | 5 dB | 10 dB | 15 dB |
|---|---|---|---|---|
| Noisy | * | * | * | * |
| Subspace | * | * | * | * |
| SpecSub | * | - | | |
| LogMMSE | - | - | - | |
The WAcc for the noisy data clearly shows that the recognizer performance suffers from the noise contamination. The enhancement based on PI successfully increases the WAcc in comparison to the noisy data. The WAcc of PID is always superior to the WAcc of the generalized subspace method, similar to the WAcc of spectral subtraction, and lower than the WAcc of the MMSE log-spectral amplitude estimator. The advantage of PID is statistically significant with respect to the generalized subspace method and the noisy data at all SNRs, and with respect to spectral subtraction at 0 dB SNR. The relatively high WAcc of the pre-image iteration methods shows a different trend compared to the PESQ results, where the scores of the reference methods are better than those of the pre-image iteration methods. The comparison of PI_{cSNR} to PID reveals that PID always achieves higher word accuracies. This confirms that the heuristic determination of the kernel variance is preferable over using a fixed value for one noise condition.
To test the hypothesis that musical noise is problematic for the speech recognizer we further evaluated the WAcc on data corrupted by AWGN, enhanced by the generalized subspace method and subsequently post-processed by the musical noise suppression (MNS) method proposed in (Leitner and Pernkopf 2012). The results are included in Table 1 and denoted as Subspace_{MNS}. The WAcc is better after the MNS and the performance difference is significant. Hence, the musical noise is indeed a problem for the recognizer and speech enhancement methods introducing too many artifacts may be counterproductive, as shown for the generalized subspace method, where the WAcc is even lower than the WAcc for the noisy data.
6.4 Experiment 4: ASR of data corrupted by colored noise and enhanced by PIDF
WAcc on data corrupted by car noise before and after enhancement
| Condition | 0 dB | 5 dB | 10 dB | 15 dB | Average |
|---|---|---|---|---|---|
| Noisy | 1.30 | 25.93 | 62.78 | 85.19 | 43.80 |
| PIDF_{PEASS} | 34.95 | 62.04 | 81.48 | 89.26 | 66.93 |
| Subspace | 8.52 | 27.04 | 66.85 | 81.48 | 45.97 |
| SpecSub | 29.26 | 61.11 | 79.26 | 90.74 | 65.23 |
| LogMMSE | 52.78 | 75.74 | 86.11 | 94.07 | 77.17 |
| Clean | 97.78 | | | | |
Results of the statistical significance test between PIDF and the reference methods for the WAcc in Table 3
| PIDF_{PEASS} | 0 dB | 5 dB | 10 dB | 15 dB |
|---|---|---|---|---|
| Noisy | * | * | * | |
| Subspace | * | * | * | * |
| SpecSub | * | | | |
| LogMMSE | - | - | - | - |
The results of the experiments with car noise show that this type of noise is less harmful to recognizer performance than white noise. This can be explained by the fact that the noise energy is concentrated below 1 kHz, where the speech components are relatively strong, so the distortion caused by the noise is limited. As in the experiments with white noise, the WAcc of PIDF_{PEASS} is higher than that of the generalized subspace method, similar to that of spectral subtraction, and lower than that of the MMSE log-spectral amplitude estimator. The performance is significantly better than on the noisy data except at 15 dB, significantly better than the generalized subspace method at all SNRs, and significantly better than spectral subtraction at 0 dB.
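The frequency-dependent determination of the kernel variance (PIDF) can be pictured as deriving one variance per frequency bin from a noise estimate, so that strongly contaminated bands are de-noised more aggressively. The sketch below is only schematic: the assumption that the leading STFT frames are noise-only and the proportionality constant `c` are illustrative choices, not the paper's actual rule.

```python
import numpy as np

def kernel_variance_per_band(stft, noise_frames=10, c=2.0):
    """Illustrative per-bin kernel variance from a simple noise estimate.

    stft: complex array of shape (bins, frames);
    noise_frames: number of leading frames assumed to be noise-only;
    c: hypothetical scaling between noise power and kernel variance.
    """
    noise = stft[:, :noise_frames]                      # bins x frames
    noise_power = np.mean(np.abs(noise) ** 2, axis=1)   # per-bin noise power
    return c * noise_power                              # per-bin variance

# Synthetic complex "spectrogram" with 257 bins and 100 frames.
rng = np.random.default_rng(0)
spec = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
sigma2 = kernel_variance_per_band(spec)
print(sigma2.shape)  # → (257,) — one kernel variance per frequency bin
```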
Conclusion
In this paper, we used kernel PCA for speech enhancement. We applied kernel PCA to complex-valued feature vectors extracted from the time-frequency representation of noisy utterances and used an iterative pre-image method to synthesize the de-noised audio signal.
Experimental results show that for the iterative pre-image methods the weighting factor derived from the projection of kernel PCA contributes little to de-noising. The de-noising mainly results from the linear combination of complex-valued feature vectors, which leads to cancellation of random-phase noise components. We therefore simplified the pre-image computation by setting the weighting coefficients to one and call the resulting method pre-image iterations (PI) for speech enhancement. Both kernel PCA and PI depend on the kernel variance as a tuning parameter, which influences the degree of de-noising. We therefore extended PI by a heuristic determination of the kernel variance for white noise and by a frequency-dependent determination for colored noise. This way, PI adapts to arbitrary noise conditions.
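With all weighting coefficients set to one, a pre-image iteration for the Gaussian kernel reduces to a kernel-weighted average of the feature vectors, which is why random-phase noise components tend to cancel. The following sketch mirrors this fixed-point update on synthetic complex data; the kernel variance, iteration count, and data are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pre_image_iterations(X, sigma2, n_iter=5):
    """Gaussian-kernel pre-image iterations with unit weighting coefficients.

    X: complex feature vectors of shape (n_vectors, dim);
    sigma2: kernel variance (tuning parameter);
    returns de-noised iterates of the same shape.
    """
    Z = X.copy()
    for _ in range(n_iter):
        # Pairwise squared distances between current iterates and the data.
        d2 = np.sum(np.abs(Z[:, None, :] - X[None, :, :]) ** 2, axis=2)
        K = np.exp(-d2 / sigma2)  # Gaussian kernel weights
        # Each iterate becomes a kernel-weighted average of all data vectors.
        Z = (K @ X) / np.sum(K, axis=1, keepdims=True)
    return Z

# Synthetic example: a constant complex vector corrupted by random-phase noise.
rng = np.random.default_rng(1)
clean = np.ones((50, 8), dtype=complex)
noise = 0.3 * (rng.standard_normal((50, 8)) + 1j * rng.standard_normal((50, 8)))
Z = pre_image_iterations(clean + noise, sigma2=4.0)
# Averaging with kernel weights cancels much of the random-phase noise.
print(np.mean(np.abs(Z - clean)) < np.mean(np.abs(noise)))
```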
The evaluation in terms of PESQ and PEASS shows that the performance of kernel PCA and PI for speech enhancement is comparable to that of the reference methods at low SNRs, while at high SNRs spectral subtraction and the MMSE log-spectral amplitude estimator achieve better scores. We further evaluated the effect of speech enhancement on automatic speech recognition. The word accuracies on speech enhanced by PI are superior to those achieved on noisy speech and on speech enhanced by the generalized subspace method. In contrast to PI, the generalized subspace method is prone to musical noise, which deteriorates recognition performance. The recognition performance of the MMSE log-spectral amplitude estimator is better than that of PI, while the performance of spectral subtraction is similar.
In future work, we would like to extend the pre-image iteration method with a noise tracker to generalize it from stationary noise to other noise types such as babble noise. Furthermore, we plan to build a recognizer for the data of the Noizeus speech enhancement database.
Endnote
^{a}Audio samples are provided on http://www2.spsc.tugraz.at/people/chrisl/audio/springer2015.
Declarations
Acknowledgements
This research has been carried out in the context of the national project NFN-SISE and the European project DIRHA. We gratefully acknowledge funding by the Austrian Science Fund (FWF) under the project number S10604-N13 and the European Commission under the project number FP7-ICT-2011-7-288121. The authors gratefully acknowledge Juan A. Morales-Cordovilla for providing the speech recognition system.
Authors’ Affiliations
References
- Abrahamsen, TJ, Hansen LK (2009) Input space regularization stabilizes pre-images for kernel PCA de-noising. In: IEEE International Workshop on Machine Learning for Signal Processing (MLSP).
- Barash, D (2002) A fundamental relationship between bilateral filtering, adaptive smoothing, and the nonlinear diffusion equation. IEEE Trans Pattern Anal Mach Intell 24(6): 844–847. doi:10.1109/TPAMI.2002.1008390
- Berouti, M, Schwartz M, Makhoul J (1979) Enhancement of speech corrupted by acoustic noise. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), 208–211.
- Bishop, CM (2006) Pattern Recognition and Machine Learning. Springer, New York.
- Boll, SF (1979) Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans Acoust Speech Signal Process 27(2): 113–120. doi:10.1109/TASSP.1979.1163209
- Buades, A, Coll B, Morel JM (2005) A review of image denoising algorithms, with a new one. Multiscale Model Simul 4(2): 480–530.
- Domes, C (2009) Kombiniertes Luft- und Knochenleitungsmikrofon-Headset zur robusten Sprachsignalerfassung [Combined air- and bone-conduction microphone headset for robust speech signal acquisition]. Master's thesis, Graz University of Technology, Graz.
- Emiya, V, Vincent E, Harlander N, Hohmann V (2011) Subjective and objective quality assessment of audio source separation. IEEE Trans Audio Speech Lang Process 19(7): 2046–2057. doi:10.1109/TASL.2011.2109381
- Ephraim, Y, Malah D (1984) Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 32(6): 1109–1121.
- Ephraim, Y, Malah D (1985) Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans Acoust Speech Signal Process 33(2): 443–445. doi:10.1109/TASSP.1985.1164550
- Ephraim, Y, Van Trees HL (1995) A signal subspace approach for speech enhancement. IEEE Trans Speech Audio Process 3(4): 251–266.
- Gillick, L, Cox S (1989) Some statistical issues in the comparison of speech recognition algorithms. In: International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 532–535. doi:10.1109/ICASSP.1989.266481
- Griffin, DW, Lim JS (1984) Signal estimation from modified short-time Fourier transform. IEEE Trans Acoust Speech Signal Process 32(2): 236–243. doi:10.1109/TASSP.1984.1164317
- Hirsch, HG (2002) Experimental framework for the performance evaluation of speech recognition front-ends on a large vocabulary task. Tech. rep., STQ AURORA DSR Working Group.
- Honeine, P, Richard C (2011) Preimage problem in kernel-based machine learning. IEEE Signal Process Mag 28(2): 77–88. doi:10.1109/MSP.2010.939747
- Hu, Y, Loizou PC (2003) A generalized subspace approach for enhancing speech corrupted by colored noise. IEEE Trans Speech Audio Process 11: 334–341.
- Hu, Y, Loizou PC (2007) Subjective evaluation and comparison of speech enhancement algorithms. Speech Commun 49: 588–601.
- Hu, Y, Loizou PC (2008) Evaluation of objective quality measures for speech enhancement. IEEE Trans Audio Speech Lang Process 16(1): 229–238. doi:10.1109/TASL.2007.911054
- IEEE Subcommittee (1969) IEEE recommended practice for speech quality measurements. IEEE Trans Audio Electroacoust 17(3): 225–246.
- ITU-T (2001) Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs. ITU-T Recommendation P.862, Geneva.
- ITU-T (2011) Objective measurement of active speech level. ITU-T Recommendation P.56, Geneva.
- Kwok, JT, Tsang IW (2004) The pre-image problem in kernel methods. IEEE Trans Neural Netw 15: 408–415.
- Leitner, C, Pernkopf F (2012) Suppression of musical noise in enhanced speech using pre-image iterations. In: 20th European Signal Processing Conference (EUSIPCO), 478–481.
- Leitner, C, Pernkopf F (2013) Generalization of pre-image iterations for speech enhancement. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7010–7014.
- Leitner, C, Pernkopf F, Kubin G (2011) Kernel PCA for speech enhancement. In: 12th Annual Conference of the International Speech Communication Association (Interspeech), 1221–1224.
- Loizou, PC (2007) Speech Enhancement: Theory and Practice. CRC, Boca Raton.
- McAulay, R, Malpass M (1980) Speech enhancement using a soft-decision noise suppression filter. IEEE Trans Acoust Speech Signal Process 28(2): 137–145. doi:10.1109/TASSP.1980.1163394
- Mika, S, Schölkopf B, Smola A, Müller K-R, Scholz M, Rätsch G (1999) Kernel PCA and de-noising in feature spaces. Adv Neural Inform Process Syst 11: 536–542.
- Rix, A, Beerends J, Hollier M, Hekstra A (2001) Perceptual evaluation of speech quality (PESQ) – a new method for speech quality assessment of telephone networks and codecs. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), 749–752. doi:10.1109/ICASSP.2001.941023
- Schiel, F, Baumann A (2006) Phondat 1, corpus version 3.4. München. http://www.bas.unimuenchen.de/forschung/Bas/BasPD1eng.html.
- Schölkopf, B, Smola AJ (2002) Learning with Kernels. MIT Press, Cambridge, MA.
- Schölkopf, B, Smola A, Müller K-R (1996) Nonlinear component analysis as a kernel eigenvalue problem. Tech. rep., Max Planck Institute for Biological Cybernetics, Tübingen.
- Singer, A, Shkolnisky Y, Nadler B (2009) Diffusion interpretation of nonlocal neighborhood filters for signal denoising. SIAM J Imaging Sci 2(1): 118–139. doi:10.1137/070712146
- Talmon, R (2011) Supervised speech processing based on geometric analysis. Ph.D. thesis, Technion – Israel Institute of Technology, Haifa.
- Talmon, R, Cohen I, Gannot S (2011) Transient noise reduction using nonlocal diffusion filters. IEEE Trans Audio Speech Lang Process 19(6): 1584–1599.
- Varga, A, Steeneken HJ (1993) Assessment for automatic speech recognition: II. NOISEX-92: A database and an experiment to study the effect of additive noise on speech recognition systems. Speech Commun 12(3): 247–251. doi:10.1016/0167-6393(93)90095-3
- Young, S, Evermann G, Gales M, Hain T, Kershaw D, Liu XA, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P (2006) The HTK Book. Cambridge University Engineering Department, Cambridge.
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.