# A bio-inspired feature extraction for robust speech recognition

- Youssef Zouhir^{1}
- Kaïs Ouni^{1}

**3**:651

https://doi.org/10.1186/2193-1801-3-651

© Zouhir and Ouni; licensee Springer. 2014

**Received:** 29 August 2014

**Accepted:** 24 October 2014

**Published:** 4 November 2014

## Abstract

This paper proposes a feature extraction method for robust speech recognition in noisy environments. The method is motivated by a biologically inspired auditory model that simulates the outer/middle ear filtering with a low-pass filter and the spectral behaviour of the cochlea with the Gammachirp auditory filterbank (GcFB). Its recognition performance is tested on speech signals corrupted by real-world noises. The evaluation results show that the proposed method gives better recognition rates than classic techniques such as Perceptual Linear Prediction (PLP), Linear Predictive Coding (LPC), Linear Prediction Cepstral Coefficients (LPCC) and Mel Frequency Cepstral Coefficients (MFCC). The recognition system is based on Hidden Markov Models with continuous Gaussian Mixture densities (HMM-GM).

## Keywords

## Introduction

Automatic speech recognition (ASR) is one of the leading technologies for man–machine communication in real-world applications (Furui 2010). An ASR system is composed of two main modules. The first is the acoustic front-end (or feature extractor), which generally uses classical acoustic feature extraction techniques such as Perceptual Linear Prediction (PLP) (Hermansky 1990), Linear Predictive Coding (LPC) (Atal and Hanauer 1971), Linear Prediction Cepstral Coefficients (LPCC) (Atal 1974) and Mel Frequency Cepstral Coefficients (MFCC) (Davis and Mermelstein 1980). The second module is the classifier, which is commonly based on Hidden Markov Models.

Early feature-based techniques incorporate psychoacoustic and neurophysical knowledge obtained from the study of the human auditory system, which is capable of segmenting, localizing and recognizing speech in noisy conditions without a noticeable degradation in recognition performance (Rabiner and Juang 1993).

Generally, feature extraction techniques are based on auditory filter modelling, which uses a filterbank to simulate the cochlear filtering (Meddis et al. 2010). An efficient model of this auditory filterbank improves both the recognition performance and the robustness of the features in noisy environments.

The gammatone filterbank has been employed as the auditory filter model in various speech processing systems, such as the Computational Auditory Scene Analysis system (Wang and Brown 2006).

Irino and Patterson have proposed an excellent candidate model for an asymmetric, level-dependent cochlear filter, called the Gammachirp auditory filter, which is consistent with basic physiological data (Irino and Patterson 1997, 2006). This filter is an extension of the gammatone filter with an additional chirp parameter that produces an asymmetric amplitude spectrum. It provides an approximation of the auditory frequency response.

In this paper, we propose a biologically-inspired feature extraction method for robust recognition of noisy speech signals. The proposed method is based on the human auditory system characteristics, and relies on both the outer and middle ear filtering and the spectral behaviour of the cochlea. The outer and middle ear filtering is modelled by a second-order low-pass filter (Martens and Van Immerseel 1990; Van Immerseel and Martens 1992). The cochlear filter is modelled by a gammachirp auditory filterbank consisting of 34 filters, where the centre frequencies are equally spaced on the ERB-rate scale from 50 Hz to 8 kHz.

The HTK 3.4.1 toolkit is used for model training and recognition of the speech signals. It is based on Gaussian Mixture density Hidden Markov Models (Young et al. 2009). In our work, one HMM with five states is trained for each word, and each state emission density consists of four Gaussian mixture components.

The recognition performance of our feature extraction method was evaluated on speech signals corrupted by real-world noisy environments. The obtained results are compared to those obtained using PLP, LPC, LPCC and MFCC.

The paper is organized as follows. Section 2 presents the speech recognition system based on hidden Markov models and introduces the classic feature extraction techniques for speech signals. Section 3 details the proposed feature extraction method and the auditory filter modelling on which it is based. Section 4 discusses the experimental and evaluation results. Finally, conclusions are presented in the last section.

## The speech recognition system

### The HMM based ASR

In HMM-based ASR, the sequence of acoustic vectors $O$ ($O = o_1, o_2, o_3, \ldots, o_t, \ldots, o_T$, where $o_t$ is the acoustic vector observed at time $t$) associated with each word is modelled as being generated by a Markov model (Young et al. 2009), as shown in Figure 2.

The HMM is a finite state machine which generates, at each state change, an acoustic vector $o_t$ drawn from the probability density $b_j(o_t)$. A change of state occurs at every time unit, and the transition from state $i$ to state $j$ is governed by the state transition probability $a_{ij}$. Figure 2 shows an example in which the observation sequence $o_1$ to $o_5$ is generated by the state sequence S = 1, 2, 2, 3, 4, 4, 5 of a five-state HMM with non-emitting entry and exit states. The HMM supports continuous Gaussian Mixture density distributions.
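The way transition probabilities and emission densities combine along one state path can be sketched as follows. This is a minimal illustration, not the HTK implementation: the transition values, the toy scalar emission density and all function names are assumptions made for the example, and only the state path S = 1, 2, 2, 3, 4, 4, 5 comes from the text.

```python
import math

def joint_log_prob(states, obs, trans, emit_logpdf):
    """Log of P(O, S | model) for a path whose first and last states
    are non-emitting (entry and exit states)."""
    emitting = states[1:-1]
    if len(emitting) != len(obs):
        raise ValueError("one observation is needed per emitting state")
    total = 0.0
    for prev, cur in zip(states[:-1], states[1:]):
        total += math.log(trans[(prev, cur)])   # transition a_ij
    for j, o in zip(emitting, obs):
        total += emit_logpdf(j, o)              # emission b_j(o_t)
    return total

# Hypothetical left-to-right transition probabilities (illustrative only).
TRANS = {(1, 2): 1.0, (2, 2): 0.6, (2, 3): 0.4, (3, 3): 0.5,
         (3, 4): 0.5, (4, 4): 0.7, (4, 5): 0.3}

def toy_emit_logpdf(j, o):
    """Hypothetical emission: unit-variance scalar Gaussian centred on j."""
    return -0.5 * math.log(2.0 * math.pi) - 0.5 * (o - j) ** 2

PATH = [1, 2, 2, 3, 4, 4, 5]          # state sequence from the text
OBS = [2.1, 1.9, 3.2, 4.0, 3.8]       # five made-up scalar observations
LOGP = joint_log_prob(PATH, OBS, TRANS, toy_emit_logpdf)
```

Working in the log domain avoids numerical underflow when many small probabilities are multiplied.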

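The output probability $b_j(o_t)$ of being in state $j$ at time $t$ is given by (Young et al. 2009)

$$b_j(o_t) = \sum_{k=1}^{K_j} c_{jk}\, N(o_t; \mu_{jk}, \vartheta_{jk})$$

where $K_j$ is the number of mixture components in state $j$, $c_{jk}$ is the weight of the $k$-th component and $N(o; \mu, \vartheta)$ is a multivariate Gaussian defined by (Young et al. 2009)

$$N(o; \mu, \vartheta) = \frac{1}{\sqrt{(2\pi)^{n} |\vartheta|}} \exp\left(-\frac{1}{2}(o - \mu)^{\top} \vartheta^{-1} (o - \mu)\right)$$

where $n$ is the dimensionality of $o$, $\vartheta$ is the covariance matrix and $\mu$ is the mean vector.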
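As an illustration of the Gaussian mixture emission density described above, the following sketch evaluates it for the diagonal-covariance case that is used later in the experiments. The function names are hypothetical, and the log-sum-exp step is a standard numerical choice rather than something stated in the paper.

```python
import math

def log_gaussian_diag(o, mean, var):
    """Log density of a multivariate Gaussian with diagonal covariance."""
    n = len(o)
    log_det = sum(math.log(v) for v in var)
    maha = sum((x - m) ** 2 / v for x, m, v in zip(o, mean, var))
    return -0.5 * (n * math.log(2.0 * math.pi) + log_det + maha)

def log_emission(o, weights, means, variances):
    """Log of the weighted sum of component densities (log-sum-exp)."""
    terms = [math.log(c) + log_gaussian_diag(o, m, v)
             for c, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))
```

For instance, `log_emission([0.0], [0.5, 0.5], [[0.0], [0.0]], [[1.0], [1.0]])` evaluates a two-component mixture of identical standard normals, which equals the log density of a single standard normal at zero.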

### Classical feature extraction techniques

The most common feature extraction techniques for speech recognition systems, such as MFCC and LPCC, employ cepstral analysis to extract the feature coefficients from the acoustic signal. The MFCC technique consists of calculating the feature vectors from the frequency spectrum of each frame of windowed speech. It is based on a scale of the human ear known as the Mel scale.

The MFCC coefficients are calculated by applying a cosine transform to the real logarithm of short-term energy spectrum which has been expressed on a Mel-frequency scale.
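The MFCC computation described above can be sketched for a single frame as follows. This is a simplified illustration under stated assumptions: the Mel mapping 2595·log10(1 + f/700), the triangular filter shape, the 24 filters and the 512-point FFT are common choices that the text does not specify, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters with centres equally spaced on the Mel scale."""
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mfcc_frame(frame, fs, n_filters=24, n_ceps=12, n_fft=512):
    """Cosine transform of the log Mel-scale energy spectrum of one frame."""
    windowed = frame * np.hamming(frame.size)
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    log_energy = np.log(mel_filterbank(n_filters, n_fft, fs) @ power + 1e-12)
    k = np.arange(1, n_ceps + 1)[:, None]      # cepstral index, c_0 dropped
    m = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / n_filters)
    return basis @ log_energy
```

The small constant added before the logarithm only guards against taking the log of zero for silent frames.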

In the linear prediction model, $p$, $G$ and $a_k$ are respectively the number of poles, the filter gain and the pole parameters, which are called the Linear Prediction Coefficients. The linear prediction coefficients are evaluated using the autocorrelation method.
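One common way to carry out the autocorrelation method is the Levinson-Durbin recursion, sketched below. The paper does not say which solver was used, so this choice is an assumption, as is the sign convention: the coefficients returned here predict a sample as the weighted sum of the previous `p` samples.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of a windowed frame."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, p):
    """Return (a, err): predictor coefficients a_1..a_p and the final
    prediction error energy (the squared gain under this convention)."""
    a = [0.0] * (p + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, p + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        k_i = acc / err          # reflection coefficient of order i
        new_a = a[:]
        new_a[i] = k_i
        for k in range(1, i):
            new_a[k] = a[k] - k_i * a[i - k]
        a = new_a
        err *= 1.0 - k_i * k_i
    return a[1:], err
```

With the autocorrelation sequence `[1.0, 0.5, 0.25]` of a first-order process, `levinson_durbin(r, 2)` gives a first coefficient of 0.5 and a second coefficient of zero, as expected.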

## The proposed feature extraction based on an auditory filter model

The proposed speech feature extraction method for ASR is based on an auditory filter model. This model simulates the outer/middle ear filtering and the spectral behaviour of the cochlea.

### Auditory filter modelling

Auditory filter modelling provides a mathematical model that simulates the basic perceptual and psychophysical aspects of human hearing (Lyon et al. 2010). The model used here simulates the outer/middle ear filtering with a second-order low-pass filter and the spectral behaviour of the cochlea with the gammachirp auditory filterbank.

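The outer and middle ear filtering is modelled by a second-order low-pass filter with a resonance frequency ($f_r = \omega_0 / 2\pi$) equal to 4 kHz (Martens and Van Immerseel 1990; Van Immerseel and Martens 1992).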
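To give a feel for such a filter, the sketch below evaluates the gain of a generic second-order low-pass section with a 4 kHz resonance frequency. Only the filter order and the resonance frequency come from the text; the transfer function form and the quality factor `q` are assumptions made for illustration.

```python
import math

def lowpass2_gain(f, f_r=4000.0, q=1.0):
    """Gain of H(s) = w0^2 / (s^2 + (w0/q) s + w0^2) at s = j*2*pi*f,
    with w0 = 2*pi*f_r. The quality factor q is a hypothetical value."""
    w0 = 2.0 * math.pi * f_r
    s = complex(0.0, 2.0 * math.pi * f)
    return abs(w0 * w0 / (s * s + (w0 / q) * s + w0 * w0))
```

The gain is one at low frequencies, equals `q` at the resonance frequency, and falls by roughly 40 dB per decade above it.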

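The gammachirp filter is defined in the time domain by its impulse response (Irino and Patterson 1997, 2006)

$$g_c(t) = a\, t^{n-1} \exp\left(-2\pi b\, ERB(f_0)\, t\right) \cos\left(2\pi f_0 t + c \ln t + \varphi\right)$$

where time $t > 0$; $a$, $f_0$, $\varphi$ and $c$ are the amplitude, the asymptotic frequency, the initial phase and the chirp rate respectively; $b$ and $n$ are the two parameters which define the gamma distribution envelope; and "ln" denotes the natural logarithm.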
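A minimal sketch of the gammachirp impulse response, written from the parameter definitions above: a gamma-shaped envelope multiplied by a cosine carrier whose phase contains the chirp term c·ln t. The defaults n = 4, a = 1, b = 1.019 and c = 2 follow the parameter table in the experimental setup; the zero initial phase, the ERB expression of Glasberg and Moore (1990) and the function names are assumptions of this example.

```python
import math

def erb(f_hz):
    """Equivalent rectangular bandwidth in Hz (Glasberg and Moore 1990)."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammachirp(t, f0, a=1.0, b=1.019, c=2.0, n=4, phi=0.0):
    """Gammachirp impulse response at time t (seconds); zero for t <= 0."""
    if t <= 0.0:
        return 0.0
    envelope = a * t ** (n - 1) * math.exp(-2.0 * math.pi * b * erb(f0) * t)
    carrier = math.cos(2.0 * math.pi * f0 * t + c * math.log(t) + phi)
    return envelope * carrier

def sample_gammachirp(f0, fs=16000.0, duration=0.025):
    """Sampled impulse response; fs and duration are illustrative."""
    return [gammachirp(i / fs, f0) for i in range(int(fs * duration))]
```

Setting `c=0.0` removes the chirp term and reduces the function to a gammatone impulse response.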

$ERB(f_0)$ is the equivalent rectangular bandwidth (ERB) of the Gammachirp auditory filter centred on $f_0$ (Irino and Patterson 2006). The value of the ERB, in Hz, is given by the following equation (Glasberg and Moore 1990; Moore 2012; Wang and Brown 2006)

$$ERB(f) = 24.7\left(4.37\,\frac{f}{1000} + 1\right)$$

where $f$ is the frequency in Hz, so that $ERB(f) \approx 24.7 + 0.108 f$.

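The frequency scale expressed in units of the ERB is called the ERBs number, $ERBrate(f)$, and can be expressed by (Glasberg and Moore 1990; Moore 2012; Wang and Brown 2006)

$$ERBrate(f) = 21.4 \log_{10}\left(4.37\,\frac{f}{1000} + 1\right)$$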
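The centre frequencies of a 34-filter bank equally spaced on the ERB-rate scale between 50 Hz and 8 kHz, as described in the introduction, could be computed along the following lines. The ERB and ERB-rate expressions are those of Glasberg and Moore (1990); treating both end frequencies as centre frequencies is an assumption of this sketch, and the function names are hypothetical.

```python
import math

def erb(f_hz):
    """Equivalent rectangular bandwidth in Hz."""
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def erb_rate(f_hz):
    """Number of ERBs below frequency f_hz."""
    return 21.4 * math.log10(4.37 * f_hz / 1000.0 + 1.0)

def erb_rate_to_hz(e):
    """Inverse of erb_rate."""
    return (10.0 ** (e / 21.4) - 1.0) * 1000.0 / 4.37

def centre_frequencies(n_filters=34, f_low=50.0, f_high=8000.0):
    """Centre frequencies equally spaced on the ERB-rate scale."""
    lo, hi = erb_rate(f_low), erb_rate(f_high)
    step = (hi - lo) / (n_filters - 1)
    return [erb_rate_to_hz(lo + i * step) for i in range(n_filters)]
```

Because the spacing is uniform in ERB-rate, the filters are densely packed at low frequencies and progressively wider apart at high frequencies.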

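The magnitude of the Fourier transform of the gammachirp filter is (Irino and Patterson 1997)

$$|G_c(f)| = \frac{a\,|\Gamma(n + jc)|\, e^{c\theta}}{\left(2\pi\sqrt{(b\,ERB(f_0))^2 + (f - f_0)^2}\right)^{n}}$$

where $\theta = \arctan\left(\frac{f - f_0}{b\,ERB(f_0)}\right)$ and $\Gamma(n + jc)$ is the complex gamma distribution.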
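The role of θ can be illustrated numerically. In the gammachirp magnitude spectrum given by Irino and Patterson (1997), θ enters through a factor exp(cθ), which makes the response asymmetric around the centre frequency whenever the chirp rate c is non-zero. The sketch below evaluates the spectrum shape up to a constant factor; dropping the constant terms, the default parameter values and the function names are choices made for this example.

```python
import math

def erb(f_hz):
    return 24.7 * (4.37 * f_hz / 1000.0 + 1.0)

def gammachirp_shape(f, f0, b=1.019, c=2.0, n=4):
    """Magnitude response up to a constant factor:
    exp(c * theta) / ((b*ERB(f0))^2 + (f - f0)^2)^(n/2)."""
    bw = b * erb(f0)
    theta = math.atan((f - f0) / bw)
    return math.exp(c * theta) / (bw * bw + (f - f0) ** 2) ** (n / 2.0)

def asymmetry_db(f0, offset_hz, c=2.0):
    """Level difference in dB between f0 + offset and f0 - offset."""
    upper = gammachirp_shape(f0 + offset_hz, f0, c=c)
    lower = gammachirp_shape(f0 - offset_hz, f0, c=c)
    return 20.0 * math.log10(upper / lower)
```

With `c=0.0` the asymmetry vanishes and the shape is that of a symmetric gammatone filter; the sign of `c` decides which side of the centre frequency is favoured.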

### The proposed feature extraction method

## Experimental results

This section evaluates the robustness of the proposed feature extraction method under various types of noisy environments.

### Databases and experimental setup

The speech recognition system used in our experiments is based on Hidden Markov Models. It employs HTK 3.4.1 (Young et al. 2009) for the recognition task. HTK 3.4.1 is a portable toolkit for the construction and manipulation of HMM-GM.

The HMM topology used in our experiments is a five-state left-to-right model in which each observation probability density is a mixture of four Gaussians with diagonal covariance matrices.

**Used Gammachirp parameters**

| Parameter | Value |
|---|---|
| n | 4 |
| a | 1 |
| b | 1.019 |
| c | 2 |
| | 0 |

### Results and discussion

For the baseline experiments, 12 coefficients were calculated for each technique from the speech signal using a 25 ms Hamming analysis window shifted in 10 ms steps.
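The framing step can be sketched as follows. The 25 ms window and 10 ms shift come from the text; the 16 kHz sampling rate used in the usage note and the function name are assumptions, and trailing samples that do not fill a whole frame are simply dropped here.

```python
import numpy as np

def frame_signal(x, fs, win_ms=25.0, hop_ms=10.0):
    """Split x into overlapping frames and apply a Hamming window."""
    win = int(round(fs * win_ms / 1000.0))
    hop = int(round(fs * hop_ms / 1000.0))
    if x.size < win:
        return np.empty((0, win))
    n_frames = 1 + (x.size - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx] * np.hamming(win)
```

At an assumed sampling rate of 16 kHz, each frame holds 400 samples and consecutive frames start 160 samples apart, so one second of signal yields 98 full frames.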

The recognition performance of our feature extraction method has been compared to that of classic techniques, namely PLP, LPCC, LPC and MFCC. The feature coefficients of each technique are combined with the energy (*E*) and the first-order (∆) and second-order (*A*) differential coefficients (12 coefficients + *E* + ∆ + *A*).
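Differential coefficients of this kind are commonly obtained with a linear regression over neighbouring frames, as in the following sketch. The regression width of two frames, the edge padding and the reading of the final vector as 13 static values (12 coefficients plus energy) followed by their first-order and second-order differences are assumptions, since the text does not give these details.

```python
import numpy as np

def deltas(feats, width=2):
    """Regression-based differential coefficients over +/- width frames.
    feats has shape (n_frames, n_coeffs); edges repeat the end frames."""
    n = feats.shape[0]
    padded = np.pad(feats, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(th * th for th in range(1, width + 1))
    out = np.zeros_like(feats, dtype=float)
    for th in range(1, width + 1):
        ahead = padded[width + th: width + th + n]
        behind = padded[width - th: width - th + n]
        out += th * (ahead - behind)
    return out / denom

def add_dynamic(static):
    """Append first-order and second-order differential coefficients."""
    first = deltas(static)
    second = deltas(first)
    return np.hstack([static, first, second])
```

Under these assumptions, a matrix of 13 static features per frame becomes a matrix of 39 features per frame.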

**Recognition rate (%) obtained by proposed and standard methods with suburban train noise**

Recognition rate with HMM-4-GM

| Noise | SNR level | PLPaGc | PLP | LPCC | LPC | MFCC |
|---|---|---|---|---|---|---|
| Suburban train noise | 0 dB | 38.55 | 27.77 | 21.79 | 11.86 | 26.95 |
| | 5 dB | 65.59 | 50.16 | 40.48 | 13.62 | 49.42 |
| | 10 dB | 84.71 | 72.74 | 60.96 | 18.47 | 71.66 |
| | 15 dB | 92.74 | 85.82 | 77.90 | 28.96 | 86.30 |
| | 20 dB | 95.77 | 91.72 | 87.06 | 41.96 | 92.60 |
| | Average | 75.47 | 65.64 | 57.64 | 22.97 | 65.39 |

**Recognition rate (%) obtained by proposed and standard methods with exhibition hall noise**

Recognition rate with HMM-4-GM

| Noise | SNR level | PLPaGc | PLP | LPCC | LPC | MFCC |
|---|---|---|---|---|---|---|
| Exhibition hall noise | 0 dB | 37.53 | 26.67 | 18.33 | 8.31 | 26.04 |
| | 5 dB | 61.36 | 48.31 | 39.06 | 14.67 | 47.18 |
| | 10 dB | 81.73 | 69.30 | 60.54 | 20.65 | 68.74 |
| | 15 dB | 90.58 | 84.17 | 77.99 | 29.87 | 84.09 |
| | 20 dB | 95.74 | 91.40 | 86.92 | 40.00 | 92.14 |
| | Average | 73.39 | 63.97 | 56.57 | 22.70 | 63.64 |

**Recognition rate (%) obtained by proposed and standard methods with street noise**

Recognition rate with HMM-4-GM

| Noise | SNR level | PLPaGc | PLP | LPCC | LPC | MFCC |
|---|---|---|---|---|---|---|
| Street noise | 0 dB | 39.86 | 32.03 | 25.13 | 10.52 | 30.64 |
| | 5 dB | 65.90 | 51.60 | 41.73 | 12.65 | 50.52 |
| | 10 dB | 84.26 | 72.99 | 60.51 | 16.88 | 73.13 |
| | 15 dB | 92.84 | 85.93 | 76.79 | 26.35 | 86.33 |
| | 20 dB | 96.00 | 91.63 | 87.09 | 38.04 | 92.31 |
| | Average | 75.70 | 66.84 | 58.25 | 20.89 | 66.59 |

**Recognition rate (%) obtained by proposed and standard methods with car noise**

Recognition rate with HMM-4-GM

| Noise | SNR level | PLPaGc | PLP | LPCC | LPC | MFCC |
|---|---|---|---|---|---|---|
| Car noise | 0 dB | 45.96 | 28.51 | 23.15 | 10.13 | 29.19 |
| | 5 dB | 70.81 | 56.37 | 46.55 | 13.50 | 56.14 |
| | 10 dB | 88.94 | 80.57 | 70.87 | 20.65 | 81.08 |
| | 15 dB | 94.84 | 91.55 | 86.07 | 31.74 | 92.23 |
| | 20 dB | 96.74 | 94.89 | 91.60 | 43.21 | 95.63 |
| | Average | 79.46 | 70.38 | 63.65 | 23.85 | 70.85 |

As illustrated in the tables, the PLPaGc feature outperforms the four classic features in all noise conditions. For example, in the case of suburban train noise, the recognition rate averaged over all SNR levels is 75.47% for the PLPaGc feature, against 65.64%, 57.64%, 22.97% and 65.39% for the PLP, LPCC, LPC and MFCC features respectively. It can also be observed that the recognition rates of all features increase as the noise level decreases relative to the signal level (i.e., as the SNR increases from 0 dB to 20 dB).
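The averages quoted for suburban train noise can be reproduced directly from the per-SNR recognition rates in the corresponding table. The short sketch below does this and also computes the margin of PLPaGc over the best classic feature; the variable and function names are illustrative.

```python
# Recognition rates (%) at 0, 5, 10, 15 and 20 dB SNR, suburban train noise.
RATES = {
    "PLPaGc": [38.55, 65.59, 84.71, 92.74, 95.77],
    "PLP": [27.77, 50.16, 72.74, 85.82, 91.72],
    "LPCC": [21.79, 40.48, 60.96, 77.90, 87.06],
    "LPC": [11.86, 13.62, 18.47, 28.96, 41.96],
    "MFCC": [26.95, 49.42, 71.66, 86.30, 92.60],
}

def mean(values):
    return sum(values) / len(values)

AVERAGES = {name: mean(values) for name, values in RATES.items()}

# Best average among the four classic features, and the PLPaGc margin.
BEST_CLASSIC = max((n for n in AVERAGES if n != "PLPaGc"), key=AVERAGES.get)
MARGIN = AVERAGES["PLPaGc"] - AVERAGES[BEST_CLASSIC]

for name, avg in AVERAGES.items():
    print(f"{name}: {avg:.2f}")
```

For this noise type the best classic feature on average is PLP, and PLPaGc exceeds it by a little under ten percentage points.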

## Conclusion

A new feature extraction method for noisy speech recognition, based on auditory filter modelling, was presented in this paper. The proposed method was motivated by research on modelling of the human peripheral auditory system. The auditory model simulates the outer/middle ear filtering with a second-order low-pass filter and the spectral behaviour of the cochlea with the gammachirp auditory filterbank, whose centre frequencies are chosen according to the ERB-rate scale. The robustness of the proposed PLPaGc feature was evaluated in terms of speech recognition rate in real-world noisy environments. The experimental results show that the PLPaGc feature gives better recognition rates than the four classical features PLP, LPCC, LPC and MFCC.

## Declarations

## Authors’ Affiliations

## References

- Atal BS: Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. *J Acoust Soc Am* 1974, 55(6):1304-12. 10.1121/1.1914702
- Atal BS, Hanauer SL: Speech analysis and synthesis by linear prediction of the speech wave. *J Acoust Soc Am* 1971, 50:637-55. 10.1121/1.1912679
- Beigi H: *Fundamentals of Speaker Recognition*. Springer, New York; 2011.
- Bleeck S, Ives T, Patterson RD: Aim-mat: the auditory image model in MATLAB. *Acta Acustica United Ac* 2004, 90(4):781-787.
- Davis SB, Mermelstein P: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. *IEEE Trans Acoust, Speech, Signal Processing* 1980, 28(4):357-66. 10.1109/TASSP.1980.1163420
- Furui S: History and Development of Speech Recognition. In *Speech Technology*. Edited by: Chen F, Jokinen K. USA: Springer; 2010:1-18.
- Garofolo J, Lamel L, Fisher W, Fiscus J, Pallett D, Dahlgren N: *DARPA, TIMIT Acoustic-Phonetic Continuous Speech Corpus CD-ROM. National Institute of Standards and Technology*. 1990.
- Glasberg BR, Moore BCJ: Derivation of auditory filter shapes from notched-noise data. *Hear Res* 1990, 47(1):103-38.
- Hermansky H: Perceptual linear predictive (PLP) analysis of speech. *J Acoust Soc Am* 1990, 87(4):1738-52. 10.1121/1.399423
- Hirsch H, Pearce D: *The Aurora Experimental Framework for the Performance Evaluation of Speech Recognition Systems Under Noisy Conditions*. ISCA ITRW ASR2000, Paris, France; 2000.
- Irino T, Patterson RD: A time-domain, level-dependent auditory filter: the Gammachirp. *J Acoust Soc Am* 1997, 101(1):412-419. 10.1121/1.417975
- Irino T, Patterson RD: A dynamic compressive gammachirp auditory filterbank. *IEEE Trans Audio Speech Lang Processing* 2006, 14(6):2222-32. Author manuscript, available in PMC 2009.
- Lyon RF, Katsiamis AG, Drakakis EM: History and future of auditory filter models. *Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS)* 2010, 3809-12.
- Martens JP, Van Immerseel L: An auditory model based on the analysis of envelope patterns. *Int Conf Acoust Speech Signal Process* 1990, 1:401-4. ICASSP-90.
- Meddis R, Lopez-Poveda EA, Fay RR, Popper AN: *Computational Models of the Auditory System. Vol. 35*. Springer Handbook of Auditory Research, Springer, New York; 2010.
- Moore BCJ: *An Introduction to the Psychology of Hearing*. 6th edition. Brill; 2012.
- Nadeu C, Macho D, Hernando J: Time and frequency filtering of filter-bank energies for robust HMM speech recognition. *Speech Comm* 2001, 34(1):93-114.
- Patterson RD, Unoki M, Irino T: Extending the domain of centre frequencies for the compressive gammachirp auditory filter. *J Acoust Soc Am* 2003, 114(3):1529-42. 10.1121/1.1600720
- Rabiner L, Juang BH: *Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series*. PTR Prentice Hall, New Jersey; 1993.
- Unoki M, Irino T, Glasberg B, Moore BCJ, Patterson RD: Comparison of the roex and gammachirp filters as representations of the auditory filter. *J Acoust Soc Am* 2006, 120(3):1474-92. Available in PMC 2010. 10.1121/1.2228539
- Van Immerseel LM, Martens JP: Pitch and voiced/unvoiced determination with an auditory model. *J Acoust Soc Am* 1992, 91(6):3511-3526. 10.1121/1.402840
- Wang DL, Brown GJ: *Computational Auditory Scene Analysis: Principles, Algorithms, and Applications*. IEEE Press/Wiley-Interscience; 2006.
- Young S, Evermann G, Gales M, Hain T, Kershaw D, Liu X, Moore G, Odell J, Ollason D, Povey D, Valtchev V, Woodland P: *The HTK Book (for HTK Version 3.4.1)*. Cambridge University Engineering Department, United Kingdom; 2009.
- Zouhir Y, Ouni K: *Speech Signals Parameterization Based on Auditory Filter Modelling. Advances in Nonlinear Speech Processing LNAI 7911, NOLISP 2013, Mons, Belgium*. Edited by: Drugman T, Dutoit T. Berlin Heidelberg: Springer; 2013:60-66. 978-3-642-38846-0

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.