- Open Access
DWT features performance analysis for automatic speech recognition of Urdu
© Ali et al.; licensee Springer. 2014
Received: 7 January 2014
Accepted: 10 April 2014
Published: 27 April 2014
This paper presents work on Automatic Speech Recognition (ASR) of the Urdu language, based on a comparative analysis of Discrete Wavelet Transform (DWT) based features and Mel Frequency Cepstral Coefficients (MFCC). The features have been extracted for one hundred isolated words of Urdu, each uttered by ten different speakers. The words have been selected from the most frequently used words of Urdu. A variety of ages and dialects has been covered by using a balanced corpus approach. After feature extraction, classification has been performed using Linear Discriminant Analysis (LDA). The confusion matrix obtained for the DWT features has then been compared with the one obtained for MFCC based speech recognition. The framework has been trained and tested on speech data recorded under controlled environments. The experimental results are useful in determining the optimum features for the speech recognition task.
Typical parameters for ASR complexity
Parameter: Range
Speaking mode: Isolated words to continuous speech
Speaking style: Read speech to spontaneous speech
Enrollment: Speaker-dependent to speaker-independent
Vocabulary: Small (20 words) to large (20,000 words)
Language model: Finite-state to context-sensitive
Perplexity: Small (10) to large (100)
SNR: High (30 dB) to low (10 dB)
Transducer: Voice-cancelling microphone to telephone
Vowels in English
Besides sophisticated language resources for these languages, one of the optimization tasks in realizing a more robust ASR system has been the extraction of features that are robust against noise. Although Mel Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coding (LPC) based features (Hachkar et al. 2011; Han et al. 2006) have been widely used for speech recognition, the extraction of these features has generally been based on the Short Time Fourier Transform (STFT). STFT-based feature extraction carries the inherent assumption that the audio signal remains stationary throughout the analysis window. In practice, speech is non-stationary, so this assumption holds only approximately. To keep the signal approximately stationary, a short window may be used, which gives high time resolution but poor frequency resolution. Conversely, increasing the window duration improves the frequency resolution but degrades the time resolution. Because the window size is fixed, the resolution of the STFT time-frequency representation is also fixed. Research has therefore been directed towards the use of Wavelet Transforms for feature extraction (Tan et al. 1996; Chang et al. 1998). This has motivated the development of a speech recognition framework for Urdu based on Discrete Wavelet Transform features.
The lack of resources has been a practical bottleneck for research on Urdu language and speech processing. As noted by Hussain (2004) and Raza et al. (2009), Urdu is mostly written without diacritics, as this is the common practice among native users. This complicates the mapping of letters to sounds, because the diacritics represent the vowels in Urdu.
Similarly, for research on Urdu speech recognition, the lack of a standard set of phonemes, a standard speech corpus and language models has been the major challenge.
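The fixed time-frequency resolution of the STFT discussed in this section can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a 16 kHz sampling rate and taking the frequency resolution to be the bin spacing fs/N of an N-sample window (window shape effects are ignored). The function name and the window durations are illustrative choices, not values from this work.

```python
def stft_resolution(fs_hz, window_ms):
    """Return (samples per window, time resolution in ms, bin spacing in Hz).

    The bin spacing fs/N is used as a simple proxy for frequency resolution.
    """
    n_samples = round(fs_hz * window_ms / 1000.0)
    if n_samples < 1:
        raise ValueError("window is shorter than one sample")
    return n_samples, window_ms, fs_hz / n_samples


if __name__ == "__main__":
    # Illustrative window lengths (assumed, not taken from the paper).
    for window_ms in (10, 25, 100):
        n, dt, df = stft_resolution(16000, window_ms)
        print(f"window {dt:>3} ms -> {n:>4} samples, bin spacing {df:.1f} Hz")
```

A shorter window localizes events better in time but widens the bin spacing, while a longer window does the opposite. The wavelet transform avoids committing to a single window length.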
This paper presents work on the ASR of Urdu isolated words and investigates the performance of DWT features by comparing it with that of MFCCs. Given a carefully selected corpus and experimental conditions, this work provides a stronger baseline for future research on Urdu ASR. The remainder of this paper is organized as follows. Section ‘Related work’ gives a brief overview of the research on Urdu ASR resources and frameworks. Section ‘Overall block diagram’ briefly presents a typical speech recognition framework. Section ‘Feature extraction by discrete wavelet transform’ discusses the extraction of DWT features in detail. The classification achieved via LDA is presented in Section ‘Classification’. The experimental setup and the data used are discussed in Section ‘Experiment’, while the experimental results are compared in Section ‘Results and comparisons’. Finally, Section ‘Conclusion and future work’ concludes the paper.
Only recently has speech processing of Urdu become a topic of research. This includes efforts towards corpus development as well as towards the development of Urdu ASR. Unlike well-resourced languages, sophisticated categorization and resources are unavailable for Urdu; however, a basic introduction can be found in (Hussain 2004; Intermediate Urdu 2012). Raza et al. (2009; 2010) have made a significant contribution to the development of Urdu ASR. Firstly, in (Raza et al. 2009), a speech corpus has been developed for Urdu, which is context based and phonetically rich, covering all the 62 phonemes. The goal was a corpus that is phonetically rich but not necessarily phonetically balanced. Thus phonetic cover has been achieved but phonetic balance has not been guaranteed. Phonetic cover means that the corpus covers all the phonemes of the language, while phonetic balance ensures that these phonemes occur in the corpus in the same ratio as in the language itself (Pineda et al. 2004). Then, in (Raza et al. 2010), they have developed ASR for spontaneous speech mixed with read speech of Urdu. The CMU Sphinx Toolkit (CMU Sphinx 2012) has been used for training and testing. The system was trained with 87 minutes of spontaneous speech data and 70 minutes of read speech data, while the testing was performed using 22 minutes of spontaneous speech data that did not overlap with the training data. The resulting Word Error Rate (WER) varies with the ratio of spontaneous to read speech in the training data. For a 0:100 ratio, the WER is 58.4, but it decreases significantly as the amount of spontaneous data increases, reaching a value of 18.8 for a 1:1 ratio of spontaneous to read speech data. However, the results are based on single speaker speech recognition, and extensive enhancements are required to transform the system into a multi-speaker system.
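For readers unfamiliar with the metric, the Word Error Rate quoted above is conventionally the word-level edit distance (substitutions, deletions and insertions) between the recognized and the reference transcript, divided by the number of reference words. The following is a minimal sketch of that standard definition. The function name and the example strings are hypothetical, and the cited works may differ in scoring details.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical transcripts: one substitution and one deletion in four words.
    print(word_error_rate("a b c d", "a x c"))
```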
Sarfraz et al. (2010a; 2010b) have also used the CMU Sphinx Toolkit for large vocabulary speech recognition of Urdu. The goal was to cover everyday speech; however, the variety of Urdu accents has not been covered, as the target speech is mostly limited to the suburban accent spoken in offices and homes. Furthermore, the Word Error Rates are too high for multiple speaker sets. Irtza and Hussain (2012) have presented the possibility of improving the word error rate by monitoring its improvement as the training data for particular phonemes is increased. The analysis is, once again, limited to a single speaker speech recognition system. Ali et al. (2012) have presented the development of a medium vocabulary corpus for isolated words of Urdu. The corpus comprises 250 isolated words of Urdu, uttered by 50 speakers, with a balanced contribution from native and non-native, male and female speakers aged from 20 to 50 years. The corpus also covers various accents of Urdu, as speech data from speakers of a variety of origins has been included. In (Akram and Arif 2004), Mel-Frequency Cepstral Coefficients (MFCCs) have been extracted, i.e. 39 features for a single frame of 15 milliseconds, comprising 12 MFCCs, 12 MFCC delta features, 12 MFCC delta-delta coefficients, one 0th order cepstral coefficient and two log energy coefficients. The overall recognition rate is limited to 54 percent. The paper lacks information on the toolkit used for the development of the framework. Ashraf et al. (2010) have used the popular Hidden Markov Models (Rabiner 1989) for ASR of small vocabulary isolated Urdu words. The recognition performance has been reported to be very good, with a mean Word Error Rate of 10.66%. Amongst the three models, namely the context-free grammar, the n-gram grammar and the wordlist grammar, the simplest one, i.e. the wordlist grammar model, has been used.
This model treats each word as a single phoneme instead of breaking it into sub-units. The review by Ghai and Singh (2012) mentions that Urdu has 28 consonants and 10 vowels, and also provides a detailed survey of the work done in the area of Urdu ASR. The research mentioned above has helped to establish a baseline for future work on Urdu ASR. However, ASR performance for DWT based features has not yet been explored for Urdu. This work presents the use of DWT based features for Urdu ASR and compares the recognition performance of the framework using DWT features with that of the one using MFCCs. The same dataset is used for training and testing both frameworks, and both use Linear Discriminant Analysis for classification.
Overall block diagram
After noise removal and pre-emphasis, the input signal is passed to the feature extraction block, which computes the DWT features.
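The pre-emphasis step mentioned above is commonly implemented as a first-order high-pass filter, y[n] = x[n] - a·x[n-1]. The following is a minimal sketch of that common form. The coefficient a = 0.97 is a conventional choice assumed here for illustration, and the function name is hypothetical. The exact filter used in this work is not specified in the text.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1]; the first sample is passed through.

    alpha = 0.97 is an assumed, conventional value, not one taken from the paper.
    """
    if not signal:
        return []
    emphasized = [signal[0]]
    for n in range(1, len(signal)):
        emphasized.append(signal[n] - alpha * signal[n - 1])
    return emphasized


if __name__ == "__main__":
    # A constant (zero-frequency) input is strongly attenuated after the first sample.
    print(pre_emphasis([1.0, 1.0, 1.0, 1.0]))
```

The filter boosts the higher frequencies relative to the lower ones, which compensates for the spectral tilt of voiced speech before features are computed.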
Feature extraction by discrete wavelet transform
Discrete wavelet transform
For isolated word recognition, a primary assumption in this work is that the phoneme information is retained after splitting a single isolated word. The DWT decomposition of a given word separates the higher frequency part of the spectrum from the lower frequency part. A sampling frequency of 16 kHz has been used, so the signal bandwidth is 0-8 kHz. A first level decomposition provides the frequency contents of 0-4 kHz and 4-8 kHz. A second level decomposition provides the frequency contents of 0-2 kHz, 2-4 kHz, and 4-8 kHz. Similarly, a third level decomposition provides the frequency contents of 0-1 kHz, 1-2 kHz, 2-4 kHz, and 4-8 kHz. Once the speech data for a particular isolated word has been distributed over the different frequency bands, the energy of the signal component in each band is determined. The energy of each frequency band is then normalized by the number of samples in that band. This is necessary because the number of samples is not the same in every frequency band (Chang et al. 1998). The average energies of the different bands are the features on which the classification is based. For each word, a total of 32 features have been obtained. These features provide the energy in each band as well as information on the temporal variation of the energy in each band.
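The band splitting and normalized band energies described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: the Haar wavelet is assumed because the mother wavelet is not named here, and the split of each of the four bands into 8 time segments (4 × 8 = 32 features) is only one hypothetical way to arrive at 32 features that carry temporal information. All function names are hypothetical.

```python
import math


def band_edges(fs_hz, levels=3):
    """Frequency bands (Hz) after `levels` decompositions, from low to high."""
    nyquist = fs_hz / 2.0
    edges = [(0.0, nyquist / 2 ** levels)]
    for lev in range(levels, 0, -1):
        edges.append((nyquist / 2 ** lev, nyquist / 2 ** (lev - 1)))
    return edges


def haar_step(x):
    """One level of the Haar DWT (assumed wavelet): (approximation, detail)."""
    if len(x) % 2:
        x = x + [x[-1]]  # pad odd-length input by repeating the last sample
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail


def dwt_bands(x, levels=3):
    """Sub-band coefficients ordered low to high frequency: [A3, D3, D2, D1]."""
    bands = []
    approx = list(x)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        bands.insert(0, detail)
    bands.insert(0, approx)
    return bands


def band_energy_features(x, levels=3, segments=8):
    """Energy per sample in each time segment of each band (4 x 8 = 32 values)."""
    features = []
    for band in dwt_bands(x, levels):
        seg_len = max(1, len(band) // segments)
        for k in range(segments):
            end = (k + 1) * seg_len if k < segments - 1 else len(band)
            chunk = band[k * seg_len:end]
            # Normalize by the number of samples, since band lengths differ.
            features.append(sum(v * v for v in chunk) / len(chunk) if chunk else 0.0)
    return features


if __name__ == "__main__":
    fs = 16000
    print(band_edges(fs))
    # Synthetic 200 Hz tone, one second long; most energy falls in the 0-1 kHz band.
    tone = [math.sin(2 * math.pi * 200 * n / fs) for n in range(fs)]
    feats = band_energy_features(tone)
    print(len(feats), sum(feats[:8]) > sum(feats[8:]))
```

Dividing each energy by the number of coefficients keeps the bands comparable, since every further decomposition level halves the number of coefficients.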
A supervised classification technique has been used for the word recognition task. In this setting, every isolated word is a member of a pre-determined class. The classification has been achieved using Linear Discriminant Analysis (LDA) (Balakrishnama et al. 1999; Balakrishnama and Ganapathiraju 1998).
Linear discriminant analysis
Following this reasoning, a trade-off is made by giving up decisions on curved boundaries; however, memory requirements are reduced, as the linear discriminant function reduces the dimensionality of the covariance matrices from d-by-d to d-by-1. In addition, the computation time is considerably reduced.
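The linear decision boundary and its compact storage can be illustrated with a toy two-class Fisher discriminant. This is a minimal sketch under simplifying assumptions (two classes, two-dimensional features, equal priors, a pooled within-class scatter matrix), whereas the framework in this work handles many word classes and 32-dimensional features. The data points and function names are hypothetical.

```python
def class_mean(points):
    """Mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def fit_lda_2d(class0, class1):
    """Two-class Fisher LDA in 2-D: returns weight vector w and bias b.

    A point x is assigned to class 1 when w . x + b > 0 (equal priors assumed).
    """
    m0, m1 = class_mean(class0), class_mean(class1)

    # Pooled within-class scatter matrix (2 x 2).
    s = [[0.0, 0.0], [0.0, 0.0]]
    for points, m in ((class0, m0), (class1, m1)):
        for p in points:
            d = (p[0] - m[0], p[1] - m[1])
            for a in range(2):
                for c in range(2):
                    s[a][c] += d[a] * d[c]

    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if abs(det) < 1e-12:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]

    # w = Sw^-1 (m1 - m0); the boundary passes through the midpoint of the means.
    diff = (m1[0] - m0[0], m1[1] - m0[1])
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = ((m0[0] + m1[0]) / 2.0, (m0[1] + m1[1]) / 2.0)
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b


def predict(w, b, x):
    """Return the predicted class label (0 or 1) for a 2-D point x."""
    return 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0


if __name__ == "__main__":
    # Hypothetical feature vectors for two word classes.
    word_a = [(1.0, 2.0), (2.0, 1.0), (1.5, 1.5), (2.0, 2.5)]
    word_b = [(5.0, 6.0), (6.0, 5.0), (5.5, 6.5), (6.5, 5.5)]
    w, b = fit_lda_2d(word_a, word_b)
    print(predict(w, b, (1.0, 1.0)), predict(w, b, (6.0, 6.0)))
```

Once trained, the classifier needs only the d-element vector w and the scalar b at decision time, rather than a d-by-d covariance matrix per class, which is the memory saving noted above.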
Representation of speaker attributes
Results and comparisons
Comparison: a word-to-word case
Comparison of percentage error for DWT features and MFCCs - first ten words
Overall classification results comparison
Here, α100 is the percentage of words with 100% error, α66.67 is the percentage of words with 66.67% error, α33.33 is the percentage of words with 33.33% error, and α0 is the percentage of words with zero error. N_T is the total amount of test data used. This calculation gives an overall error of E = 60.896%, which is considerably higher than the E = 29.33% achieved using MFCCs, as is evident from Table 4.
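One plausible way to combine the quantities defined above is a weighted average, in which each error level is weighted by the share of words at that level. The sketch below assumes exactly that reading and does not model N_T explicitly, so it should not be taken as the precise formula used in this work. The function name and the example shares are hypothetical and are not the experimental values.

```python
def overall_error(alpha):
    """Weighted average error in percent.

    `alpha` maps a per-word error level (in %) to the share of words (in %)
    at that level; the shares are expected to sum to 100.
    """
    total_share = sum(alpha.values())
    if abs(total_share - 100.0) > 1e-6:
        raise ValueError("word shares must sum to 100%")
    return sum(level * share for level, share in alpha.items()) / 100.0


if __name__ == "__main__":
    # Hypothetical distribution of words over the four error levels.
    example = {100.0: 10.0, 66.67: 20.0, 33.33: 30.0, 0.0: 40.0}
    print(f"E = {overall_error(example):.3f}%")
```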
Conclusion and future work
In this work, ASR for a medium vocabulary of Urdu isolated words has been presented. The framework can be extended to large vocabulary applications. The ASR framework for isolated words of Urdu provides a good foundation for the further development of a continuous speech recognition framework that is robust against noisy environments. The experimental results for the overall percentage error rate show that the recognition performance of DWT based features has not been promising. On the other hand, MFCC based classification has shown relatively better results on the same dataset. The proposed system is based on limited training data, and the performance can be improved further by increasing the amount of training data. It is important to note that the results and figures presented in this work are for speech data recorded under a controlled environment. Thus, a more comprehensive future task is to enhance the system and perform the training and testing on more practical speech data recorded in noisy environments.
We are thankful to all the volunteers who participated in the corpus development by recording the speech data. We are also thankful to the anonymous reviewer whose comments helped improve the quality of this paper. Thanks to Mr. Hafeez Anwar, TU Vienna, for useful discussion and feedback.
- Ali H, Ahmad N, Yahya KM, Farooq O: A medium vocabulary Urdu isolated words balanced corpus for automatic speech recognition. In Proceedings of 4th International Conference on Electronic Computer Technology, ICECT. Kanyakumari, India, 6–8 April 2012; 2012:473-476.
- Ali H, Ahmad N, Zhou X, Ali M, Manjotho AA: Linear discriminant analysis based approach for automatic speech recognition of Urdu isolated words. In International Multitopic Conference (IMTIC’13). Jamshoro, Pakistan, 18-20 December 2013; 2013.
- Akram MU, Arif M: Design of an Urdu speech recognizer based upon acoustic phonetic modeling approach. In Proceedings of 8th International Multitopic Conference, INMIC 2004. Lahore, Pakistan, 24-26 December 2004; 2004:91–96.
- Ashraf J, Iqbal N, Khattak NS, Zaidi AM: Speaker independent Urdu speech recognition using HMM. In Proceedings of the 7th International Conference on Informatics and Systems (INFOS). Cairo; 2010:1–5.
- Balakrishnama S, Ganapathiraju A: Linear discriminant analysis; a brief tutorial. 1998. http://www.music.mcgill.ca. Accessed February 2012
- Balakrishnama S, Ganapathiraju A, Picone J: Linear discriminant analysis for signal processing problems. In Proceedings of IEEE Southeastcon. Lexington, KY, 25-28 March 1999; 1999:78-81.
- Center for Language Engineering 2012. http://www.cle.org.pk. Accessed February 2012
- Chang S, Kwon Y, Yang S-I: Speech feature extracted from adaptive wavelet for speech recognition. Electron Lett 1998, 34(23):2211-2213. 10.1049/el:19981486
- CMU Sphinx 2012. http://www.speech.cs.cmu.edu/. Accessed February 2012
- Criado C, Rabal H, Cap N, Holodiagrams A: Decision and classification problems using Mahalanobis statistical distance. In 2011 eighth international conference on Fuzzy Systems and Knowledge Discovery (FSKD), vol. 1. Shanghai, China, 26-28 July 2011; 2011:1012-10162.
- Farooq O, Datta S: Phoneme recognition using wavelet based features. Elsevier Inf Sci 2003, 150: 5-15. 10.1016/S0020-0255(02)00366-3
- Ghai W, Singh N: Analysis of automatic speech recognition systems for Indo-Aryan languages: Punjabi a case study. Int J Soft Comput Eng 2012, 2(1):379-385.
- Gowdy JN, Tufekci Z: Mel-scaled discrete wavelet coefficients for speech recognition. In 2000 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ‘00. Turkey, 5–9 June 2000; 2000:1351-13543.
- Hachkar Z, Mounir B, Farchi A, Abbadi JE: Comparison of MFCC and PLP parameterization in pattern recognition of Arabic alphabet speech. Can J Artif Intell, Mach Learn Pattern Recognit 2011, 2(3):56-60.
- Han W, Chan C-f, Choy C-s, Pun K-p: An efficient MFCC extraction method in speech recognition. In 2006 IEEE international symposium on circuits and systems. Island of Kos: IEEE; 2006:145-148.
- Hussain S: Letter-to-sound conversion for Urdu text-to-speech system. Workshop on computational approaches to Arabic script-based languages, COLING 2004 2004.
- Intermediate Urdu 2012. http://urdu.wustl.edu/urdu-script.php. Accessed February 19, 2012
- Irtza S, Hussain S: Error analysis of single speaker Urdu speech recognition system. In Conference on Language and Technology, CLT 2012. Lahore, Pakistan, 9-10 November 2012; 2012.
- Long CJ: Phoneme Discrimination using non-linear wavelets methods. PhD thesis, Loughborough University 1999.
- Long CJ, Datta S: Wavelet based feature extraction for phoneme recognition. In Proceedings of 4th international conference of spoken language processing. Philadelphia, USA; 1996:264-267.
- Long CJ, Datta S: Discriminant wavelet basis construction for speech recognition. In Proceedings of 5th international conference of spoken language processing. Sydney, Australia; 1998:1047-10493.
- Lukasia E: Wavelet packets based features selection for voiceless plosives classification. Proceedings of IEEE International Conference on Acoustic, Speech and Signal Processing, ICASSP ‘00 2000, 689-6922.
- Mallat S: A wavelet tour of signal processing. Academic Press, USA; 1999.
- Pineda LV, Gomez MM-y, Vaufreydaz D, Serignat J-f: Experiments on the construction of a phonetically balanced corpus from the Web. In CICLing. Seoul, Korea, 15-21 February 2004: Springer; 2004:416-419.
- Rabiner LR: A tutorial on hidden Markov models and selected applications in speech recognition. Proc IEEE 1989, 77(2):257-286. 10.1109/5.18626
- Raza AA, Hussain S, Sarfraz H, Ullah I, Sarfraz Z: Design and development of phonetically rich Urdu speech corpus. In 2009 oriental COCOSDA international conference on speech database and assessments. Urumqi, China, 10-12 August 2009; 2009.
- Raza AA, Hussain S, Sarfraz H, Ullah I, Sarfraz Z: An ASR system for spontaneous Urdu speech. In Oriental COCOSDA 2010 conference. Nepal, 24-25 November 2010; 2010:1-6.
- Sarfraz H, Hussain S, Bokhari R, Raza A, Ullah I, Sarfraz Z, Pervez S, Mustafa A, Javed I, Parveen R: Speech corpus development for a speaker independent spontaneous Urdu speech recognition system. Proceedings of the O-COCOSDA, Kathmandu, Nepal 2010a. O-COCOSDA.
- Sarfraz H, Hussain S, Bokhari R, Raza AA, Ullah I, Sarfraz Z, Pervez S, Mustafa A, Javed I, Parveen R: Large vocabulary continuous speech recognition for Urdu. In Proceedings of the 8th international conference on frontiers of information technology - FIT ‘10. Islamabad, Pakistan, 21-23 November 2010; 2010b:1-5.
- Shen C, Kim J, Wang L: Scalable large-margin Mahalanobis distance metric learning. IEEE Trans Neural Netw 2010, 21(9):1524-1530.
- Tan BT, Fu M, Spray A, Dermody P: The use of wavelet transforms in phoneme recognition. Fourth International Conference on Spoken Language, ICSLP 96 1996, 2431-24324.
- Tufekci Z, Gowdy JN: Feature extraction using discrete wavelet transform for speech recognition. In IEEE Southeastcon. USA, 9–9 April 2000; 2000:116-123.
- Varile G, Zue V, Cole R, Ward W: Survey of the state of the art in human language technology. Cambridge University Press, England; 1995.
- Wassner H, Chollet G: New cepstral representation using wavelet analysis and spectral transformation for robust speech recognition. In Fourth International Conference on Spoken Language, ICSLP 96. Philadelphia, USA; 1996:260-2631.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.