As discussed in Section 1, the two-class problem of speech/music discrimination has been addressed by researchers using different low-level features. However, a direct application of those schemes cannot handle the three-way classification problem, i.e. identification of speech, instrumental and song. The difficulty arises from the fact that, in the feature space, song has a substantial overlap with both speech and instrumental, as it is composed of both components. This observation has motivated us to adopt a hierarchical approach. At the first stage we classify the signal into speech and music, and in the subsequent stage we categorize music into instrumental and song. We rely on audio texture based on ZCR and STE (Ghosal et al. 2009) in the first stage and on MFCC based features in the second. The proposed audio texture provides an effective mechanism for summarizing the ZCR and STE values of all the frames in the audio signal. The computation of the features is elaborated in Sections 2.1 and 2.2. The classification scheme is described in Section 2.3.
2.1 Audio texture
The concept of texture in the domain of image processing is quite common. For an image, texture is formed by the repetition of fundamental image elements and it is evaluated by properties like coarseness, smoothness, randomness and regularity. In an intensity image, intensity variation over a neighbourhood gives rise to texture, and the co-occurrence of gray levels has evolved as a measure of it (Haralick and Shapiro 1991). The idea has been further extended in (Saha et al. 2004) where, instead of dealing with pixel intensities, co-occurrence of features at the sub-image level has been considered to obtain a better perception. We have adopted a similar concept and propose audio texture for characterizing an audio signal.
In general, a speech signal occupies a limited range of frequencies in comparison to a music signal. A speech signal is typically characterized by the presence of voiced and unvoiced zones, which differ in terms of frequency and energy. Moreover, silence is quite common in a speech signal, and such zones have almost zero energy. Thus, in a speech signal, the interleaved occurrence of voiced, unvoiced and silence zones gives rise to a pattern. In music, such behaviour is absent. This observation has led us to devise audio texture for speech/music classification. The zones are taken as the fundamental elements of the speech signal, and the repetition of those elements leads to audio texture. As the zones are distinguished by features like frequency and energy content, we consider ZCR and STE computed over the frames as an approximation of the zone features. The repetition pattern captured in the co-occurrence matrices of such features can act as a measure of audio texture.
In Section 1, it has been indicated that zero crossing rate (ZCR) and short time energy (STE) are two commonly used time domain, low level features which play a major role in speech/music discrimination. The texture of the audio signal is generated based on these. Considering audio data as a discrete signal, a zero crossing is said to have occurred whenever two successive samples have different signs. The rate of zero crossing provides an impression of the frequency content. The audio signal is divided into N frames {x_i(m) : 1 ≤ i ≤ N}. Then, for the ith frame, the zero crossing rate is computed as follows:

$$z_i = \frac{1}{2}\sum_{m=2}^{n} \left|\mathrm{sgn}(x_i(m)) - \mathrm{sgn}(x_i(m-1))\right| \qquad (1)$$

where n is the number of samples in the ith frame and

$$\mathrm{sgn}(x_i(m)) = \begin{cases} 1, & x_i(m) \ge 0 \\ -1, & x_i(m) < 0 \end{cases} \qquad (2)$$
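As a minimal illustration of equations (1) and (2), the following numpy sketch computes the frame-wise zero crossing counts; the non-overlapping framing and the frame length of 1024 samples are assumptions of this sketch, since the text does not fix them.

```python
import numpy as np

def frame_signal(x, frame_len=1024):
    """Split a 1-D signal into consecutive non-overlapping frames of frame_len samples."""
    x = np.asarray(x)
    n_frames = len(x) // frame_len
    return x[:n_frames * frame_len].reshape(n_frames, frame_len)

def zcr_per_frame(frames):
    """Zero crossing count per frame, following equations (1) and (2)."""
    signs = np.where(frames >= 0, 1, -1)                    # sgn() of equation (2)
    return 0.5 * np.abs(np.diff(signs, axis=1)).sum(axis=1)
```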
As the collection of frame level ZCR values is of very high dimension, the audio signal is represented by summarized information. The mean and standard deviation of {z_i : i = 1, 2, …, N} are taken as two features. Such a representation gives only an overall idea about the signal. To obtain a better representation of the signal characteristics we have utilized the concept of the co-occurrence matrix (Haralick and Shapiro 1991), which is widely used in image processing. In an image, the occurrence of different intensity values within a neighborhood reflects a pattern, and it is utilized to parameterize the appearance/texture of the image. The same concept is adopted here. For each frame, ZCR is computed using equation (1). Thus {z_i}, a sequence of ZCR values, is obtained for the signal. The occurrence of different ZCR values within a neighborhood reflects the pattern and characterizes the quasi-periodic behavior of the signal. Thus, a matrix C of dimension L × L (where L = max{z_i} + 1) is formed as follows:
- Initialize C[i][j] = 0 ∀ i, j ∈ {0, 1, …, L − 1}
- for i = 1 to N − d: C[z_i][z_{i+d}] = C[z_i][z_{i+d}] + 1
where d is the distance at which co-occurrence of the values is considered. Thus, the matrix C represents the distribution of pairwise occurrences of different ZCR values. In a speech signal, there will likely be substantial co-occurrence of low ZCR values denoting silence zones, and high-to-low (or low-to-high) transitions for non-silence to silence (or vice versa) switching. Such transitions also occur due to the interleaving of voiced and unvoiced speech. These will have a reflection in C. As music is comparatively richer in frequency content, its distribution will be well spread over the matrix. Due to noise there may be small variations in the signal which may affect the co-occurrence matrix. Moreover, very close frequencies are not perceivable to the human ear.
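The construction above translates directly into code. A small sketch, assuming the integer-valued ZCR counts of equation (1) and a default distance d = 1 (the text leaves d as a parameter):

```python
def cooccurrence(values, d=1):
    """Co-occurrence matrix C of integer-valued features at distance d."""
    values = np.asarray(values, dtype=int)
    L = values.max() + 1
    C = np.zeros((L, L), dtype=int)
    for i in range(len(values) - d):
        C[values[i], values[i + d]] += 1
    return C
```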
To combat the noise and perception issues noted above, we adopt a modified scheme to construct the co-occurrence matrix. The ZCR scale is divided into k bins defined by the points μ_z ± t × s × σ_z, where μ_z and σ_z are the mean and standard deviation of {z_i}, t takes the values 0, 1, 2, … and s is the step size. It is obvious that a substantial contribution will be confined within μ_z ± σ_z. Hence, to reveal the distribution characteristics in a detailed manner, s is taken from (0, 1). Once the bins have been formed, the z_i values are mapped onto bins and, instead of the z_i values, the corresponding bin numbers are used as indices in forming the co-occurrence matrix M of dimension k × k. From the co-occurrence matrix M, the following statistical features (Umbaugh 2005) are computed:
$$\text{Energy} = \sum_{i}\sum_{j} p(i,j)^2 \qquad (3)$$

$$\text{Entropy} = -\sum_{i}\sum_{j} p(i,j)\,\log p(i,j) \qquad (4)$$

$$\text{Contrast} = \sum_{i}\sum_{j} (i-j)^2\, p(i,j) \qquad (5)$$

$$\text{Homogeneity} = \sum_{i}\sum_{j} \frac{p(i,j)}{1+|i-j|} \qquad (6)$$

$$\text{Correlation} = \sum_{i}\sum_{j} \frac{(i-\mu_i)(j-\mu_j)\, p(i,j)}{\sigma_i\,\sigma_j} \qquad (7)$$

where $p(i,j) = M(i,j)/\sum_{i}\sum_{j} M(i,j)$ is the normalized co-occurrence matrix, and μ_i, μ_j and σ_i, σ_j are the means and standard deviations of its row and column marginals respectively.
Computing these features, a 5-dimensional ZCR based feature vector is formed. It may be noted that the texture features thus obtained are a better alternative for summarizing the frame level features.
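A sketch of the modified scheme, assuming the bin boundaries μ_z ± t × s × σ_z and the five statistics as written in equations (3)-(7); the step size s = 0.5 and the range of t used below are illustrative choices, not values fixed by the text.

```python
def to_bins(values, s=0.5, t_max=2):
    """Map values to bin numbers; boundaries are mu ± t*s*sigma for t = 0, 1, ..., t_max."""
    mu, sigma = values.mean(), values.std()
    edges = mu + np.arange(-t_max, t_max + 1) * s * sigma
    return np.digitize(values, edges)                      # integer bin index per value

def texture_features(C):
    """Energy, entropy, contrast, homogeneity and correlation of a co-occurrence matrix."""
    p = C / C.sum()                                        # normalized matrix p(i, j)
    i, j = np.indices(p.shape)
    idx = np.arange(p.shape[0])
    px, py = p.sum(axis=1), p.sum(axis=0)                  # row / column marginals
    mu_i, mu_j = (idx * px).sum(), (idx * py).sum()
    sd_i = np.sqrt(((idx - mu_i) ** 2 * px).sum())
    sd_j = np.sqrt(((idx - mu_j) ** 2 * py).sum())
    energy      = (p ** 2).sum()
    entropy     = -(p[p > 0] * np.log(p[p > 0])).sum()
    contrast    = ((i - j) ** 2 * p).sum()
    homogeneity = (p / (1 + np.abs(i - j))).sum()
    correlation = ((i - mu_i) * (j - mu_j) * p).sum() / (sd_i * sd_j)
    return np.array([energy, entropy, contrast, homogeneity, correlation])
```

The 5-dimensional ZCR texture vector is then obtained as `texture_features(cooccurrence(to_bins(zcr_per_frame(frames))))`.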
Similarly, short time energy based features are also computed. First of all, for each frame the short time energy is computed as follows:

$$E_i = \sum_{m=1}^{n} \left[x_i(m)\right]^2 \qquad (8)$$

where the frame contains n samples. Based on the set of STE values {E_i} for the frames, the co-occurrence matrix is formed in the same manner as for the ZCR co-occurrence matrix. As the range of energy values is quite high, using them directly would pose a problem for the matrix dimension. Mapping the absolute values to bins solves this problem. Such mapping also overcomes another problem: an overall rise or fall in the amplitude level of the signal does not change the nature of the signal but affects the energy values. The mapping scheme presented in this work cancels such impact and retains the signal characteristics.
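For the STE branch, equation (8) can be sketched as follows, reusing the binning and co-occurrence helpers defined above.

```python
def ste_per_frame(frames):
    """Short time energy per frame, following equation (8)."""
    frames = np.asarray(frames, dtype=float)
    return (frames ** 2).sum(axis=1)
```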
In case of a speech signal, silence zones will have minimal energy. Moreover, interleaved voiced and unvoiced speech will lead to interleaving of high and low energy. This gives a typical pattern in the co-occurrence matrix, enabling us to discriminate speech from the rest. Co-occurrence matrix based features are computed to obtain a 5-dimensional STE based feature vector. Figure 1 shows 2-D contour plots of ZCR co-occurrence matrices and STE co-occurrence matrices. The frequency of the co-occurrence pattern is indicated by colour, and the colour code is also shown alongside. It is clear that the plots are quite different for speech and music signals. For ZCR, the occurrence pattern of a speech signal shows a few peaks, as its frequency content is limited. Music being rich in frequency content, multiple peaks become apparent in the form of coloured patches in the plot. For speech, the silence zone has almost zero energy and the energy distribution is localized within a small range of bins. In case of music, energy is distributed widely across the bins, which is reflected by the coloured patch in the plot. Thus, the utility of the concept of occurrence pattern is clearly visible.
Taking the ZCR and STE co-occurrence matrix based features together, a 10-dimensional feature vector is formed and it acts as the descriptor of an audio signal for speech/music classification. As the proposed audio texture for music with and without voice is quite similar in nature, it cannot be used for discriminating them further, which has forced us to restrict its usage to speech/music classification only.
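Putting the two branches together, a sketch of the complete audio texture descriptor built from the helpers above (the frame length remains an assumed parameter):

```python
def audio_texture(x, frame_len=1024):
    """10-dimensional audio texture: ZCR and STE co-occurrence statistics."""
    frames = frame_signal(x, frame_len)
    zcr_vec = texture_features(cooccurrence(to_bins(zcr_per_frame(frames))))
    ste_vec = texture_features(cooccurrence(to_bins(ste_per_frame(frames))))
    return np.concatenate([zcr_vec, ste_vec])
```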
2.2 MFCC
It has been indicated in (Zhang and Kuo 2001) that, unlike song, instrumental music reflects stable frequency peaks in the spectrogram. In case of song, because of the human voice, such stability is not visible. This motivated them to use ZCR and fundamental frequency based descriptors and to devise a threshold based classification scheme. The same observation has motivated us to look into the frequency domain. In case of instrumental music, ideally the spectral power is confined around certain frequencies, whereas for song it is distributed over a wider range of frequencies. Song is a more complex signal as it is normally accompanied by instrumental music as well. Considering all these aspects, we have relied on cepstrum based features. A cepstrum is the inverse Fourier transform of the log spectrum, and the technique is particularly good at separating the components of complex signals made up of several simultaneous but different elements combined together.
Mel-frequency cepstral co-efficients (MFCC) are short term spectral features used by many researchers for speech recognition (Walker et al. 2004), retrieval systems (Foote 1997), music summarization (Logan and Chu 2000) and speech/music discrimination (Logan 2000). The strength of MFCC lies in its compact representation of the amplitude spectrum. The steps for computing MFCC are elaborated in (Rabiner and Juang 1993). A brief description follows.
The audio signal is first divided into a number of frames of fixed duration. Frames may consist of samples that overlap with the previous frame. To minimize the discontinuity at the beginning and end of each frame, a windowing function (the Hamming window is the most widely used one) is also applied to the frame. The amplitude spectrum for each (windowed) frame is obtained by applying the Discrete Fourier Transform (DFT). As the relation between perceived loudness and amplitude spectrum is more logarithmic than linear, the logarithm of the amplitudes is taken. Thus, an N-dimensional spectrum is obtained where N is the frame size. The spectrum is smoothened to make it perceptually meaningful. The simplest way of doing this is to consider the average spectrum over frequency bins. But equi-spaced bins over the frequency scale do not conform to the human auditory system, as the perceived frequency and the signal frequency are not linearly related. This has led to the development of the Mel frequency scale. The relation can be expressed as follows:

$$f_m = 2595 \log_{10}\left(1 + \frac{f}{700}\right)$$

where f and f_m are the signal frequency and the corresponding Mel frequency respectively. The mapping is approximately linear below 1 kHz and logarithmic above. Thus, the logarithm of the amplitude spectrum obtained after the DFT is mapped onto the Mel-frequency scale and smoothened by considering bins over the Mel scale. The elements in the smoothened Mel-spectra vector are highly correlated. To decorrelate them and to reduce the number of parameters, the Discrete Cosine Transform (DCT) is performed as follows:
$$k[n] = \sum_{m=0}^{N_T - 1} S[m]\,\cos\!\left[\frac{\pi n}{N_T}\left(m + \frac{1}{2}\right)\right], \qquad 0 \le n \le N_T - 1$$

where S[m] is the smoothened Mel-spectrum and N_T is the number of elements in the smoothened Mel-spectra vector. The k[n] thus obtained denote the Mel-frequency cepstral co-efficients, and the first 13 co-efficients are taken as the features for the frame.
After computing the MFCCs for all the frames, the vector comprising the average value of each co-efficient forms the feature descriptor. It may be noted that each Mel-frequency cepstral co-efficient k[n] is obtained after a DCT of the log-spectrum and hence captures a weighted combination of all spectral components. As a result, even if a limited number of co-efficients are taken as features, the signature of the complete frequency spectrum is still embedded in them. Thus, MFCC provides a compact representation of the amplitude spectrum of a signal.
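As a practical sketch, an off-the-shelf implementation such as librosa already bundles the framing, windowing, DFT, Mel filterbank, log and DCT steps described above; averaging the first 13 coefficients over all frames then gives the descriptor. The use of librosa and its default frame parameters is our assumption, not the authors' implementation.

```python
import numpy as np
import librosa

def mfcc_descriptor(y, sr, n_mfcc=13):
    """Average of the first n_mfcc MFCCs over all frames of signal y sampled at sr."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # shape: (n_mfcc, n_frames)
    return mfcc.mean(axis=1)                                 # 13-dimensional descriptor
```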
As already mentioned, song is complex in nature in comparison to instrumental and, unlike song, instrumental is characterized by stable frequency peaks in the spectrogram (Zhang and Kuo 2001). The presence of perceivable amplitude over a wider range of frequencies in a song is well reflected by the larger number of strong peaks in the MFCC plots shown in Figure 2. As the plots for instrumental and song are quite different, MFCC can be utilized for classifying the music signals into the sub-categories in the second stage of the proposed scheme. On the other hand, in the MFCC plots for speech signals there are a few stable peaks, making them similar to instrumental signals, and also a number of weak peaks which may overlap with song signals. Thus, it may create confusion with instrumental/song signals in various cases. As a result its success in discriminating speech from the rest is limited and we have avoided using it at the first stage. Moreover, it cannot help in a direct classification of speech, instrumental and song.
2.3 Classification scheme
The audio signal, be it speech, instrumental or song, may show wide variety, making the task of classification critical. In this context a direct threshold based approach is quite prohibitive. As a result, many researchers have relied on classification schemes like neural networks, SVM and HMM. The variation even within a class of audio signal gives rise to outliers, putting a detrimental bias on classification. Furthermore, each classifier has its own set of parameters which are not always readily interpretable, and performance depends heavily on the proper selection of such parameters. For SVM, tuning of parameters like the kernel width is very critical to achieve optimal performance. On the other hand, in case of HMM, finding the number of states, the state transition probabilities, the distinct observations per state etc. is not at all a trivial task. This has motivated us to look for an estimator characterized by parameters that are easy to interpret and tune and that is capable of handling the diversity of data satisfactorily. In this context, RANdom SAmple Consensus (RANSAC) appears as a suitable alternative that can model the diversified data even in the presence of considerable outliers. Successful application of the scheme in the domain of image processing (Torr and Zisserman 2000; Zuliani et al. 2005) has also motivated us to apply it to audio classification.
RANSAC (Fischler and Bolles 1981) is an iterative method to estimate the parameters of a model from a set of data contaminated by a large number of outliers. The strength of RANSAC over other estimators lies in the fact that the estimation is made based on the inliers, i.e. the points whose distribution can be explained by a set of model parameters. It can produce a reasonably good model provided the data set contains a sizable proportion of inliers. It may be noted that RANSAC can work satisfactorily even with outliers amounting to 50% of the entire data set (Zuliani 2005).
The RANSAC algorithm is primarily composed of two steps, hypothesize and test, which are executed iteratively. During the hypothesize phase, a minimal sample set is randomly selected and the model parameters are computed based only on the elements of the selected sample set. In the test phase, all the elements in the entire data set are checked for consistency with the model obtained. Consistent elements are included to form the new sample set. The process goes on iteratively, and in each iteration a model is obtained. Finally, the model that fits best is taken as the estimate.
Considering the data elements to be n-dimensional, RANSAC tries to fit a hyperplane and estimates the model parameters. Let the hyperplane be represented as

$$w_1 d_1 + w_2 d_2 + \cdots + w_n d_n + w_{n+1} = 0$$

where <d_1, d_2, …, d_n> is an n-dimensional point. It estimates the values of the w_i by minimizing the fitting error over the elements of the entire data set. An element e_i is considered an inlier, i.e. consistent with the model, provided its orthogonal regression distance d(e_i, W) from the model is within the threshold δ, where

$$d(e_i, W) = \frac{\left|w_1 e_{i1} + w_2 e_{i2} + \cdots + w_n e_{in} + w_{n+1}\right|}{\sqrt{w_1^2 + w_2^2 + \cdots + w_n^2}}$$

and e_i = <e_{i1}, e_{i2}, …, e_{in}>. In our experiment, δ is taken as 0.02 as suggested in (Fischler and Bolles 1981). The sum of the distances d(e_i, W) over the elements consistent with the model is taken as the total fitting error for the model under consideration. The model with minimum error is considered as the final one.
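A minimal hypothesize-and-test sketch of the hyperplane estimation: each minimal sample is fitted by total least squares (SVD), inliers are points within δ of the plane, and the model with the smallest capped total fitting error is kept. The iteration count, the refit on the consensus set and the capping of distances at δ are implementation choices of this sketch, not details given in the text.

```python
import numpy as np

def fit_hyperplane(points):
    """Total-least-squares hyperplane w.x + b = 0 (w has unit norm), fitted via SVD."""
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    w = vt[-1]                                   # direction of least variance = normal
    return w, -w @ centroid

def ransac_hyperplane(points, delta=0.02, n_iter=500, seed=0):
    """Hypothesize-and-test estimation of a hyperplane in the presence of outliers."""
    rng = np.random.default_rng(seed)
    n, dim = points.shape
    best_err, best_model = np.inf, None
    for _ in range(n_iter):
        sample = points[rng.choice(n, size=dim, replace=False)]   # minimal sample set
        w, b = fit_hyperplane(sample)
        dist = np.abs(points @ w + b)            # orthogonal distances (w is unit norm)
        inliers = dist < delta
        if inliers.sum() > dim:                  # enough consensus: refit on the inliers
            w, b = fit_hyperplane(points[inliers])
            dist = np.abs(points @ w + b)
        err = np.minimum(dist, delta).sum()      # capped total fitting error
        if err < best_err:
            best_err, best_model = err, (w, b)
    return best_model, best_err
```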
Classically, RANSAC is an estimator of the parameters of a model from a given data set. In this work, however, it has been used as a classifier. Corresponding to the data set of each category, a model is estimated first. Then, for a given element, its class can be determined easily by finding the best matched model. As discussed, RANSAC estimates the model relying on the inliers and, unlike other techniques, is less affected by noisy data. Thus, RANSAC is well suited for our purpose.
It has already been discussed that audio texture can discriminate only speech and music; it cannot further distinguish a music signal as instrumental or song. On the other hand, the MFCC plots for instrumental and song are quite distinct, but the plot for speech introduces confusion. Hence, at the first stage RANSAC models the signals as speech or music considering audio texture as the feature, and classification is carried out. At the next stage, RANSAC further classifies the detected music signals into song/instrumental based on the model formed using MFCC as the feature, as sketched below. Thus, a hierarchical classification scheme is followed.
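A sketch of how the pieces above could be chained for the two-stage classification; the per-class model fitting, the minimum-distance matching rule and the class labels are hypothetical glue following the description, not the authors' code.

```python
def fit_class_models(features_by_class, delta=0.02):
    """Fit one RANSAC hyperplane model per class from its training feature vectors."""
    return {label: ransac_hyperplane(np.asarray(feats), delta=delta)[0]
            for label, feats in features_by_class.items()}

def best_match(models, v):
    """Class whose model lies closest (orthogonal distance) to the feature vector v."""
    return min(models, key=lambda lbl: abs(models[lbl][0] @ v + models[lbl][1]))

def classify_signal(x, sr, stage1_models, stage2_models):
    """Stage 1: speech vs. music on audio texture; stage 2: instrumental vs. song on MFCC."""
    if best_match(stage1_models, audio_texture(x)) == 'speech':
        return 'speech'
    return best_match(stage2_models, mfcc_descriptor(np.asarray(x, dtype=float), sr))
```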