Pathological brain detection in MRI scanning by wavelet packet Tsallis entropy and fuzzy support vector machine

A computer-aided diagnosis system for pathological brain detection (PBD) is important for helping physicians interpret and analyze medical images. In this paper, we propose a novel automatic PBD method to distinguish pathological brains from healthy brains in magnetic resonance imaging scans. The proposed method simplifies the PBD problem to a binary classification task. We extracted the wavelet packet Tsallis entropy (WPTE) from each brain image. The WPTE is the Tsallis entropy of the coefficients of the discrete wavelet packet transform. Then, the features were submitted to a fuzzy support vector machine (FSVM). We tested the proposed diagnosis method on three benchmark datasets of different sizes. Ten runs of K-fold stratified cross validation were carried out. The results demonstrated that the proposed WPTE + FSVM method outperformed 17 state-of-the-art methods with respect to classification accuracy. The WPTE is superior to the discrete wavelet transform, the Tsallis entropy performs better than the Shannon entropy, and the FSVM outperforms the standard SVM. In closing, the proposed “WPTE + FSVM” method is effective in PBD.

translation-variant; hence, the coefficients behave unpredictably if the input signal is translated slightly. In the PBD problem, the subject's head usually moves slightly during the scan, which causes translation of the MR images.
Another problem is the classifier. Current scholars tend to use either an artificial neural network (ANN) or a support vector machine (SVM). Nevertheless, both are sensitive to outliers and noise: if the training set contains noisy samples or outliers, the classifier still treats them as being as important as normal data.
We suggest three improvements to solve the above problems. First, we employed the discrete wavelet packet transform (WPT), which is an extension of the standard discrete wavelet transform (DWT). Second, we introduced Tsallis entropy (TE) to replace Shannon entropy (SE). Third, we introduced the fuzzy support vector machine (FSVM), which combines the SVM with a fuzzy logic approach (Ashkezari et al. 2013) and has the advantage of reducing the effect of outliers and noise.
The rest of the paper is organized as follows. "State-of-the-art" presents the state-of-the-art. "Materials" introduces the materials used in this study. "Feature extraction" discusses the features. "Classifier" gives the classifier. "Implementation and experiments" shows the implementation of the whole method and designs the experiments. "Results and discussions" contains the results and discussions. "Conclusion and future research" offers the conclusion and future research. We explain the nomenclature in Abbreviations at the end of the paper.

State-of-the-art
Chaplot et al. (2006) were the first to address the PBD problem. They used the approximation coefficients from the DWT, and utilized the support vector machine (SVM) and self-organizing map (SOM). El-Dahshan et al. (2010) extracted all coefficients of all subbands of a three-level DWT. Then, they reduced the number of features by principal component analysis (PCA). Finally, two classifiers, K-nearest neighbors (KNN) and feed-forward back-propagation ANN (FP-ANN), were employed. Wu and Wang (2011) followed El-Dahshan's method, but suggested using a feed-forward neural network (FNN) as the classifier, trained by scaled chaotic artificial bee colony (SCABC). Dong et al. (2011) proposed employing the scaled conjugate gradient (SCG) method in place of SCABC. Zhang and Wu (2012) suggested utilizing the kernel support vector machine (KSVM), with three kernels: homogeneous polynomial, inhomogeneous polynomial, and radial basis function (RBF). Das et al. (2013) developed a novel method combining the Ripplet transform (RT), PCA, and least square support vector machine (LS-SVM). Their five-fold cross validation results showed promising classification accuracies. Saritha et al. (2013) proposed a novel feature, wavelet-entropy (WE), and employed spider-web plots (SWP) to further reduce features. Afterwards, they used the probabilistic neural network (PNN). Yu et al. (2015d) commented on Saritha's paper and stated that dropping the SWP yields the same results. Zhang et al. (2013) suggested using particle swarm optimization to train the KSVM.
Padma and Sukanesh (2014) used combined wavelet statistical texture features to segment and classify AD benign and malignant tumor slices. El-Dahshan et al. (2014) used the feedback pulse-coupled neural network for image segmentation, the DWT for feature extraction, the PCA for reducing the dimensionality of the wavelet coefficients, and the FBPNN to classify inputs into normal or abnormal. Wang et al. (2014) used a kernel support vector machine decision tree.
Zhou et al. (2015) used wavelet-entropy as the feature space, and then employed a Naive Bayes classifier (NBC). Their results over 64 images showed a sensitivity of 94.50 %, a specificity of 91.70 %, and an overall accuracy of 92.60 %. Damodharan and Raghavan (2015) combined tissue segmentation and a neural network for brain tumor detection. Yang et al. (2015) selected wavelet-energy as the features, and introduced biogeography-based optimization (BBO) to train the SVM. Their method reached 97.78 % accuracy on 90 T2-weighted MR brain images. Nazir et al. (2015) suggested using filters to remove noise, and extracted color moments as mean features. They achieved an overall accuracy of 91.8 %. Dong et al. (2015) suggested using a 3D eigenbrain method to detect subjects and brain regions related to AD. The accuracy reached 92.36 ± 0.94. Harikumar and Kumar (2015) analyzed the performance of ANN for classification of medical images, using wavelets as the feature extractor. Their classification accuracy reached 96 %. Wang et al. (2015a) suggested using the stationary wavelet transform (SWT) to replace DWT, and then proposed a Hybridization of Particle swarm optimization and Artificial bee colony (HPA) algorithm to train the classifier. Farzan et al. (2015) used the longitudinal percentage of brain volume changes (PBVC) in a two-year follow-up and its intermediate counterparts in early 6-month and late 18-month as features. Their experimental results obtained an accuracy of 91.7 %. Munteanu et al. (2015) employed Proton Magnetic Resonance Spectroscopy (MRS) data, with the aim of detecting MCI and AD. They used a single-layer perceptron with only two spectroscopic voxel volumes obtained in the left hippocampus, with an AUROC value of 0.866. Zhang et al. (2015d) combined wavelet entropy with Hu moment invariants (HMI), giving 14 features in total. They also used GEPSVM as the classifier.

Magnetic resonance brain image dataset
Three benchmark magnetic resonance brain image datasets with different numbers of images, D-66, D-160, and D-255, were downloaded from the website of Harvard University. The data contain T2-weighted images obtained along the axial plane, all of size 256 × 256. These three datasets are commonly used in PBD tests. In addition to healthy brain images, D-66 and D-160 consist of seven types of brain disease: AD, AD plus visual agnosia, glioma, meningioma, sarcoma, Huntington's disease (HD), and Pick's disease (PiD). D-255 introduces four other diseases: cerebral toxoplasmosis, subdural hematoma (SDH), multiple sclerosis (MS), and herpes encephalitis. Figure 1 shows samples of brain MR images.
The costs of the two kinds of misclassification are different. Predicting a pathological brain as healthy is very serious because it defers the necessary treatment, whereas mispredicting a healthy brain as pathological can be corrected by a second check with other techniques. Hence, we intentionally created three imbalanced datasets that contain more pathological brains than usual, so that the PBD system is biased toward detecting pathological brains, with the aim of addressing this cost-sensitive task.

Statistical setting
Cross validation (CV) is commonly used for statistical testing. Stratification is embedded into CV so that each fold contains nearly the same class distribution. In this work, six-fold stratified CV (SCV) was utilized for the smallest dataset (D-66), and five-fold SCV for the other datasets (D-160 and D-255). Table 1 lists the SCV setting of all datasets.
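To illustrate how stratification keeps the class distribution nearly identical across folds, the following is a minimal sketch in plain Python. The function name `stratified_folds` and the class counts used in the example are assumptions made for illustration; they are not taken from Table 1 or from the authors' Matlab implementation.

```python
import random
from collections import Counter, defaultdict

def stratified_folds(labels, k, seed=0):
    """Assign each sample index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so per-class counts differ by at most one.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds

# Hypothetical class split for illustration only: 0 = healthy, 1 = pathological.
labels = [0] * 18 + [1] * 48
folds = stratified_folds(labels, k=6)
fold_counts = [Counter(labels[i] for i in fold) for fold in folds]
```

Repeating the call with different `seed` values would give the repeated runs of K-fold SCV; in each run, one fold serves as the validation set and the remaining folds as the training set.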

Feature extraction
Co-registration was not performed, since many PBD publications obtained excellent classification results without it, comparable with the results of studies that employed co-registration (Ribbens et al. 2014; Schwarz and Kasparek 2014).
Fig. 1 Samples from the magnetic resonance brain image dataset: a Healthy brain, b Meningioma, c Glioma, d Sarcoma, e SDH, f PiD, g AD, h HD, i AD with visual agnosia, j Herpes encephalitis, k Cerebral toxoplasmosis, l MS

Wavelet packet transform
Compared to the standard discrete wavelet transform (DWT), the wavelet packet transform (WPT) is an extension in which the signal is passed through more filters. The DWT calculates each level by passing only the previous approximation coefficients through quadrature mirror filters (QMF). In contrast, the WPT passes all coefficients (both approximation and detail) through the QMF to create a full binary tree. Therefore, the WPT generates more features at different levels and captures more information. In the WPT formulation, m represents the channel index, p the position parameter, d the decomposition level, ψ the wavelet function, and S the decomposition coefficients. For a 1D signal, $2^d$ sequences are yielded at level d, and the coefficients of the next level are obtained by filtering each current sequence with the QMF pair and downsampling. For a d-level decomposition of a 2D image, the DWT produces (3d + 1) coefficient sets, while the WPT produces $4^d$ sets, for example 16 subbands at two levels. Note that the total number of WPT coefficients is still the same as that of the DWT, because of the downsampling process (Fig. 2).
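The difference between the two trees can be sketched for a 1D signal with an orthonormal Haar filter pair. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `haar_step`, `dwt`, and `wpt` are hypothetical, and the toy signal stands in for real data. For a 1D signal, a d-level DWT yields d + 1 subbands while the WPT yields 2^d, yet both hold the same total number of coefficients.

```python
import numpy as np

def haar_step(x):
    """One Haar analysis step: (approximation, detail), each half the input length."""
    x = np.asarray(x, dtype=float)
    even, odd = x[0::2], x[1::2]
    return (even + odd) / np.sqrt(2.0), (even - odd) / np.sqrt(2.0)

def dwt(x, levels):
    """DWT: only the approximation band is split again at each level."""
    subbands = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        subbands.append(detail)
    subbands.append(approx)
    return subbands

def wpt(x, levels):
    """WPT: every band (approximation and detail) is split again at each level."""
    nodes = [np.asarray(x, dtype=float)]
    for _ in range(levels):
        nodes = [child for node in nodes for child in haar_step(node)]
    return nodes

signal = np.arange(16, dtype=float)  # toy signal; length must be divisible by 2**levels
dwt_bands = dwt(signal, levels=2)    # 3 subbands of sizes 8, 4, 4
wpt_bands = wpt(signal, levels=2)    # 4 subbands of size 4 each
```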

Shannon and Tsallis entropy
Shannon entropy (SE) is defined as a measure of uncertainty regarding the information content (IC):

$$E = -\sum_{k=1}^{Z} p_k \log p_k$$

Here E represents the entropy, Z the total number of grey levels, k the grey level, and $p_k$ the probability of k. Shannon entropy can describe only scenarios with simple effective microscopic interactions and short-ranged microscopic memory (Campos 2010). Assume a physical system can be broken down into two independent subsystems X and Y; then the Shannon entropy has the additivity property

$$E(X + Y) = E(X) + E(Y)$$

Fig. 2 Flowchart of 2-level 1D-WPT
Nevertheless, realistic scenarios usually involve long-time memory and long-range interactions; therefore, Tsallis (2009) proposed a generalization of SE. He termed it Tsallis entropy (TE), with the following form:

$$E_q = \frac{1 - \sum_{k=1}^{Z} p_k^{\,q}}{q - 1}$$

Here q is a real number representing the degree of nonextensivity. For a system composed of two statistically independent subsystems X and Y, the Tsallis entropy satisfies (Zhang and Wu 2011)

$$E_q(X + Y) = E_q(X) + E_q(Y) + (1 - q)\,E_q(X)\,E_q(Y)$$

This equation is the pseudo-additivity rule. Further, three different entropies can be deduced, as listed in Table 2, when q is assigned different values (Tsallis 2011). In this study, TE was employed to extract features from the 16 subbands of WPT coefficients of MR brain images.
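The pseudo-additivity rule and the q → 1 limit can be checked numerically. The sketch below assumes the standard Tsallis form (1 − Σ p^q)/(q − 1) and the natural logarithm for the Shannon entropy; the function names and the two toy distributions are illustrative assumptions.

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy in nats; zero-probability bins contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

def tsallis_entropy(p, q):
    """Tsallis entropy (1 - sum p^q) / (q - 1); falls back to Shannon at q = 1."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return shannon_entropy(p)
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

# Two toy distributions and the joint distribution of independent X and Y.
p_x = np.array([0.5, 0.5])
p_y = np.array([0.2, 0.3, 0.5])
p_xy = np.outer(p_x, p_y).ravel()

q = 0.8
lhs = tsallis_entropy(p_xy, q)
rhs = (tsallis_entropy(p_x, q) + tsallis_entropy(p_y, q)
       + (1 - q) * tsallis_entropy(p_x, q) * tsallis_entropy(p_y, q))
```

Here `lhs` and `rhs` agree up to floating-point error, and evaluating `tsallis_entropy` at a value of q very close to 1 approaches `shannon_entropy`.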

Wavelet packet Tsallis entropy
We employed both Shannon entropy (SE) and Tsallis entropy (TE) to extract features from the wavelet-packet decomposition coefficients. The final extracted features were dubbed Wavelet Packet Tsallis Entropy (WPTE), which reduces to Wavelet Packet Shannon Entropy (WPSE) when q equals 1. The pseudocode of feature extraction is listed in Table 3.
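The extraction steps can be sketched as follows, assuming a two-level 2D Haar wavelet packet decomposition and assuming that the probabilities of each subband are estimated from a histogram of its coefficients. The histogram estimate, the bin count, the function names, and the random stand-in image are all assumptions made for illustration; the paper does not specify these details here.

```python
import numpy as np

def haar_split_2d(a):
    """Split a 2D array into four half-size Haar subbands."""
    # Combine adjacent columns (filtering along each row).
    lo = (a[:, 0::2] + a[:, 1::2]) / np.sqrt(2.0)
    hi = (a[:, 0::2] - a[:, 1::2]) / np.sqrt(2.0)
    out = []
    for band in (lo, hi):
        # Combine adjacent rows (filtering along each column).
        out.append((band[0::2, :] + band[1::2, :]) / np.sqrt(2.0))
        out.append((band[0::2, :] - band[1::2, :]) / np.sqrt(2.0))
    return out

def wpt_2d(image, levels=2):
    """Full wavelet packet tree: every subband is split at every level."""
    nodes = [np.asarray(image, dtype=float)]
    for _ in range(levels):
        nodes = [child for node in nodes for child in haar_split_2d(node)]
    return nodes

def tsallis_entropy(p, q):
    p = p[p > 0]
    if np.isclose(q, 1.0):
        return float(-np.sum(p * np.log(p)))  # Shannon limit (WPSE)
    return float((1.0 - np.sum(p ** q)) / (q - 1.0))

def wpte(image, q=0.8, levels=2, bins=256):
    """One Tsallis entropy per subband (histogram-based probabilities, assumed)."""
    features = []
    for band in wpt_2d(image, levels):
        counts, _ = np.histogram(band, bins=bins)
        features.append(tsallis_entropy(counts / counts.sum(), q))
    return np.array(features)

rng = np.random.default_rng(0)
image = rng.random((256, 256))   # stand-in for a 256 x 256 MR slice
features = wpte(image, q=0.8)    # 16-element feature vector
```

Calling `wpte(image, q=1.0)` gives the corresponding WPSE vector under the same assumptions.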

Support vector machine
Table 3 Pseudocode of feature extraction
Algorithm: WPTE Extraction
Step A Import a brain image
Step B Implement a two-level WPT decomposition
Step C Extract the Tsallis entropy over each coefficient set
Step D Output the 16-element WPTE vector

Let us suppose there are N training samples, each a p-dimensional vector belonging to one of two classes (−1 or +1), and the goal is to create a (p − 1)-dimensional hyperplane. Assume the dataset takes the form (Wang et al. 2014)

$$\{(x_n, y_n) \mid x_n \in \mathbb{R}^p,\ y_n \in \{+1, -1\}\},\quad n = 1, \ldots, N$$

where $y_n$ takes the value of −1 for class −1, or +1 for class +1, and $x_n$ denotes a training point that is a p-dimensional vector (Zhang et al. 2013). The maximum-margin hyperplane that separates the two classes is the desired SVM. Since any hyperplane has the form wx − b = 0, we need to select the optimal b and w so as to maximize the distance between the two parallel hyperplanes while still separating the data of the two classes.
Positive slack variables $\xi = (\xi_1, \ldots, \xi_n, \ldots, \xi_N)$ are utilized to measure the degree of misclassification of sample $x_n$ (the distance between the margin and a vector $x_n$ on the wrong side). The optimal hyperplane is deduced by solving a soft-margin problem in which C represents the error penalty and e an N-dimensional vector of ones. The optimization therefore becomes a trade-off between a large margin and a small error penalty. The constrained optimization problem can be handled with Lagrange multipliers, which leads to a min-max problem. This min-max problem is not easy to solve directly, so the dual form is commonly used instead. The key advantage of the dual form is that the slack variables $\xi_n$ vanish from the dual problem, with the constant C appearing only as an additional constraint on the Lagrange multipliers.
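For reference, the three problems described in this paragraph can be written in the standard soft-margin form, using the hyperplane convention wx − b = 0 and the symbols C, e, and ξ from the text. The multiplier names α and β are assumptions introduced here; this is a sketch of the textbook formulation rather than a reproduction of the paper's numbered equations.

```latex
% Soft-margin primal problem
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\,e^{T}\xi
\quad \text{s.t.}\quad y_n\,(w^{T}x_n - b) \ge 1 - \xi_n,\ \ \xi_n \ge 0,\ \ n = 1,\ldots,N

% Lagrangian (min-max) form with multipliers \alpha_n, \beta_n \ge 0
\min_{w,\,b,\,\xi}\ \max_{\alpha,\,\beta \ge 0}\
\frac{1}{2}\lVert w\rVert^{2} + C\,e^{T}\xi
- \sum_{n=1}^{N}\alpha_n\bigl[y_n\,(w^{T}x_n - b) - 1 + \xi_n\bigr]
- \sum_{n=1}^{N}\beta_n\,\xi_n

% Dual problem: the slack variables no longer appear
\max_{\alpha}\ \sum_{n=1}^{N}\alpha_n
- \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m\,y_n y_m\,x_n^{T}x_m
\quad \text{s.t.}\quad 0 \le \alpha_n \le C,\ \ \sum_{n=1}^{N}\alpha_n y_n = 0
```

In the dual, C enters only through the box constraint on each α_n, which matches the remark above about the slack variables vanishing.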

Fuzzy SVM
Fuzzy SVM (FSVM) is more effective than the standard SVM in predicting or classifying real-world data in which some training points are less important than others. We would like the meaningful training points to be classified correctly, while meaningless points such as noise or outliers are given less weight (Lin and Wang 2002).
FSVM applies a fuzzy membership function (FMF) s to each training sample (Xian 2010), so that the training set is transformed into a fuzzy set, which can be expressed as

$$\{(x_n, s_n, y_n) \mid x_n \in \mathbb{R}^p,\ 0 < s_n \le 1,\ y_n \in \{+1, -1\}\},\quad n = 1, \ldots, N \tag{13}$$

where $s = (s_1, s_2, \ldots, s_N)$ represents the fuzzy membership vector. A smaller $s_n$ reduces the effect of the slack variable $\xi_n$, such that the corresponding point $x_n$ is treated as less important. In a similar way to the standard SVM, we construct the Lagrangian, and the dual form is again used to transform the resulting problem (15).
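As a sketch of how the membership vector enters the optimization, the formulation of Lin and Wang (2002) replaces the vector of ones e with s in the penalty term. The multiplier name α is an assumption introduced here, and the expressions below follow the textbook form rather than reproducing the paper's numbered equations.

```latex
% FSVM primal problem: each slack variable is weighted by its membership s_n
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\,s^{T}\xi
\quad \text{s.t.}\quad y_n\,(w^{T}x_n - b) \ge 1 - \xi_n,\ \ \xi_n \ge 0,\ \ n = 1,\ldots,N

% FSVM dual problem: the upper bound on each multiplier is scaled by s_n
\max_{\alpha}\ \sum_{n=1}^{N}\alpha_n
- \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{N}\alpha_n\alpha_m\,y_n y_m\,x_n^{T}x_m
\quad \text{s.t.}\quad 0 \le \alpha_n \le s_n C,\ \ \sum_{n=1}^{N}\alpha_n y_n = 0
```

The only change relative to the standard dual is the per-sample bound $s_n C$, so a point with small membership can contribute at most a small multiplier to the hyperplane.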

Fuzzy membership
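We set the FMF as a function of the distance between a point and its class center. Denote the mean of class +1 as $x_+$ and the mean of class −1 as $x_-$. Then the radii of the two classes are

$$r_+ = \max_{\{n \,:\, y_n = +1\}} \lVert x_+ - x_n \rVert, \qquad r_- = \max_{\{n \,:\, y_n = -1\}} \lVert x_- - x_n \rVert$$

The fuzzy membership $s_n$ is defined as a function of the radius and mean of each class (Lin and Wang 2002):

$$s_n = \begin{cases} 1 - \lVert x_+ - x_n \rVert / (r_+ + \delta), & y_n = +1 \\ 1 - \lVert x_- - x_n \rVert / (r_- + \delta), & y_n = -1 \end{cases}$$

where δ > 0 is used to guarantee $s_n > 0$.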
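A minimal sketch of this distance-based membership is given below, following the class-center form of Lin and Wang (2002). The function name `fuzzy_membership`, the default value of `delta`, and the toy points are assumptions for illustration only.

```python
import numpy as np

def fuzzy_membership(X, y, delta=1e-3):
    """s_n = 1 - ||x_n - class mean|| / (class radius + delta), computed per class."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    s = np.empty(len(y), dtype=float)
    for label in (+1, -1):
        mask = y == label
        center = X[mask].mean(axis=0)                   # class mean
        dist = np.linalg.norm(X[mask] - center, axis=1)
        radius = dist.max()                             # farthest point of the class
        s[mask] = 1.0 - dist / (radius + delta)
    return s

# Toy data: the third point of class +1 lies far from the other two.
X = np.array([[0.0, 0.0], [0.2, 0.0], [3.0, 0.0],
              [5.0, 5.0], [5.2, 5.0], [5.1, 5.2]])
y = np.array([+1, +1, +1, -1, -1, -1])
s = fuzzy_membership(X, y)
```

The distant point receives the smallest membership in its class, so its slack variable is penalized least during training.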

Implementation and experiments
Implementation
Figure 3 shows the diagram of the proposed PBD system. In the offline learning phase, the users select the optimal q (to determine the value of q*) and train the classifier on the fuzzy training set (13), $\{(x_n, s_n, y_n) \mid x_n \in \mathbb{R}^p,\ 0 < s_n \le 1,\ y_n \in \{+1, -1\}\}$, $n = 1, \ldots, N$. In the online prediction phase, the users get the prediction result for each query image.

Experiment design
In this study, we developed four different methods: "WPSE + SVM", "WPSE + FSVM", "WPTE + SVM", and "WPTE + FSVM". Theoretically, the last one should perform the best, since WPSE is a special case of WPTE, and FSVM is an extension of SVM with the additional ability to reduce the influence of noise and outliers. We need to verify this by experiments. In this work, we designed five tasks. (1) We compared DWT and WPT on a healthy brain and a pathological brain, using a 2-level Haar wavelet decomposition. (2) We compared the proposed WPSE and WPTE features with traditional DWT and "DWT + PCA" features, all using SVM as the classifier. (3) We compared the four proposed methods to check whether FSVM is superior to SVM. (4) We selected the best of the proposed methods and compared it with state-of-the-art approaches. (5) We used grid search to find the optimal value of the parameter q.

Results and discussions
The experiments were carried out on an IBM machine with a 3 GHz Core i3 processor and 8 GB of random access memory (RAM), running the Windows 7 operating system (OS). The algorithm was developed in-house on Matlab 2014a (The MathWorks ©).

WPT versus DWT
In the first experiment, we compared DWT with WPT on a healthy brain and an Alzheimer's disease brain, respectively (Fig. 4). The second column shows the original image, the third column the DWT decomposition results, and the final column the WPT results. A pink colormap is employed for better viewing.

Feature comparison
In the second experiment, we compared the proposed WPSE and WPTE (q is set to 0.8; please refer to "Optimal parameter q") with two types of traditional features: (i) DWT and (ii) DWT + PCA. Note that Chaplot et al. (2006) proposed the DWT + SVM method, and Zhang and Wu (2012) proposed the DWT + PCA + SVM method. For fair comparison, we chose the same classifier, SVM.
(3) Brain images entail long-range interaction and fractal-type structure, because of the self-similarity observed in brain structures imaged with a finite resolution, which can be easily extracted by the corresponding wavelet packet coefficients. In summary, there are similarities at different spatial scales in brain images, which makes WPTE more suitable than WPSE in describing brains.

Classifier comparison
To compare the classification performance of SVM and FSVM, we set the features as WPSE and WPTE (q = 0.8), and then applied both SVM and FSVM for classification. The results of 10 runs of K-fold SCV are listed in Table 5. They show that "WPSE + FSVM" obtains accuracies of 99.85, 99.69, and 98.94 % over the three datasets, which are higher than those obtained by "WPSE + SVM". Similar results occur between "WPTE + FSVM" and "WPTE + SVM": the classification accuracy increases after SVM is replaced with FSVM. The reason is that FSVM applies an FMF to each training sample, so it can reduce the influence of noise and outliers. In addition, "WPTE + FSVM" performs the best among all four proposed approaches. It is used as the default proposed method in the following text.

Comparison with state-of-the-art
We compared the best proposed method (WPTE + FSVM) with state-of-the-art methods. We averaged the results of 10 runs of K-fold SCV. The comparison results are listed in Table 6, in which some older approaches were run five times in their papers, with results extracted from the literature (Das et al. 2013). Our experiment was run ten times to obtain more robust results than a five-run experiment.
The value of q was again set to 0.8 (the reason can be found in "Optimal parameter q"). The regularization constant C was obtained via grid search. Table 6 shows that the proposed "WPTE + FSVM" performed better than existing state-of-the-art methods, obtaining perfect classification for the first two datasets and an accuracy of 99.49 % for D-255. This demonstrates the effectiveness of FSVM, which can reduce the effect of noise and outliers in the training points, yielding a more reliable hyperplane than the standard SVM. The second best method is "RT + PCA + LS-SVM" (Das et al. 2013), which achieved 99.39 % for D-255. Finally, the average evaluations over 10 runs of the proposed WPTE + FSVM method are listed in Table 7. For D-66 and D-160, WPTE + FSVM yielded perfect classification. For D-255, its performance slightly decreased, with a sensitivity of 99.50 %, specificity of 99.43 %, precision of 99.91 %, and accuracy of 99.49 %.
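The four evaluation measures quoted above follow from the confusion-matrix counts in the usual way, with the pathological class taken as positive. The sketch below uses hypothetical counts chosen only for illustration; they are not taken from Table 7, and the function does not guard against empty classes.

```python
def binary_metrics(tp, fn, tn, fp):
    """Evaluation measures with 'pathological' as the positive class."""
    return {
        "sensitivity": tp / (tp + fn),   # pathological brains correctly detected
        "specificity": tn / (tn + fp),   # healthy brains correctly detected
        "precision": tp / (tp + fp),     # predicted pathological that truly are
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
    }

# Hypothetical confusion-matrix counts, not taken from the paper's tables.
metrics = binary_metrics(tp=219, fn=1, tn=34, fp=1)
```

Because the datasets are deliberately imbalanced toward pathological brains, sensitivity and specificity are more informative together than accuracy alone.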

Optimal parameter q
The parameter q influences the extracted features, so it also influences classification performance. Its value should be no more than 1, since the brain image is subextensive, containing complicated regions. In this final experiment, we varied the value of q over the set [0.1, 0.2, 0.3, …, 0.9, 1] (note that q = 1 reduces WPTE to WPSE), and ran the offline training for each value. We recorded the average accuracy over 10 runs on the dataset D-255 for the proposed "WPTE + FSVM". The results are shown in Fig. 5 and Table 8. Figure 5 demonstrates that the value of q has a slight but discernible effect on the average accuracy of 10 runs. As q increases to 0.8, the curve rises gradually to its highest point. As q increases further to 1, the average accuracy decreases sharply. The result again validates that WPTE (q = 0.8) is better than WPSE (q = 1).
This optimal result (q = 0.8) is identical to that reported in recent literature, including Sturzbecher et al. (2009) and Cabella et al. (2009). Furthermore, Diniz et al. (2010) found q = 1.5 for gray matter (GM), 0.1 for white matter (WM), and 0.2 for cerebrospinal fluid (CSF). Here we treat the whole brain as a single entity, so we must assign a single value to q. The optimal q of 0.8 can be regarded as an average of the best q of GM, WM, and CSF.
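The coarse grid search over q can be sketched as a simple loop over candidate values. In the sketch below, `grid_search_q` and `toy_evaluate` are hypothetical names, and `toy_evaluate` is a placeholder objective rather than real data: in an actual run it would return the mean accuracy over 10 runs of K-fold SCV for features extracted with the given q.

```python
def grid_search_q(evaluate, q_grid):
    """Return (best_q, scores), where scores maps each q to its evaluation result."""
    scores = {q: evaluate(q) for q in q_grid}
    best_q = max(scores, key=scores.get)
    return best_q, scores

# Candidate values 0.1, 0.2, ..., 1.0 (rounded to avoid floating-point drift).
q_grid = [round(0.1 * i, 1) for i in range(1, 11)]

def toy_evaluate(q):
    # Placeholder objective for illustration only; it is NOT measured accuracy.
    return 1.0 - (q - 0.8) ** 2

best_q, scores = grid_search_q(toy_evaluate, q_grid)
```

A finer grid around the best coarse value could be passed as `q_grid` in a second call, which is one way to refine the coarse search.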

Discussion on the proposed method
There were three reasons for using WPT, TE, and FSVM. (1) WPT yields more features than DWT does. (2) Entropy can efficiently represent the complexity of subband coefficients, and TE is a better feature descriptor for brain structures than SE. (3) FSVM applies an FMF to each training sample, so it can reduce the influence of noise and outliers.
The contributions of this work center on three points: (i) We employed WPTE, which offers better information description than WPSE. (ii) We employed FSVM, which can deal with noise and outliers better than plain SVM. (iii) We showed that the proposed "WPTE + FSVM" approach obtained superior average accuracy to 17 state-of-the-art approaches.

Conclusion and future research
In this study, we treated PBD as a binary classification problem with two classes, pathological and healthy. To solve it, we proposed a novel feature, WPTE, which uses WPT to replace the traditional DWT and TE to replace the traditional SE, and fed WPTE into FSVM. The experiments showed that the proposed "WPTE + FSVM" method yielded superior performance to state-of-the-art methods. Future work will focus on the following five aspects: (i) we will include other imaging techniques, such as DTI, fMRI and MRSI; (ii) the classification performance may increase by using other advanced variants of SVM, such as GEPSVM (Yu et al. 2015a) and Twin SVM (Jayadeva et al. 2007); (iii) we will check the effect of other wavelet families and other decomposition levels; (iv) we will try to develop a fine-grid search to replace the coarse-grid search technique; (v) swarm intelligence methods will be employed to train the weights of classifiers.

Authors' contributions
YDZ and SHW conceived the study. YDZ and XJY designed the model. SHW and ZCD acquired the data. YDZ, GL and TFY analyzed the data. GL and PP interpreted the data. YDZ and ZCD developed the program. YDZ, SHW, and TFY wrote the draft. All authors gave critical revisions and approved the submission. All authors read and approved the final manuscript.