Feature selection and semi-supervised clustering using multiobjective optimization
SpringerPlus volume 3, Article number: 465 (2014)
Abstract
In this paper we couple the feature selection problem with semi-supervised clustering. Semi-supervised clustering combines information from unsupervised and supervised learning in order to overcome the limitations of each. In general, however, not all features present in a data set are important for clustering, so appropriate selection of features from the set of all features is highly relevant from the clustering point of view. In this paper we solve the problems of automatic feature selection and semi-supervised clustering jointly using multiobjective optimization. A recently developed simulated annealing based multiobjective optimization technique, archived multiobjective simulated annealing (AMOSA), is used as the underlying optimization tool. Features and cluster centers are encoded together in the form of a string. We assume that for each data set the class labels of 10% of the data points are known. Four objectives are optimized simultaneously using the search capability of AMOSA: two internal cluster validity indices reflecting different data properties, an external cluster validity index measuring the similarity between the obtained partitioning and the true labelling of the 10% labelled points, and a measure counting the number of features present in a particular string. AMOSA thereby detects the appropriate subset of features, the appropriate number of clusters and the appropriate partitioning of any given data set. The effectiveness of the proposed semi-supervised feature selection technique compared to existing techniques is shown for seven real-life data sets of varying complexities.
1 Introduction
Clustering, also termed unsupervised learning, is the method of grouping data items into different partitions or clusters such that points belonging to the same cluster are similar in some manner and points belonging to different clusters are dissimilar in the same manner (Saha and Bandyopadhyay 2010). In supervised learning a training set must be available which captures prior knowledge about the class labels of some points. A classifier can be trained on this set; it can then predict the class labels of unlabelled data using the model built. In unsupervised classification, by contrast, no prior knowledge about the data points is available: the data are classified based on the actual distribution of the data items and some well-quantified intrinsic property. In real life it is easy to generate plenty of unlabelled data but hard to determine their actual annotations; human annotators are required to annotate the data, so training set generation is expensive and time-consuming. As a result unsupervised learning methods are more widely used than supervised ones, but the accuracies obtained by unsupervised clustering techniques are lower than those of supervised classification techniques. It is thus desirable to develop classification techniques which combine the strengths of both supervised and unsupervised classification. This approach, named semi-supervised classification, was developed to address the problems associated with both unsupervised and supervised classification methods. Both supervised and unsupervised information are used in semi-supervised classification, and because of this property semi-supervised classification is in high demand and shows good results.
It has a large number of applications in natural language processing, information retrieval, data mining, gene function classification, protein classification and bioinformatics (Handl and Knowles 2006; Li et al. 2003). In this paper we solve some problems related to semi-supervised clustering techniques.
Feature selection, or subset selection, is the method of reducing dimensionality in machine learning. It is important for several reasons. First, the total computation can be reduced if the dimensionality is reduced. Secondly, not all features may be helpful for classifying the data; some may be redundant or irrelevant from the classification point of view. It is therefore necessary to determine a relevant subset of features automatically. To address these problems, feature selection is needed for both unsupervised and supervised classification. There are many works addressing the feature selection problem in the supervised domain (Aha and Bankert 1996; Bermejo et al. 2012; Blum and Langley 1997; Liu and Yu 2005), but very few in the unsupervised domain, where it is difficult to measure the goodness of a particular feature. In recent years some works have been reported on the unsupervised feature selection problem (Dash and Liu 1999; Dy and Brodley 2004; Kim et al. 2002; Mitra et al. 2002; Peña et al. 2001; Talavera 1999), but most of these pose feature selection as a single objective optimization problem, optimizing a single cluster quality measure. More recently, some approaches use multiobjective optimization to solve the unsupervised feature selection problem. A multiobjective wrapper based approach was developed by Morita et al. (2003): the K-means clustering technique is used as the underlying partitioning method, the number of clusters is varied within a range, and a multiobjective evolutionary algorithm (the Non-dominated Sorting GA-II, NSGA-II (Deb et al. 2002)) is used as the underlying optimization tool.
Two objective functions are deployed for the clustering task, namely the number of features and the Davies-Bouldin index (DB-index, Davies and Bouldin (1979)). In 2002, Kim et al. (2002) presented a multiobjective approach for wrapper based unsupervised feature selection. A multiobjective algorithm, ELSA (Evolutionary Local Selection Algorithm (Menczer et al. 2000)), is used as the underlying optimization technique, with K-means as the underlying partitioning tool for partitioning the given data set based on a feature combination. Both good feature subsets and the corresponding numbers of clusters are determined by ELSA. Four clustering objectives are considered for optimization: a count of the number of features, the number of clusters present in a chromosome, the intra-cluster compactness of the partitioning encoded in a chromosome, and the inter-cluster separation of that partitioning. The problem of feature selection coupled with clustering is also treated as a multiobjective optimization problem in (Julia and Knowles 2006), where a multiobjective evolutionary algorithm, the Pareto envelope-based selection algorithm version 2 (PESA-II), is used to develop a multiobjective feature selection technique. However, this technique uses Euclidean distance for assigning points to clusters and K-means as the underlying clustering technique, so it can only determine hyperspherical, equal-sized clusters. In this paper we pose the problem of feature selection for semi-supervised classification as a multiobjective optimization problem. A recently developed simulated annealing based multiobjective optimization technique, AMOSA (Bandyopadhyay et al. 2007), is utilized as the underlying optimization tool. Instead of Euclidean distance, the point symmetry based distance (Bandyopadhyay and Saha 2007) is used for assigning points to clusters.
The proposed multiobjective semi-supervised clustering and feature selection technique, called SemiFeaClusMOO, encodes a number of features and a number of cluster centers in the form of a string. Points are then assigned to clusters, based on the features present in the string, using a recently introduced point symmetry based distance (Bandyopadhyay and Saha 2007). As supervised information, we assume that the class labels of 10% of the data points are available. Four objective functions are optimized: a symmetry based cluster validity index, Sym-index (Bandyopadhyay and Saha 2007); a Euclidean distance based cluster validity index, XB-index (Xie and Beni 1991); an external cluster validity index, the Adjusted Rand Index (Yeung and Ruzzo 2001), which measures the similarity of the obtained partitioning of the labelled data points with their original class labels; and the number of features. The third objective function captures the supervised information. The fourth objective function is used to balance the bias of the first two on dimensionality: since distance values decrease as the number of features decreases, internal cluster validity indices, which require distance computations, are biased towards lower dimensions (Julia and Knowles 2006). To balance this bias, the fourth objective tries to increase the number of features used. The final Pareto optimal front contains a set of solutions representing different feature combinations and cluster centers. The algorithm automatically identifies the appropriate number of clusters, the appropriate partitioning and the appropriate feature combination for a data set. Results are shown for several higher dimensional real-life data sets.
The performance of the SemiFeaClusMOO technique is compared with: a) SemiFeaClusMOO using Euclidean distance in place of the point symmetry based distance for assigning points to clusters; b) a multiobjective simultaneous feature selection and unsupervised clustering technique, following the same AMOSA based architecture proposed in this paper, in which only three objective functions are used: Sym-index, XB-index and the number of features; c) a multiobjective automatic clustering technique in which all features are used for distance computation and the point symmetry based distance is used for assigning points to clusters; d) a single objective automatic clustering technique in which all features are used for distance computation and the point symmetry based distance is used for assigning points to clusters; and e) the K-means clustering technique with all features and the known number of clusters.
2 The SA-based MOO algorithm: AMOSA
Archived multiobjective simulated annealing (AMOSA) (Bandyopadhyay et al. 2007) is a newly developed simulated annealing (SA) based multiobjective optimization technique. MOO is used for solving real-world problems with multiple objective functions that must be optimized simultaneously. Simulated annealing, based on the principles of statistical mechanics (Kirkpatrick et al. 1983), is a search tool for solving difficult optimization problems; it dominates exhaustive search procedures because of its time and resource efficiency. SAs are widely used to solve single objective optimization problems, but there have been very few attempts to solve multiobjective optimization problems with SA. This is because of the search-from-a-point nature of SA, which yields a single solution after a single run, whereas multiobjective optimization produces a set of trade-off solutions; generating all of them would require running SA multiple times. In recent years, Bandyopadhyay et al. developed an efficient multiobjective version of SA, called AMOSA (Bandyopadhyay et al. 2007), which overcomes these limitations.
Several new concepts are introduced in AMOSA (archived multiobjective simulated annealing) (Bandyopadhyay et al. 2007), a multiobjective version of SA. AMOSA uses an archive to store all the non-dominated solutions seen so far. The archive has two limits: a hard or strict limit denoted by HL, and a larger, soft limit denoted by SL, where SL > HL. During the SA process, the non-dominated solutions generated are stored in the archive. If a newly generated solution dominates some members of the archive, those members are removed. If the size of the archive exceeds the soft limit SL, a clustering procedure is invoked to reduce it to HL.
The AMOSA algorithm starts by initializing a number (γ × SL, γ > 1) of solutions in the archive, each of which represents a state in the search space. For each solution, the multiple objective functions are computed. Each of these initial solutions is then fine-tuned using simple hill-climbing and the domination relation for a number of iterations. Thereafter non-domination sorting is applied and all the non-dominated solutions are stored in the archive, up to the soft limit SL. If the archive exceeds HL after this initialization, a single linkage clustering procedure is called to reduce its size to HL. A point from the archive is then picked randomly; at temperature T = Tmax this is taken as the current-pt, or initial solution. The current-pt is mutated to produce a new solution called the new-pt, whose objective function values are calculated. The new-pt is then compared with the current-pt and the points in the archive. A new quantity, the amount of domination Δdom(a, b) between two solutions a and b, is introduced in AMOSA and defined as

Δdom(a, b) = ∏_{i=1, f_i(a) ≠ f_i(b)}^{M} |f_i(a) − f_i(b)| / R_i,

where f_i(a) and f_i(b) are the i-th objective values of the two solutions and R_i is the corresponding range of that objective. After comparing the domination status of the new-pt, the current-pt and the points in the archive, different cases may arise: accept (i) the new-pt, (ii) the current-pt, or (iii) a solution from the archive. If during the process the archive size exceeds SL, clustering is again invoked to reduce the size to HL. These steps are executed iter times at each temperature, which is annealed with a cooling rate α (< 1) until the minimum temperature Tmin is attained. Finally the procedure terminates with the archive containing the final non-dominated solutions.
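The amount-of-domination computation above can be sketched in a few lines of Python (a minimal illustration using our own names, not code from the AMOSA authors):

```python
def delta_dom(a, b, ranges):
    """Amount of domination between two objective vectors a and b.

    a, b   : sequences of M objective values f_i(a), f_i(b)
    ranges : R_i, the range of each objective (for normalization)
    Only objectives on which the two solutions differ contribute
    to the product.
    """
    prod = 1.0
    for fa, fb, r in zip(a, b, ranges):
        if fa != fb:
            prod *= abs(fa - fb) / r
    return prod
```

A larger Δdom indicates that one solution beats the other more decisively, which AMOSA uses to grade acceptance probabilities.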
Results show that the performance of AMOSA is better than that of several other well-known MOO algorithms, especially for three or more objectives (Bandyopadhyay et al. 2007). Inspired by these results, AMOSA is utilized in the current paper as the underlying MOO technique.
3 Proposed method of multiobjective feature selection and semi-supervised clustering
This section describes the newly proposed multiobjective feature selection and semi-supervised clustering technique, SemiFeaClusMOO, in detail.
3.1 String representation and population initialization
In SemiFeaClusMOO, a state of AMOSA comprises two parts: a) a set of real numbers representing the coordinates of the cluster centers, and hence the associated partitioning of the data, and b) a set of binary numbers representing the feature combination. AMOSA tries to determine an appropriate set of cluster centers together with an appropriate feature combination. Suppose a particular string encodes the centers of K clusters and the total number of available features is F. Then the length of the string is K × F + F. The K cluster centers are K points chosen randomly from the data set, and the feature bits are randomly chosen binary values. An example of a string with K = 3 and F = 5 is given below (and in Figure 1):

<c_1^1, c_1^2, c_1^3, ..., c_1^5, c_2^1, ..., c_2^5, c_3^1, c_3^2, ..., c_3^5, 1 1 0 1 0>

This represents a partitioning with three cluster centers <c_1^1, ..., c_1^5>, <c_2^1, ..., c_2^5> and <c_3^1, ..., c_3^5>, in which only the first, second and fourth features are considered for cluster assignment and objective function calculation.
Initially K_i clusters are encoded in string i, where K_i = (rand() mod (K^max − 1)) + 2. Here rand() is a random number generator returning an integer, and K^max is a soft estimate of the upper limit on the number of clusters. The number of clusters encoded in a particular string can therefore vary between 2 and K^max.
For initialization we follow a random procedure. For a particular string i, the K_i cluster centers encoded in it are selected randomly from the data set and then encoded in the string. The initial partitioning is generated using the minimum center distance based criterion. The feature part of each string is also initialized randomly: if the data set has F features, each of the F feature positions is randomly set to 0 or 1. A value of 0 at the i-th position indicates that the i-th feature will not take part in further processing, while a value of 1 indicates that it will participate in further processing such as cluster assignment and objective value computation.
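The encoding and random initialization described above can be sketched as follows (a hypothetical helper of our own; we assume the data set is given as a list of F-dimensional points):

```python
import random

def init_string(data, k_max):
    """Build one initial AMOSA state: K_i random cluster centres
    plus a random binary feature mask.

    data  : list of points, each a list of F feature values
    k_max : soft upper estimate of the number of clusters
    """
    # K_i = (rand() mod (k_max - 1)) + 2, so K_i lies in [2, k_max]
    k_i = (random.randrange(10**9) % (k_max - 1)) + 2
    centres = random.sample(data, k_i)        # K_i distinct data points
    n_feat = len(data[0])
    mask = [random.randint(0, 1) for _ in range(n_feat)]
    if not any(mask):                         # keep at least one feature on
        mask[random.randrange(n_feat)] = 1
    return centres, mask
```

The guard against an all-zero mask is our own addition; a string selecting no features would make every distance computation degenerate.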
3.2 Assignment of points
For the assignment of points to clusters, the newly developed point symmetry based distance d_ps(a, z) (Bandyopadhyay and Saha 2007) is utilized; the description below is taken from that work. Suppose a point is denoted by a. The symmetrical (reflected) point of a with respect to a particular center z is 2 × z − a; denote it by a*. Let the knear unique nearest neighbors of a* be at Euclidean distances d_i, i = 1, 2, ..., knear. Then

d_ps(a, z) = d_sym(a, z) × d_e(a, z),

where d_e(a, z) is the Euclidean distance between a and z, and d_sym(a, z) = (Σ_{i=1}^{knear} d_i) / knear is a symmetry measure of a with respect to z. Here knear is set equal to 2. The properties of d_ps(a, z) are thoroughly described in (Bandyopadhyay and Saha 2008).
To assign points to clusters, each center encoded in a string is considered as a separate cluster. Suppose K clusters and f features are encoded in a particular string. A point a_i, 1 ≤ i ≤ n, is assigned to cluster k iff d_ps(a_i, z_k) ≤ d_ps(a_i, z_j) for j = 1, ..., K, j ≠ k, and d_sym(a_i, z_k) = d_ps(a_i, z_k) / d_e(a_i, z_k) ≤ θ. If d_ps(a_i, z_k) / d_e(a_i, z_k) > θ, the point a_i is instead assigned to the cluster m for which d_e(a_i, z_m) ≤ d_e(a_i, z_j), j = 1, 2, ..., K, j ≠ m. In other words, a point is assigned to the cluster with respect to whose center its point symmetry based distance d_ps (Bandyopadhyay and Saha 2007) is minimum, provided the corresponding d_sym value is below some threshold θ; otherwise cluster assignment follows the minimum Euclidean distance criterion, as normally done in (Bandyopadhyay and Maulik 2002) or in the K-means algorithm. The rationale is as follows. In the intermediate stages of the algorithm, when the centers are not yet set to proper values, the minimum d_ps value of a point is expected to be very large, as the point may not be symmetric with respect to any center; in such cases Euclidean distance is used for cluster assignment. In contrast, when the cluster centers have reached appropriate values, d_ps values become reasonably small and assignment based on the point symmetry based distance is more meaningful. For both the point symmetry based distance d_ps (Bandyopadhyay and Saha 2007) and the Euclidean distance, only those features present in the particular string are considered.
As explained in (Bandyopadhyay and Saha 2007), the value of θ is set equal to the maximum nearest neighbor distance among all the points in the data set.
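The assignment rule above can be sketched as follows (a simplified illustration with our own names; the original work uses Kd-trees for the nearest-neighbor search, which we replace here by a plain scan suitable only for small data sets):

```python
import math

def d_euc(a, b, mask):
    """Euclidean distance over the features switched on in mask."""
    return math.sqrt(sum((x - y) ** 2
                         for x, y, m in zip(a, b, mask) if m))

def d_ps(a, z, data, mask, knear=2):
    """Point symmetry based distance of point a w.r.t. centre z:
    d_ps = d_sym * d_e, where d_sym is the mean distance of the
    knear nearest neighbours of the reflected point 2*z - a."""
    star = [2 * zi - ai for ai, zi in zip(a, z)]
    nearest = sorted(d_euc(star, p, mask) for p in data)[:knear]
    d_sym = sum(nearest) / knear
    return d_sym * d_euc(a, z, mask)

def assign(a, centres, data, mask, theta, knear=2):
    """Assign a to the cluster of minimum d_ps if the corresponding
    d_sym <= theta, else fall back to minimum Euclidean distance."""
    k = min(range(len(centres)),
            key=lambda j: d_ps(a, centres[j], data, mask, knear))
    de = d_euc(a, centres[k], mask)
    if de > 0 and d_ps(a, centres[k], data, mask, knear) / de <= theta:
        return k
    return min(range(len(centres)),
               key=lambda j: d_euc(a, centres[j], mask))
```

Masking the distance functions directly is what restricts every computation to the features present in the string.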
3.3 Objective functions used
For optimization, three different cluster validity indices and a fourth objective function counting the number of features present in a string are considered. The first two are internal cluster validity indices reflecting different properties of good clustering solutions: the first measures the amount of symmetry present in a particular partitioning, and the second measures the goodness of the clusters in terms of Euclidean distance. The third is an external cluster validity index, the Adjusted Rand Index (Yeung and Ruzzo 2001). The fourth objective function measures the number of features present in a particular string.
The first objective function is the point symmetry distance based cluster validity index, Sym-index (Bandyopadhyay and Saha 2008). It quantifies the goodness of a partitioning in terms of symmetry and is able to identify symmetrical-shaped, well-separated clusters. The second cluster validity index, XB-index (Xie and Beni 1991), is based on Euclidean distance; it identifies compact, well-separated clusters, where compactness is measured in terms of Euclidean distance. To compute the Sym-index and XB-index values corresponding to a particular string, only those features present in that string are used.
To compute the similarity between the obtained partitioning and the true class labels of the 10% of data points whose labels are known, an external cluster validity index is used as the third objective function. This provides a way of incorporating the labelled information into the otherwise unsupervised clustering. As the external index we use the Adjusted Rand Index (ARI) (Yeung and Ruzzo 2001), which captures the matching between the obtained solution and the known true labels. ARI is computed only over the 10% of data points for which class information is known. When two partitionings agree completely the ARI value is 1; thus higher ARI values are better.
The fourth objective function is the number of features encoded in a particular string, whose value is to be maximized: f_4 = maximize ‖f‖, where ‖f‖ is the number of features present in that particular string.
The above four objective functions are computed for each string. For the Sym-index and XB-index computations only those features present in that string are considered, and the ARI value is calculated only over the 10% of data for which labelled information is known. Thus the objective functions for a particular string are

maximize { Sym(K, f), 1/XB(K, f), Adj(K), ‖f‖ },

where Sym(K, f), XB(K, f), Adj(K) and ‖f‖ denote, respectively, the computed Sym-index value, XB-index value, Adjusted Rand Index value and number of features present in that particular string; K is the number of clusters present in the string and f its feature combination. Since XB-index is a minimization-type index, its reciprocal is maximized. The search capability of the simulated annealing based MOO algorithm AMOSA is utilized to maximize these four objective functions simultaneously.
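The third objective can be computed with the standard Adjusted Rand Index formula over the labelled subset; a self-contained sketch (our own implementation, assuming neither labelling puts all points into a single cluster, which would make the denominator zero):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """ARI between the known class labels and the obtained cluster
    labels, computed from the contingency table of the two labellings."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    index = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)   # expected index under chance
    max_index = (sum_a + sum_b) / 2
    return (index - expected) / (max_index - expected)
```

Perfect agreement gives 1, while agreement no better than chance gives values near 0.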
3.4 Mutation operation
Mutation operations are applied to a particular string to generate a new string. Three different types of mutation are applied to change the cluster centers encoded in a string, and binary mutation is applied to change the feature combination: each bit of the feature vector is flipped with some probability (a 1 becomes a 0 and a 0 becomes a 1).

1. To mutate each individual cluster center, a Laplacian distribution, p(ε) ∝ e^{−|ε − μ|/δ}, is used to generate a new value for the position being mutated, where μ is the current value at that position and the scaling factor δ sets the magnitude of the mutation; δ is kept equal to 1.0. The newly generated value replaces the old one, and this perturbation is applied to all dimensions independently. Binary mutation is then applied to change the feature combination.

2. To decrease the number of clusters encoded in a string by 1, a cluster center is removed from the string. Binary mutation is again used to change the feature combination.

3. To increase the number of clusters encoded in a string by 1, a randomly chosen data point from the entire data set is added to the string as a new cluster center. Binary mutation is again used to change the feature combination.
If a particular string is chosen for mutation, one of the above three mutation operations is applied with uniform probability.
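The three mutation operators can be sketched as below (an illustrative version with our own names; the Laplacian sample is drawn by inverse-CDF, and p_flip is a bit-flip probability we introduce for the binary mutation):

```python
import math
import random

def laplace(mu, delta=1.0):
    """Draw from the Laplacian p(eps) proportional to exp(-|eps - mu|/delta)."""
    u = random.random() - 0.5
    if u == -0.5:        # avoid log(0) on the measure-zero edge case
        u = 0.0
    return mu - delta * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def mutate(centres, mask, data, p_flip=0.1):
    """Apply one of the three centre mutations with uniform probability,
    then flip each feature bit with probability p_flip."""
    centres = [list(c) for c in centres]
    op = random.randrange(3)
    if op == 0:                                   # 1. Laplacian perturbation
        centres = [[laplace(x) for x in c] for c in centres]
    elif op == 1 and len(centres) > 2:            # 2. remove a centre
        centres.pop(random.randrange(len(centres)))
    else:                                         # 3. add a random data point
        centres.append(list(random.choice(data)))
    mask = [1 - b if random.random() < p_flip else b for b in mask]
    return centres, mask
```

The guard keeping at least two clusters after a deletion is our own addition, matching the lower bound K_i ≥ 2 used at initialization.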
3.5 Selection of the best solution
Any MOO based technique produces a set of non-dominated solutions (Deb 2001) on the final Pareto optimal front. Each of these solutions provides a feature combination and a set of cluster centers, from which the associated partitioning can be obtained using the point symmetry based distance (Bandyopadhyay and Saha 2007). All the solutions are equally important from the optimization point of view, but depending on user requirements a single solution may need to be selected. In this paper a semi-supervised approach is used to select a single solution.
In the proposed semi-supervised feature selection algorithm, the class labels of 10% of the data points are assumed known. SemiFeaClusMOO generates a set of Pareto optimal solutions in the final archive. Based on the cluster centers and feature combinations present in these solutions, cluster labels are assigned to the 10% labelled points using the nearest center criterion. The similarity between these assigned labels and the original class labels is measured using an external cluster validity index, the Minkowski Score (Jiang et al. 2004), which quantifies the quality of a solution given the true clustering. Suppose T denotes the "true" clustering and S the partitioning result we wish to measure. Let n_11 denote the number of pairs of elements that are in the same cluster in both S and T, n_01 the number of pairs that are in the same cluster only in S, and n_10 the number of pairs that are in the same cluster in T but in different clusters in S. The Minkowski Score is then calculated as

MS(T, S) = sqrt((n_01 + n_10) / (n_11 + n_10)).

Its optimum value is 0, and lower values of MS are preferred.
The solution which attains the minimum Minkowski Score value calculated over the labeled points is selected as the best solution.
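The Minkowski Score can be computed by counting pairs directly, as sketched below (quadratic in the number of points, which is fine for the small labelled subset; we assume T contains at least one co-clustered pair so the denominator is nonzero):

```python
from itertools import combinations
from math import sqrt

def minkowski_score(true_labels, pred_labels):
    """MS(T, S) = sqrt((n01 + n10) / (n11 + n10)); 0 is a perfect
    match with the true clustering, so lower is better."""
    n11 = n01 = n10 = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_t = true_labels[i] == true_labels[j]
        same_s = pred_labels[i] == pred_labels[j]
        if same_t and same_s:
            n11 += 1               # together in both T and S
        elif same_s:
            n01 += 1               # together only in S
        elif same_t:
            n10 += 1               # together only in T
    return sqrt((n01 + n10) / (n11 + n10))
```

The best solution is then the archive member minimizing this score over the labelled points.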
4 Experimental results
In this section we discuss the results obtained after applying the SemiFeaClusMOO technique to several real-life data sets.
4.1 Data sets used
Seven real-life data sets obtained from http://www.ics.uci.edu/~mlearn/MLRepository.html are used for the experiments. A detailed description of these data sets in terms of the total number of points, the dimension and the number of clusters is presented in Table 1.

1. Iris: This data set contains 150 data points distributed over 3 clusters of 50 points each. It corresponds to different types of irises described by four feature values (Fisher 1936). There are three classes: Setosa, Versicolor and Virginica. Two of the classes (Versicolor and Virginica) overlap, while the third is linearly separable from the other two.

2. Cancer: The Wisconsin Breast Cancer data set contains 683 sample points, each with nine features: clump thickness, cell size uniformity, cell shape uniformity, marginal adhesion, single epithelial cell size, bare nuclei, bland chromatin, normal nucleoli and mitoses. There are two classes, malignant and benign, which are known to be well separated.

3. Newthyroid: This data set corresponds to thyroid gland data with three classes: euthyroidism, hypothyroidism and hyperthyroidism. Each of the 215 samples gives the values of five laboratory tests used to predict to which of the three classes a patient's thyroid belongs.

4. Wine: The wine recognition data set has 178 instances, each with 13 features corresponding to chemical analyses of wines. The samples are grown in the same region of Italy but derived from three different families, giving three classes; the chemical analysis determines the quantities of 13 constituents found in each of the three types of wine.

5. LiverDisorder: This liver disorder data set contains 345 instances, each with 6 features, and has two output classes.

6. LungCancer: This data set has 32 samples, each with 56 features, representing different samples of pathological lung cancers in three categories.

7. Glass: The glass identification data set consists of 214 samples, each with 9 features (an Id# feature has been removed), and 6 classes. The classification of types of glass is important from the criminological investigation point of view: glass broken at the scene of a crime can be used as evidence if it is identified correctly.
5 Discussion of results
In SemiFeaClusMOO, the simulated annealing based MOO technique AMOSA is used as the underlying optimization technique for simultaneous feature selection and semi-supervised clustering. The parameters of the proposed SemiFeaClusMOO technique are as follows: SL = 100, HL = 50, iter = 50, Tmax = 100, Tmin = 0.00001 and cooling rate α = 0.9. The performance of SemiFeaClusMOO is compared with: a) SemiFeaClusMOO using Euclidean distance in place of the point symmetry based distance for assigning points to clusters; b) FeaClusMOO, a multiobjective simultaneous feature selection and clustering technique which uses AMOSA as the underlying optimization strategy and optimizes three objective functions, Sym-index, XB-index and number of features, with no external cluster validity index; c) VAMOSA (Saha and Bandyopadhyay 2010), a point symmetry based multiobjective automatic clustering technique which tackles only the clustering problem, optimizing two cluster validity indices, Sym-index and XB-index, with all features used for distance computation and the point symmetry based distance used for assignment; d) VGAPS-clustering (Bandyopadhyay and Saha 2008), a point symmetry based automatic clustering technique utilizing the search capability of GAs, where all features are used for distance computation and the point symmetry based distance for assignment; and e) the traditional K-means clustering technique with all features used for distance computation.
Table 2 reports the number of clusters and the number of features automatically determined by the proposed SemiFeaClusMOO technique using the point symmetry based distance, the SemiFeaClusMOO technique using the Euclidean distance in place of the point symmetry based distance, the FeaClusMOO technique for feature selection and unsupervised clustering, the VAMOSA technique for unsupervised clustering only, and the VGAPS clustering technique (Bandyopadhyay and Saha 2008) for all the above mentioned data sets. The Minkowski Score (MS) values of the final clusterings identified by these six algorithms are also shown in Table 2. The Minkowski Score is an external cluster validity index which measures the goodness of an obtained partitioning with respect to the true labelling; lower values indicate better agreement. Comparison of the proposed technique is also carried out with a traditional clustering technique, Kmeans, executed on all data sets with the actual number of clusters and with all the available features. The final Minkowski Score values are reported in Table 2.
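The Minkowski Score used throughout Table 2 can be computed from the pairwise co-membership counts of the true and obtained partitionings. The following Python sketch is our own illustration (not the authors' code) of the standard pairwise definition MS = sqrt((n01 + n10)/(n11 + n01)), where n11 counts pairs grouped together in both solutions, n01 pairs together only in the true solution and n10 pairs together only in the obtained one:

```python
import numpy as np

def minkowski_score(true_labels, pred_labels):
    """Minkowski Score between a true labelling and an obtained
    partitioning. Lower values mean better agreement; 0 is a
    perfect match (up to relabelling of clusters)."""
    t = np.asarray(true_labels)
    p = np.asarray(pred_labels)
    # Pairwise co-membership: same_t[i, j] is True if points i and j
    # share a cluster in the true labelling (similarly for same_p).
    same_t = t[:, None] == t[None, :]
    same_p = p[:, None] == p[None, :]
    iu = np.triu_indices(len(t), k=1)     # each unordered pair once
    st, sp = same_t[iu], same_p[iu]
    n11 = np.sum(st & sp)                 # together in both
    n01 = np.sum(st & ~sp)                # together only in truth
    n10 = np.sum(~st & sp)                # together only in prediction
    return np.sqrt((n01 + n10) / (n11 + n01))
```

Note that the score is invariant to a permutation of cluster labels, so no label matching step is needed before evaluation.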
In SemiFeaClusMOO_{Sym}, SemiFeaClusMOO_{Euc}, the FeaClusMOO technique and the VAMOSA technique, AMOSA is used as the base optimization tool; thus the same parameter setting as listed above is utilized in each of these cases.
The parameter values of VGAPS clustering are kept as follows: the population size is set to 100 and the number of generations is set to 60 (no performance improvement is observed if the algorithm is executed for a larger number of generations). Adaptive mutation and crossover probabilities are used in case of VGAPS. Note that the parameters are set in such a way that all the algorithms are executed for an equal number of function evaluations: the total number of function evaluations computed by the AMOSA based approaches equals that computed by the VGAPS and Kmeans clustering techniques. Each of the above mentioned algorithms is applied ten times on each data set and Table 2 reports the best result out of these ten runs.
In order to show the efficacy of the proposed feature selection and clustering technique, SemiFeaClusMOO, we have used several reallife data sets. These data sets are higher dimensional in nature and thus suitable for the application of a feature selection technique; for the same reason the results cannot be demonstrated visually. Here we report the best Minkowski Scores obtained by all the algorithms over 10 different executions for all data sets. For the Iris data set the SemiFeaClusMOO clustering technique selects only a single feature out of the total four. It is able to identify the appropriate partitioning (K=3) with this single feature. The Minkowski Score value of this partitioning is the lowest compared to the other five techniques (refer to Table 2). SemiFeaClusMOO with the Euclidean distance selects three features out of the total four in its final solution. It is also not able to correctly identify the appropriate number of clusters. The corresponding MS value is larger than that obtained by SemiFeaClusMOO with the point symmetry based distance. This shows the utility of the point symmetry based distance for assigning points to different clusters. The FeaClusMOO technique is also able to determine the appropriate number of clusters. It identifies two features out of the total four. But the corresponding MS value is higher compared to that of the proposed feature selection and semisupervised clustering technique. VAMOSA and VGAPS-clustering are applied with all the features on the given data set. They attain MS values of 0.80 and 0.62, respectively. While VGAPS is able to identify the appropriate number of clusters from this data set automatically, VAMOSA fails to do so. The MS values obtained by the VAMOSA and VGAPS clustering techniques are, respectively, 0.41 and 0.23 points higher than that attained by the proposed SemiFeaClusMOO clustering technique.
As a lower MS value indicates a better partitioning, this proves the utility of feature selection. Kmeans performs poorly for this data set, attaining an MS value of 0.68. The Cancer data set has a total of nine features. Out of these, the SemiFeaClusMOO technique selects 5 features in the optimal solution. The corresponding partitioning has two clusters with the minimum MS value (refer to Table 2). The SemiFeaClusMOO_{Euc} technique also selects 5 features in its optimal solution and is able to identify the correct number of clusters from this data set. But the MS value obtained by this algorithm is higher than that obtained by the SemiFeaClusMOO technique. This again proves the utility of assigning points to different clusters based on the point symmetry based distance. The FeaClusMOO technique performs similarly to SemiFeaClusMOO. It also attains the minimum MS value of 0.31, selecting five features and two clusters in the optimal partitioning. This result shows that for this data set no improvement is observed after utilizing the labeled information (refer to Table 2). The VAMOSA and VGAPS clustering techniques are applied on this data set with all the features. Both algorithms identify the correct number of clusters. VAMOSA attains an MS value of 0.32, which is slightly higher than that obtained by the SemiFeaClusMOO technique (refer to Table 2). The MS values obtained by VGAPS clustering and Kmeans clustering are the same (refer to Table 2).
For the Newthyroid data, the proposed SemiFeaClusMOO clustering technique selects 2 features out of the total 5 in its final solution. It is able to identify the appropriate number of clusters from this data set, and the MS value attained by this technique is also the minimum (refer to Table 2). SemiFeaClusMOO_{Euc} identifies three features in its final solution and is also able to identify the correct number of clusters. The MS value attained by this algorithm is slightly higher than that obtained by the SemiFeaClusMOO clustering technique. The FeaClusMOO clustering technique selects 4 features out of the total 5. It too is able to identify the appropriate number of clusters from this data set, but the corresponding MS value is higher compared to that obtained by the proposed SemiFeaClusMOO clustering technique (refer to Table 2). This proves the utility of the semisupervised clustering technique. The VAMOSA clustering technique identifies a total of 5 clusters from this data set when applied with all the available features. The corresponding MS score is 0.57, which is 0.11 points higher than that obtained by the SemiFeaClusMOO technique. The VGAPS clustering technique with all the features also identifies the correct number of clusters for this data set, but attains a higher MS value than the SemiFeaClusMOO clustering technique. The Kmeans clustering technique performs poorly for this data set. Results on this data set again prove the utility of feature selection.
For the Wine data, the proposed SemiFeaClusMOO clustering technique is able to detect the proper number of clusters (K=3). It selects 6 features out of the total 13. The corresponding MS value is also the lowest among all the clustering techniques (refer to Table 2). The SemiFeaClusMOO_{Euc} clustering technique with the Euclidean distance is not able to detect the appropriate number of clusters from this data set; it identifies a total of 5 clusters. The MS value attained by this technique is slightly higher than that obtained by the SemiFeaClusMOO technique. This again proves the utility of the point symmetry based distance. The FeaClusMOO technique identifies three features in the final solution and is able to correctly detect the number of clusters, but the MS score obtained by this algorithm is slightly higher than that obtained by the SemiFeaClusMOO clustering technique. This again shows the utility of the semisupervised information. The VAMOSA clustering technique, when applied with all the features, is able to detect the correct number of clusters but attains a very high MS value. The VGAPS clustering technique, when applied with all the features, is able to identify the appropriate number of clusters. The corresponding MS value is 1.12, which is 0.50 points higher than the MS score obtained by the proposed SemiFeaClusMOO clustering technique. This shows the utility of feature selection. The Kmeans clustering technique performs poorly for this data set. Results on this data set reveal that Euclidean distance based clustering techniques are not suitable here.
For the LiverDisorder data set, the SemiFeaClusMOO technique performs the best compared to the other techniques (refer to Table 2). It selects five features out of the total six and identifies the appropriate number of clusters from this data set. The MS value obtained by this technique is the lowest among all. The SemiFeaClusMOO_{Euc} clustering technique selects four features in its final solution and determines three clusters from the data set. The MS score attained by this technique is 0.34 points higher than that obtained by the SemiFeaClusMOO technique. This again proves the utility of the point symmetry based distance. FeaClusMOO also determines three features in the final solution. The final partitioning identified by this algorithm has K=2 clusters. The MS value attained by this technique is also much higher than that obtained by the SemiFeaClusMOO technique. This proves that the use of labeled information helps to improve the performance of the proposed semisupervised technique. VAMOSA and VGAPS perform similarly for this data set; applied with all the available features, they attain similar MS scores. Kmeans also performs similarly.
For the LungCancer data set, the proposed SemiFeaClusMOO clustering technique again attains the minimum MS value (refer to Table 2), and the appropriate number of clusters is also detected from this data set. The FeaClusMOO technique performs similarly to SemiFeaClusMOO. The SemiFeaClusMOO clustering technique with the Euclidean distance fails to identify the appropriate number of clusters. The VAMOSA, VGAPS and Kmeans clustering techniques fail for this particular data set. Results on this data set again reveal the utility of feature selection.
For the Glass data set, the proposed SemiFeaClusMOO attains the minimum MS value (refer to Table 2). It is also able to identify the appropriate number of clusters and selects 7 features out of the total 9. The SemiFeaClusMOO clustering technique with the Euclidean distance fails to identify the appropriate number of clusters; it identifies only two features out of the total 9 in the final solution. The MS value attained by SemiFeaClusMOO with the Euclidean distance is slightly higher than that obtained by the FeaClusMOO clustering technique with the point symmetry based distance (refer to Table 2). The FeaClusMOO technique identifies five features in the final solution and correctly identifies the appropriate number of clusters, but attains a slightly higher MS score compared to the SemiFeaClusMOO clustering technique. The VAMOSA clustering technique, when applied with all the available features, attains an MS value of 1.08, which is 0.05 points higher than that obtained by the SemiFeaClusMOO clustering technique; it is able to identify the correct number of clusters. The VGAPS clustering technique, when applied with all the available features, is able to identify the appropriate number of clusters but also attains a slightly higher MS score compared to the SemiFeaClusMOO clustering technique. The Kmeans clustering technique performs poorly for this data set. Results on this data set again reveal the utility of the point symmetry based distance for data clustering.
5.1 Summary of results
Results on a wide variety of data sets show that the proposed feature selection and semisupervised clustering technique, SemiFeaClusMOO, is able to detect the appropriate feature combination, the appropriate number of clusters and the appropriate partitioning from data sets having many different types of clusters. Use of the point symmetry based distance enables the proposed algorithm to identify various symmetrical shaped clusters (hyperspherical, linear, ellipsoidal, ring shaped, etc.) having overlaps, and use of 10% labeled information helps SemiFeaClusMOO to improve the clustering performance. The proposed technique thus provides a way of incorporating some supervised knowledge into the unsupervised clustering problem, combining two problems: feature selection and semisupervised clustering. The results on seven reallife data sets of varying characteristics establish that SemiFeaClusMOO is well suited to detect clusters of widely varying characteristics. Results show that while SemiFeaClusMOO with the Euclidean distance is only able to detect hyperspherical shaped clusters well, VAMOSA and VGAPS are capable of detecting symmetrical shaped clusters. The FeaClusMOO technique does clustering and feature selection simultaneously but is not capable of handling any supervised information. The proposed SemiFeaClusMOO clustering technique is able to find the proper clustering automatically both where SemiFeaClusMOO with the Euclidean distance succeeds while FeaClusMOO/VAMOSA/VGAPS fail, and where FeaClusMOO/VAMOSA/VGAPS succeed while SemiFeaClusMOO with the Euclidean distance fails. Results also reveal the effectiveness of feature selection on the reallife data sets: the feature selection step of SemiFeaClusMOO often helps it to perform better than the VAMOSA/VGAPS clustering techniques, which are also based on the point symmetry based distance.
Similarly, the use of labeled information helps SemiFeaClusMOO to perform better than the FeaClusMOO technique. Results show that Kmeans in general fails to detect the proper partitioning from reallife data sets.
The improved performance of SemiFeaClusMOO can be attributed to the following facts. Use of labeled information helps to improve the clustering result. Use of the point symmetry based distance helps it to detect clusters having symmetrical shapes, and the symmetry based cluster validity index measures the total symmetry present in the obtained partitioning. The proposed algorithm is also capable of identifying the appropriate feature combinations. Finally, AMOSA, the underlying optimization technique, makes it capable of optimizing four objective functions simultaneously.
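The point symmetry based distance underpinning these observations can be sketched as follows. This is our own illustrative Python implementation following the published idea (reflect a point through the candidate cluster center and measure how close the reflection lies to actual data points); the parameter name knear and the neighbour count of 2 are assumptions for this sketch:

```python
import numpy as np

def ps_distance(x, c, data, knear=2):
    """Point symmetry based distance of point x from cluster center c:
    reflect x through c, average the distances from the reflected
    point to its knear nearest neighbours in the data, and weight by
    the Euclidean distance d_e(x, c). A small value means the cluster
    contains a point roughly symmetric to x with respect to c."""
    x = np.asarray(x, dtype=float)
    c = np.asarray(c, dtype=float)
    data = np.asarray(data, dtype=float)
    reflected = 2.0 * c - x                      # mirror image of x about c
    dists = np.linalg.norm(data - reflected, axis=1)
    d_sym = np.sort(dists)[:knear].mean()        # avg. distance to knear NNs
    return d_sym * np.linalg.norm(x - c)         # d_ps = d_sym * d_e
```

For a perfectly symmetric cluster the reflected point coincides with an existing data point, so the nearest-neighbour term (and hence the distance) is driven towards zero, which is what rewards symmetric cluster shapes during assignment.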
5.2 Statistical test
We have executed statistical tests guided by Demšar (2006) to establish the superiority of the proposed clustering technique, SemiFeaClusMOO. The Friedman statistical test (Friedman 1937) is conducted to check whether the six clustering techniques used here, SemiFeaClusMOO, SemiFeaClusMOO with the Euclidean distance, FeaClusMOO, VAMOSA, VGAPS and Kmeans, perform similarly or not. This test assigns a rank to each algorithm on each data set and checks whether the calculated average ranks differ significantly from the mean rank expected under the null hypothesis. The Friedman test on the above mentioned algorithms concludes that the calculated average ranks differ from the mean rank with a p value of 0.0166. The rankings of the different algorithms are shown in Table 3. Thereafter we have conducted Nemenyi's test (Nemenyi 1963) to compare the clustering techniques pairwise. In each case, α = 0.05 is kept. Results reveal that in all cases the null hypotheses (that the paired algorithms perform similarly) have to be rejected, as the corresponding p values are smaller than α.
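As an illustration of this testing procedure, the sketch below runs a Friedman test and computes the Nemenyi critical difference on hypothetical MS values; the numbers are invented for demonstration only and are not the entries of Table 2 or Table 3:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Hypothetical MS values (rows: data sets, columns: algorithms);
# lower is better. These numbers are illustrative only.
scores = np.array([
    [0.39, 0.55, 0.50, 0.80, 0.62, 0.68],
    [0.31, 0.36, 0.31, 0.32, 0.40, 0.40],
    [0.46, 0.49, 0.53, 0.57, 0.52, 0.75],
    [0.62, 0.90, 0.70, 1.30, 1.12, 1.25],
])

# Friedman test: do the algorithms perform similarly across data sets?
stat, p = friedmanchisquare(*scores.T)

# Average rank per algorithm (rank 1 = best, i.e. lowest MS).
ranks = np.apply_along_axis(rankdata, 1, scores).mean(axis=0)

# Nemenyi critical difference: two algorithms differ significantly
# if their average ranks differ by more than CD. q_alpha = 2.850 is
# the studentized-range constant for k = 6 algorithms at alpha = 0.05
# (tabulated in Demsar 2006).
k, n = scores.shape[1], scores.shape[0]
cd = 2.850 * np.sqrt(k * (k + 1) / (6.0 * n))
```

With real results one would replace the array by the per-data-set MS values of the six algorithms and declare a pairwise difference significant whenever the gap in average ranks exceeds cd.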
6 Conclusion
In this paper the problem of simultaneous feature selection and semisupervised clustering is formulated as a multiobjective optimization problem. Semisupervised clustering is a relatively new domain in which concepts of supervised and unsupervised classification are combined; it utilizes a small amount of labeled data together with a large amount of unlabeled data. We have proposed a new way of utilizing the labeled information for solving the unsupervised classification problem. Since not all the features of a data set may be important for clustering, we first select some features and then perform semisupervised clustering based on these features. A new multiobjective (MO) clustering technique, SemiFeaClusMOO, is proposed which can automatically (a) identify the appropriate set of features from the data set, (b) partition the data into an appropriate number of clusters and (c) utilize the labeled information. Here cluster centers and feature combinations are encoded in the form of a string. Assignment of points to different clusters is done using the selected set of features based on the point symmetry based distance. Four objective functions are considered: the first measures the total compactness of the partitioning based on the Euclidean distance, the second measures the total symmetry of the clusters, the third measures the similarity between the available labeled information and the obtained partitioning, and the last counts the number of features. These objective functions are optimized simultaneously using the search capability of AMOSA, a newly developed simulated annealing based multiobjective optimization method, in order to determine the appropriate feature combination, the appropriate number of clusters as well as the appropriate partitioning.
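The string encoding described above can be illustrated roughly as follows. The exact layout, bounds and names (k_max, the binary feature mask, center initialization from data points) are our assumptions for this sketch, not the authors' specification:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_solution(data, k_max=6):
    """Sketch of one candidate string: a binary mask selecting
    features plus a variable-length set of cluster centers expressed
    in the selected feature subspace. Centers are seeded from
    randomly chosen data points, a common initialization choice."""
    n_features = data.shape[1]
    mask = rng.integers(0, 2, size=n_features).astype(bool)
    if not mask.any():                       # keep at least one feature
        mask[rng.integers(n_features)] = True
    k = int(rng.integers(2, k_max + 1))      # number of clusters, K >= 2
    idx = rng.choice(len(data), size=k, replace=False)
    centers = data[idx][:, mask]
    return mask, centers
```

Because both the mask and the number of centers vary from string to string, the search can simultaneously explore feature subsets, cluster counts and center positions, which is what lets AMOSA optimize all four objectives over a single population of such strings.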
The performance of the proposed algorithm, SemiFeaClusMOO, is compared with the Euclidean distance based version of the same algorithm; the FeaClusMOO technique, which tackles the feature selection problem under an unsupervised classification framework; a multiobjective based automatic clustering technique, VAMOSA; a single objective clustering technique, VGAPS; and a traditional clustering technique, Kmeans, on several data sets having different characteristics. Results reveal that the proposed semisupervised feature selection technique is capable of detecting the appropriate feature combinations and the appropriate partitioning from data sets having point symmetric clusters.
In future we would like to explore further objective functions and to test our approach more extensively. In order to select a single solution from the final Pareto optimal front we have developed a semisupervised approach; in future we would like to develop additional techniques for this purpose.
References
Aha DW, Bankert RL: A comparative evaluation of sequential feature selection algorithms. In Learning from data: artificial intelligence and statistics V, Lecture Notes in Statistics, chap. 4. Edited by: Fisher DH, Lenz HJ. New York: Springer-Verlag; 1996:199–206.
Bandyopadhyay S, Maulik U: Genetic clustering for automatic evolution of clusters and application to image classification. Pattern Recogn 2002, 35(2):1197–1208.
Bandyopadhyay S, Saha S: GAPS: a clustering method using a new point symmetry based distance measure. Pattern Recogn 2007, 40:3430–3451. doi:10.1016/j.patcog.2007.03.026
Bandyopadhyay S, Saha S: A point symmetry based clustering technique for automatic evolution of clusters. IEEE Trans Knowl Data Eng 2008, 20(11):1–17.
Bandyopadhyay S, Saha S, Maulik U, Deb K: A simulated annealing based multiobjective optimization algorithm: AMOSA. IEEE Trans Evol Comput 2007, 12(3):269–283.
Bermejo P, de la Ossa L, Gámez JA, Puerta JM: Fast wrapper feature subset selection in high-dimensional datasets by means of filter re-ranking. Knowl-Based Syst 2012, 25(1):35–44. doi:10.1016/j.knosys.2011.01.015
Blum AL, Langley P: Selection of relevant features and examples in machine learning. Artif Intell 1997, 97(1–2):245–271. doi:10.1016/S0004-3702(97)00063-5
Dash M, Liu H: Handling large unsupervised data via dimensionality reduction. In 1999 ACM SIGMOD workshop on research issues in data mining and knowledge discovery; 1999.
Davies DL, Bouldin DW: A cluster separation measure. IEEE Trans Pattern Anal Mach Intell 1979, 1:224–227.
Deb K: Multiobjective optimization using evolutionary algorithms. England: John Wiley and Sons, Ltd; 2001.
Deb K, Pratap A, Agarwal S, Meyarivan T: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Trans Evol Comput 2002, 6(2):182–197. doi:10.1109/4235.996017
Demšar J: Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 2006, 7:1–30.
Dy JG, Brodley CE: Feature selection for unsupervised learning. J Mach Learn Res 2004, 5:845–889.
Fisher RA: The use of multiple measurements in taxonomic problems. Ann Eugen 1936, 7:179–188.
Friedman M: The use of ranks to avoid the assumption of normality implicit in the analysis of variance. J Am Stat Assoc 1937, 32(200):675–701. doi:10.1080/01621459.1937.10503522
Handl J, Knowles J: On semisupervised clustering via multiobjective optimization. In Proceedings of the 8th annual conference on genetic and evolutionary computation, GECCO '06. New York: ACM; 2006:1465–1472. doi:10.1145/1143997.1144238
Jiang D, Tang C, Zhang A: Cluster analysis for gene expression data: a survey. IEEE Trans Knowl Data Eng 2004, 16(11):1370–1386. doi:10.1109/TKDE.2004.68
Handl J, Knowles J: Feature subset selection in unsupervised learning via multiobjective optimization. Int J Comput Intell Res 2006, 2(3):217–234.
Kim Y, Street WN, Menczer F: Evolutionary model selection in unsupervised learning. Intell Data Anal 2002, 6(6):531–556.
Kirkpatrick S, Gelatt CD, Vecchi MP: Optimization by simulated annealing. Science 1983, 220:671–680.
Li T, Zhu S, Li Q, Ogihara M: Gene functional classification by semisupervised learning from heterogeneous data. In Proceedings of the 2003 ACM symposium on applied computing, SAC '03. New York: ACM; 2003:78–82. doi:10.1145/952532.952552
Liu H, Yu L: Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 2005, 17(4):491–502. doi:10.1109/TKDE.2005.66
Menczer F, Degeratu M, Street WN: Efficient and scalable Pareto optimization by evolutionary local selection algorithms. Evol Comput 2000, 8(2):223–247. doi:10.1162/106365600568185
Mitra P, Murthy CA, Pal SK: Unsupervised feature selection using feature similarity. IEEE Trans Pattern Anal Mach Intell 2002, 24:301–312. doi:10.1109/34.990133
Morita M, Sabourin R, Bortolozzi F, Suen CY: Unsupervised feature selection using multiobjective genetic algorithms for handwritten word recognition. In Proceedings of the seventh international conference on document analysis and recognition, Volume 2, ICDAR '03. Washington, DC: IEEE Computer Society; 2003:666.
Nemenyi P: Distribution-free multiple comparisons. Ph.D. thesis; 1963.
Peña JM, Lozano JA, Larrañaga P, Inza I: Dimensionality reduction in unsupervised learning of conditional Gaussian networks. IEEE Trans Pattern Anal Mach Intell 2001, 23(6):590–603. doi:10.1109/34.927460
Saha S, Bandyopadhyay S: A new symmetry based multiobjective clustering technique for automatic evolution of clusters. Pattern Recogn 2010, 43(3):738–751. doi:10.1016/j.patcog.2009.07.004
Talavera L: Feature selection as a preprocessing step for hierarchical clustering. In Proceedings of the sixteenth international conference on machine learning, ICML '99. San Francisco: Morgan Kaufmann Publishers Inc.; 1999:389–397.
Xie XL, Beni G: A validity measure for fuzzy clustering. IEEE Trans Pattern Anal Mach Intell 1991, 13:841–847. doi:10.1109/34.85677
Yeung KY, Ruzzo WL: An empirical study on principal component analysis for clustering gene expression data. Bioinformatics 2001, 17(9):763–774. doi:10.1093/bioinformatics/17.9.763
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
SS conceived the idea. SS, AE and RS developed the proposed feature selection and semisupervised clustering technique. AL executed the algorithm on all data sets and analyzed the results. SS and AE both participated in writing the manuscript. All the authors read and approved the final manuscript.
Asif Ekbal, Abhay Kumar Alok and Rachamadugu Spandana contributed equally to this work.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Saha, S., Ekbal, A., Alok, A.K. et al. Feature selection and semisupervised clustering using multiobjective optimization. SpringerPlus 3, 465 (2014). https://doi.org/10.1186/2193-1801-3-465
Keywords
 Clustering
 Multiobjective optimization (MOO)
 Symmetry
 Cluster validity indices
 Semisupervised clustering
 Feature selection
 Multicenter
 Automatic determination of number of clusters