An alternative data filling approach for prediction of missing data in soft sets (ADFIS)
- Muhammad Sadiq Khan^{1},
- Mohammed Ali Al-Garadi^{1},
- Ainuddin Wahid Abdul Wahab^{1} and
- Tutut Herawan^{1}
Received: 13 January 2016
Accepted: 8 July 2016
Published: 15 August 2016
Abstract
Soft set theory is a mathematical approach that provides a solution for dealing with uncertain data. A standard soft set can be represented as a Boolean-valued information system, and hence it has been used in hundreds of useful applications. However, these applications become worthless if the Boolean information system contains missing data due to error, security restrictions or mishandling. Few studies have focused on handling partially incomplete soft sets, and none of them achieves a high accuracy rate when predicting missing data. The data filling approach for incomplete soft sets (DFIS) has been shown to have the best performance among all previous approaches. However, in reviewing DFIS, accuracy is still its main problem. In this paper, we propose an alternative data filling approach for prediction of missing data in soft sets, namely ADFIS. The novelty of ADFIS is that, unlike the previous approach that used probability, we focus more on the reliability of association among parameters in a soft set. Experimental results on a small example, four UCI benchmark data sets and the causality workbench lung cancer (LUCAP2) data set show that ADFIS achieves better accuracy than DFIS.
Background
Soft set theory, proposed by Molodtsov, is considered a mathematical model for dealing with vague and uncertain data (Molodtsov 1999). Compared with existing theories for handling vague data, such as fuzzy sets, rough sets, vague sets and statistical approaches, it is distinguished by the adequacy of its parameterization. Research on soft set theory, both theoretical and practical, has attracted much attention, especially in the field of decision making. The first attempt at soft set decision making was introduced by Maji et al. (2002). They presented the first application of soft sets in decision making by representing a soft set as a Boolean table and defining its reduct set. Their work on reducts was improved by Chen et al., further improved by Kong et al. and subsequently by Ma et al., for decision making with sub-optimal choices and simplified approaches, respectively (Chen et al. 2005; Kong et al. 2008; Ma et al. 2011). In parallel with these developments, researchers used soft sets to handle uncertain data in daily life and applied them in a variety of useful applications (Cagman and Enginoglu 2012; Cagman et al. 2011; Çelik and Yamak 2013; Herawan and Deris 2011; Jun et al. 2009; Jun and Park 2008; Kalaichelvi and Malini 2011; Kalayathankal and Singh 2010; Tanay and Kandemir 2011; Xiao et al. 2009; Yuksel et al. 2013). In some applications, however, researchers faced the problem of incomplete soft sets with partially missing values. Data in soft sets and related sets can go missing because of many factors, such as improper entry, virus attacks, security reasons and errors during data transfer. Incomplete soft sets can no longer be applied in any application or, if still applied, may yield extremely large, very small, unexpected or misleading results. Such results, especially a wrong decision, can cause a huge loss to an individual or an organization. To cope with this situation, Zou et al. presented a weighted-average technique for calculating decision values in incomplete soft sets and an average-probability technique for predicting missing values in fuzzy soft sets (Zou and Xiao 2008). Qin et al. proposed DFIS, arguing that data prediction in an incomplete soft set is more reliable and accurate if the values are recalculated through association between parameters; they used simple probability for cases with zero or weak association (Qin et al. 2012). Rose et al. also contributed to the completion of incomplete soft sets using parity bits and aggregate values (Mohd Rose et al. 2011; Rose et al. 2011). Subsequently, Kong et al. (2014) improved the approach of Zou et al. (Zou and Xiao 2008) by presenting an equivalent probability technique that has lower complexity and also determines the actual missing data instead of only the decision values. However, the Kong et al. approach still suffers from inherited shortcomings and has lower accuracy than DFIS.
- (a) We propose an alternative data filling approach for prediction of missing data in soft sets (ADFIS). The novelty of ADFIS is that, unlike the previous approach that used probability, we focus more on the reliability of association between parameters.
- (b) In contrast to DFIS, we revise the association calculation procedure to predict the maximum possible number of unknowns through association.
- (c) To validate our work, we perform extensive experiments on four UCI benchmark data sets and the causality workbench lung cancer (LUCAP2) data set to show the performance of ADFIS.
- (d) We compare the results with other baseline approaches mentioned in the literature.
Soft set
Let U be an initial non-empty universal set and E be a set of parameters related to U. According to Molodtsov (1999), a pair (F, E) is called a soft set over U if and only if F is a mapping from E into the set of all subsets of U. The following example illustrates a soft set.
Example 1
Representation of soft set in tabular form
Tabular representation of a soft set (F, E) in a Boolean-valued information system and its decision value
U/E | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} | d _{ i } |
---|---|---|---|---|---|---|---|
h _{1} | 1 | 1 | 0 | 0 | 0 | 1 | 3 |
h _{2} | 0 | 1 | 1 | 1 | 0 | 1 | 4 |
h _{3} | 1 | 1 | 1 | 0 | 0 | 0 | 3 |
h _{4} | 0 | 1 | 1 | 1 | 0 | 1 | 4 |
h _{5} | 1 | 0 | 0 | 0 | 1 | 1 | 3 |
From Table 1, the maximum decision value is 4, attained by both houses h _{2} and h _{4}. Hence, either h _{2} or h _{4} can be the optimal house choice, while the other houses are sub-optimal options. In the following section, we discuss the incomplete soft set.
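The choice read from Table 1 can be reproduced in a few lines. The following is a minimal sketch, assuming that the decision value d _{ i } is the row sum of the Boolean table, as the d _{ i } column of Table 1 suggests; the names `table` and `decision_values` are illustrative and do not come from the original work.

```python
# Boolean table of the soft set (F, E) from Table 1: rows h1..h5, columns e1..e6.
table = {
    "h1": [1, 1, 0, 0, 0, 1],
    "h2": [0, 1, 1, 1, 0, 1],
    "h3": [1, 1, 1, 0, 0, 0],
    "h4": [0, 1, 1, 1, 0, 1],
    "h5": [1, 0, 0, 0, 1, 1],
}


def decision_values(table):
    """Assumed rule: the decision value of an object is the sum of its row."""
    return {obj: sum(row) for obj, row in table.items()}


d = decision_values(table)
best = max(d.values())
# All objects that attain the maximum decision value are optimal choices.
optimal = [obj for obj, value in d.items() if value == best]

print(d)
print(optimal)
```

Objects outside `optimal` correspond to the sub-optimal options mentioned above.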
Incomplete soft set
An information system \(S^{*} = \left( U, AT, V_{r}, f \right)\) is called incomplete if f(x _{ i }, a _{ j }) is unknown for at least one pair (x _{ i }, a _{ j }), where U = (x _{1}, x _{2}, …, x _{ n }), AT = (a _{1}, a _{2}, …, a _{ m }), \(x_{i} \in U\) for i = (1, 2, 3, …, n) and \(a_{j} \in AT\) for j = (1, 2, 3, …, m). The following example illustrates an incomplete information system representing an incomplete soft set, where unknown entries in the table are represented by the symbol “*”.
Example 2
Representation of incomplete soft set
U/E | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} |
---|---|---|---|---|---|---|
s _{1} | 0 | 1 | 1 | 1 | 0 | 0 |
s _{2} | 0 | 1 | 0 | 0 | 0 | 1 |
s _{3} | 1 | 0 | 0 | 1 | 0 | 0 |
s _{4} | 1 | 0 | \(*_{1}\) | 0 | \(*_{2}\) | 1 |
s _{5} | 0 | 1 | 1 | 0 | 0 | 1 |
s _{6} | 1 | 0 | 0 | \(*_{3}\) | 0 | 0 |
s _{7} | \(*_{4}\) | 1 | 1 | 1 | 0 | 0 |
s _{8} | 0 | 0 | 1 | 0 | 0 | 1 |
From the incomplete Boolean Table 2, we know that candidate 4 is young, inexperienced and has a Ph.D. as his highest degree, but it is unknown whether he is married and whether he studied abroad. Similarly, for candidates 6 and 7, the values of “highest degree is master” and “young age” are unknown, respectively. Hence it is an incomplete soft set with unknown values represented by \(*_{1}\), \(*_{2}\), \(*_{3}\) and \(*_{4}\).
Related works
In this section, we discuss three previous soft set-based approaches for handling incomplete data. We first review these techniques one by one and then compare them to identify the most appropriate one for predicting missing data in soft sets.
Zou et al. approach
Decision value calculated by Zou et al. technique for incomplete soft set of Example 2
U/E | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} | d _{ i } |
---|---|---|---|---|---|---|---|
s _{1} | 0 | 1 | 1 | 1 | 0 | 0 | 3 |
s _{2} | 0 | 1 | 0 | 0 | 0 | 1 | 2 |
s _{3} | 1 | 0 | 0 | 1 | 0 | 0 | 2 |
s _{4} | 1 | 0 | \(*_{1}\) | 0 | \(*_{2}\) | 1 | 2.57 |
s _{5} | 0 | 1 | 1 | 0 | 0 | 1 | 3 |
s _{6} | 1 | 0 | 0 | \(*_{3}\) | 0 | 0 | 1.43 |
s _{7} | \(*_{4}\) | 1 | 1 | 1 | 0 | 0 | 3.43 |
s _{8} | 0 | 0 | 1 | 0 | 0 | 1 | 2 |
Qin et al. approach
The approach proposed by Qin et al. (2012) prefers to predict missing values through association between parameters. This association is considered the first case of their approach. For instance, in Example 1, it is an inconsistent association that an old house cannot be new and a cheap house cannot be expensive. Similarly, in the same example, “a beautiful house is most probably expensive” is a consistent association. In Example 2, the highest degree can be either master or doctoral, which indicates an inconsistent association.
Mathematical description of this technique is explained below.
Example 3
Predicting values through DFIS for incomplete case of Example 2. Here the parameters e _{1}, e _{3}, e _{4} and e _{5} have missing data.
Step 1 Finding consistency CN _{ ij } and inconsistency IN _{ ij }.
First we consider parameter 1 with parameter 2. Only s _{8} has the same value (0) for both e _{1} and e _{2}, therefore CN _{12} = 1; the values differ for the other 6 objects (excluding s _{7}, whose e _{1} value is missing), therefore IN _{12} = 6. Similarly, (CN _{13} = 1, IN _{13} = 5), (CN _{14} = 4, IN _{14} = 2), (CN _{15} = 4, IN _{15} = 2) and (CN _{16} = 2, IN _{16} = 5).
Step 2 Calculating ratio of consistency CD _{ ij } and ratio of inconsistency ID _{ ij }.
First we need to find the cardinality \(\left| U_{ij} \right|\) for calculating CD _{ ij } and ID _{ ij }. As parameters 1 and 2 have seven complete pairs (all objects except s _{7}), \(\left| U_{12} \right| = 7\). Similarly, \(\left| U_{13} \right| = \left| U_{14} \right| = \left| U_{15} \right| = 6\) and \(\left| U_{16} \right| = 7\).
Hence, CD _{12} = \(CN_{12} / \left| U_{12} \right|\) = 1/7 = 0.14 and ID _{12} = \(IN_{12} / \left| U_{12} \right|\) = 6/7 = 0.86. Similarly, (CD _{13} = 0.17, ID _{13} = 0.83), (CD _{14} = 0.67, ID _{14} = 0.33), (CD _{15} = 0.67, ID _{15} = 0.33) and (CD _{16} = 2/7 = 0.29, ID _{16} = 5/7 = 0.71).
Step 3 Deciding whether association is consistent or inconsistent.
As D _{ ij } = max{CD _{ ij }, ID _{ ij }}, D _{12} = max{CD _{12}, ID _{12}} = max{0.14, 0.86} = 0.86. As the association is inconsistent, a minus (−) sign is used to indicate it and to differentiate it from a consistent one, i.e. D _{12} = −0.86. Similarly, D _{13} = −0.83, D _{14} = 0.67, D _{15} = 0.67 and D _{16} = −0.71.
Step 4 Calculating maximal degree of association.
Calculation of D _{ ij }
\(E^{*} /E\) | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} |
---|---|---|---|---|---|---|
e _{1} | – | −0.86 | −0.83 | 0.67 | 0.67 | −0.71 |
e _{3} | −0.83 | 0.71 | – | ±0.5 | −0.57 | 0.57 |
e _{4} | 0.67 | 0.57 | ±0.5 | – | ±0.5 | −1 |
e _{5} | 0.67 | −0.57 | −0.57 | ±0.5 | – | 0.57 |
From Table 4, we see that for e _{1} the largest magnitude among D_{12}, D_{13}, D_{14}, D_{15} and D_{16} is max{0.86, 0.83, 0.67, 0.67, 0.71} = 0.86, so with the sign retained D _{1} = −0.86. Similarly, D _{3} = −0.83, D _{4} = −1 and D _{5} = 0.67.
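Steps 1 to 4 can be checked mechanically. The sketch below is an illustration under stated assumptions, not the authors' implementation: unknown entries are encoded as `None`, only objects with both values known count towards \(\left| U_{ij} \right|\), and a tie between CD _{ ij } and ID _{ ij } (the ±0.5 entries) is returned with a positive sign. The function names `association` and `maximal_degree` are hypothetical.

```python
# Incomplete Boolean table of Example 2 (Table 2); None marks an unknown entry.
table = [
    [0, 1, 1, 1, 0, 0],        # s1
    [0, 1, 0, 0, 0, 1],        # s2
    [1, 0, 0, 1, 0, 0],        # s3
    [1, 0, None, 0, None, 1],  # s4
    [0, 1, 1, 0, 0, 1],        # s5
    [1, 0, 0, None, 0, 0],     # s6
    [None, 1, 1, 1, 0, 0],     # s7
    [0, 0, 1, 0, 0, 1],        # s8
]


def association(table, i, j):
    """Return (CN_ij, IN_ij, signed D_ij) for parameters i and j (0-based)."""
    # Only objects where both parameters are known form U_ij.
    pairs = [(row[i], row[j]) for row in table
             if row[i] is not None and row[j] is not None]
    cn = sum(a == b for a, b in pairs)   # consistent pairs
    inc = len(pairs) - cn                # inconsistent pairs
    cd, idg = cn / len(pairs), inc / len(pairs)
    # Negative sign marks an inconsistent association (a tie is kept positive).
    d = cd if cd >= idg else -idg
    return cn, inc, d


def maximal_degree(table, i):
    """Signed D_ij of largest magnitude over all partners j of parameter i."""
    n = len(table[0])
    degrees = [association(table, i, j)[2] for j in range(n) if j != i]
    return max(degrees, key=abs)


cn12, in12, d12 = association(table, 0, 1)
d1 = maximal_degree(table, 0)
d4 = maximal_degree(table, 3)
print(cn12, in12, round(d12, 2))
print(round(d1, 2), round(d4, 2))
```

Calling `association(table, i, j)` for the remaining pairs gives the other entries of the degree table.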
Step 5 Putting values according to association
We set the threshold λ = 0.85. Only e _{1} and e _{4} satisfy the condition for prediction by association, because \(\left| D_{1} \right| = \left| -0.86 \right| > \lambda\) and \(\left| D_{4} \right| = \left| -1 \right| > \lambda\). From Table 4, e _{1} has an inconsistent association with e _{2}, and the element u _{72} corresponding to the missing element (\(*_{4}\) = u _{71}) has the value 1 in Table 2. As the complement value is assigned in the case of inconsistent association, we put \(*_{4}\) = 0. Similarly, we calculate \(*_{3}\) = 1.
Step 6 Calculating probabilities for weak association.
Incomplete soft set completed using DFIS, predicted values are shown in italics
U/E | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} |
---|---|---|---|---|---|---|
s _{1} | 0 | 1 | 1 | 1 | 0 | 0 |
s _{2} | 0 | 1 | 0 | 0 | 0 | 1 |
s _{3} | 1 | 0 | 0 | 1 | 0 | 0 |
s _{4} | 1 | 0 | 1 | 0 | 0 | 1 |
s _{5} | 0 | 1 | 1 | 0 | 0 | 1 |
s _{6} | 1 | 0 | 0 | 1 | 0 | 0 |
s _{7} | 0 | 1 | 1 | 1 | 0 | 0 |
s _{8} | 0 | 0 | 1 | 0 | 0 | 1 |
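The whole DFIS pass can be sketched as follows. This is a hedged reconstruction from the worked example, not the published algorithm: associations are computed once on the original table, a parameter whose maximal degree reaches λ copies (or complements) the value of its strongest partner, and all other unknowns fall back to a majority rule on the counts n _{1} and n _{0} of known ones and zeros. The majority rule, the tie handling (0 when n _{1} = n _{0}) and the fallback when the partner value is itself unknown are assumptions; `dfis_fill` is a hypothetical name.

```python
# Incomplete Boolean table of Example 2 (Table 2); None marks an unknown entry.
table = [
    [0, 1, 1, 1, 0, 0],        # s1
    [0, 1, 0, 0, 0, 1],        # s2
    [1, 0, 0, 1, 0, 0],        # s3
    [1, 0, None, 0, None, 1],  # s4
    [0, 1, 1, 0, 0, 1],        # s5
    [1, 0, 0, None, 0, 0],     # s6
    [None, 1, 1, 1, 0, 0],     # s7
    [0, 0, 1, 0, 0, 1],        # s8
]


def association(table, i, j):
    """Signed degree D_ij: positive if consistent, negative if inconsistent."""
    pairs = [(row[i], row[j]) for row in table
             if row[i] is not None and row[j] is not None]
    if not pairs:
        return 0.0
    cd = sum(a == b for a, b in pairs) / len(pairs)
    return cd if cd >= 1 - cd else -(1 - cd)


def dfis_fill(table, lam=0.85):
    """One-shot filling: associations are computed once on the original table."""
    n = len(table[0])
    filled = [row[:] for row in table]
    for i in range(n):
        rows = [k for k, row in enumerate(table) if row[i] is None]
        if not rows:
            continue
        degrees = {j: association(table, i, j) for j in range(n) if j != i}
        j_best = max(degrees, key=lambda j: abs(degrees[j]))
        d_best = degrees[j_best]
        known = [row[i] for row in table if row[i] is not None]
        n1 = sum(known)
        n0 = len(known) - n1
        for k in rows:
            partner = table[k][j_best]
            if abs(d_best) >= lam and partner is not None:
                # Strong association: copy or complement the partner value.
                filled[k][i] = partner if d_best > 0 else 1 - partner
            else:
                # Weak association: assumed majority rule on known values.
                filled[k][i] = 1 if n1 > n0 else 0
    return filled


completed = dfis_fill(table)
for row in completed:
    print(row)
```

With λ = 0.85 this sketch fills e _{1} and e _{4} by association and e _{3} and e _{5} by the majority rule, which is the split described in Steps 5 and 6.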
Kong et al. approach
Incomplete soft set of Example 2 after completion and d _{ i } calculation using Kong et al. approach
U/E | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} | d _{ i } |
---|---|---|---|---|---|---|---|
s _{1} | 0 | 1 | 1 | 1 | 0 | 0 | 3 |
s _{2} | 0 | 1 | 0 | 0 | 0 | 1 | 2 |
s _{3} | 1 | 0 | 0 | 1 | 0 | 0 | 2 |
s _{4} | 1 | 0 | \(\frac{4}{4 + 3}\) | 0 | \(\frac{0}{0 + 7}\) | 1 | 2.57 |
s _{5} | 0 | 1 | 1 | 0 | 0 | 1 | 3 |
s _{6} | 1 | 0 | 0 | \(\frac{3}{3 + 4}\) | 0 | 0 | 1.43 |
s _{7} | \(\frac{3}{3 + 4}\) | 1 | 1 | 1 | 0 | 0 | 3.43 |
s _{8} | 0 | 0 | 1 | 0 | 0 | 1 | 2 |
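The fractions in the table above follow a simple pattern: each unknown in a column is replaced by n _{1}/(n _{1} + n _{0}), where n _{1} and n _{0} are the numbers of known ones and zeros in that column. The sketch below reproduces the table under that reading; it is inferred from the table entries rather than taken from the Kong et al. paper, it assumes every column has at least one known value, and `kong_fill` is a hypothetical name.

```python
from fractions import Fraction

# Incomplete Boolean table of Example 2 (Table 2); None marks an unknown entry.
table = [
    [0, 1, 1, 1, 0, 0],        # s1
    [0, 1, 0, 0, 0, 1],        # s2
    [1, 0, 0, 1, 0, 0],        # s3
    [1, 0, None, 0, None, 1],  # s4
    [0, 1, 1, 0, 0, 1],        # s5
    [1, 0, 0, None, 0, 0],     # s6
    [None, 1, 1, 1, 0, 0],     # s7
    [0, 0, 1, 0, 0, 1],        # s8
]


def kong_fill(table):
    """Replace each unknown by n1 / (n1 + n0) of its column (assumed rule)."""
    filled = [row[:] for row in table]
    for j in range(len(table[0])):
        known = [row[j] for row in table if row[j] is not None]
        n1 = sum(known)
        n0 = len(known) - n1
        p = Fraction(n1, n1 + n0)  # raises if the whole column is unknown
        for row in filled:
            if row[j] is None:
                row[j] = p
    return filled


completed = kong_fill(table)
# Decision value of each object: row sum, now possibly fractional.
d = [round(float(sum(row)), 2) for row in completed]
print(d)
```

The filled entries stay in [0, 1] rather than in {0, 1}, which is the point taken up in the comparison with DFIS.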
Comparison of previous approaches
The Zou et al. and Kong et al. approaches give approximately the same results, and the Zou et al. approach has already been compared with DFIS in detail (Kong et al. 2014). We therefore compare the three previous techniques pairwise, as follows.
Zou et al. versus Kong et al.
The Zou et al. approach calculates only the decision values of an incomplete soft set, and the missing data remain missing. The Kong et al. approach yields the same d _{ i } results as the Zou et al. approach and additionally assigns a set of values to the originally missing information. Secondly, the computational complexity of the Kong et al. approach is O(n ^{2}), while that of the Zou et al. approach is \(O\left( n \cdot 2^{n} \right)\), showing that the Kong et al. approach is less complex than the Zou et al. approach (Kong et al. 2014). Therefore, the Kong et al. technique is more appropriate and efficient than the Zou et al. approach.
Kong et al. versus DFIS
The Kong et al. approach works only on probability and ignores any association between parameters, so its predictions may differ from the actual values. Secondly, it predicts missing values in the range [0, 1], while the actual value must be either 0 or 1 in a standard soft set (Boolean information system). In contrast, DFIS prefers to predict actual values through association and uses probability only when the association is not strong. Secondly, in both cases it produces binary values, maintaining the integrity of the standard soft set. Thirdly, compared with the Zou et al. results, its decision values are much closer to the actual values, as shown in experimental results (Qin et al. 2012). The average mean absolute percentage error (MAPE) of DFIS is 0.07, while that of the Zou et al. approach is 0.11, over all five data sets used for DFIS. Converted to percent accuracy, the average accuracy of DFIS is 93.17 %, while that of the Zou et al. approach is 89.12 %, in calculating decision values. Notably, the Zou et al. and Kong et al. approaches produce the same decision values (Kong et al. 2014); consequently, the average accuracy of DFIS in decision values is 4.04 % higher than that of the Kong et al. technique. Hence DFIS is more suitable than the Kong et al. approach.
- 1. Access the whole data set of size m × n once to get the number of missing values
- 2. Compute the degrees of consistency and inconsistency, of complexity n
- 3. Compute the probability, of complexity n, when the association is weak
- 4. Access the m × n table once again to insert the computed values
Comparison of previous approaches with DFIS
Hence, from above associative comparison visualized in Table 7, we conclude that DFIS is more suitable than Zou et al. and Kong et al. approaches for prediction of missing values in soft set. However, in reviewing DFIS, accuracy is still its main problem. Therefore, the following section discusses an alternative data filling approach for prediction of missing data in soft sets, namely ADFIS.
Alternative approach for data filling of incomplete soft sets
In this section, an alternative approach for data filling of incomplete soft sets (ADFIS) is presented. The previous approach, DFIS, preferred association between parameters over probability for predicting missing values, and we discussed that association yields more accurate values than probability. However, DFIS itself does not consider all possible associations precisely enough to obtain more accurate results. In contrast to DFIS, we revise the association calculation method to consider all possible associations precisely and to predict the maximum possible number of unknowns through it. The novelty of ADFIS is that it focuses more on the reliability of association than DFIS does.
Definition 1
Two parameters e _{ i } and e _{ j } are said to be consistent with each other, \(e_{i} \, \Leftrightarrow e_{j}\), if there is a strongest consistent association between them, i.e. SA _{ ij } ≥ λ and max{CD _{ ij }, ID _{ ij }} = CD _{ ij }, where λ is a pre-set threshold value (for more details, see “Discussions”).
Definition 2
Two parameters e _{ i } and e _{ j } are said to be inconsistent with each other, \(e_{i}\;{ \Rrightarrow }\;e_{j}\), if there is a strongest inconsistent association between them, i.e. SA _{ ij } ≥ λ and max{CD _{ ij }, ID _{ ij }} = ID _{ ij }.
Definition 3
Two parameters e _{ i } and e _{ j } are said to be non-associated, \(e_{i} { \nLeftrightarrow }e_{j}\), if no strongest association exists between them, i.e. SA _{ ij } < λ.
According to the above algorithm, ADFIS first calculates the unknown(s) of the column that has the greatest association among all columns of the whole table. Before proceeding to further prediction, it inserts the recently calculated value(s), obtained through the strongest association, into the incomplete table. In the next step, it recalculates the association among the parameters of the whole table, taking into account the weight of the recently inserted (most reliable) value(s), and again finds the strongest association. The process of finding the strongest association and predicting unknowns is repeated until all unknown data are filled or no association satisfies the threshold. In the case of weak association, ADFIS uses a simple comparison of n _{1} and n _{0} instead of calculating p _{1} and p _{0}.
The main difference between DFIS and ADFIS is that DFIS calculates the association among all parameters only once and decides on that basis, whereas ADFIS recalculates it repeatedly, each time after inserting the unknown value(s) of the one column calculated through the strongest association.
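The iterative procedure described above can be sketched as a loop. This is a minimal illustration built from the prose description, not the authors' Matlab code: in every round the associations are recomputed on the partially filled table, the single strongest usable pair is applied if it reaches λ, and the remaining unknowns are settled by comparing n _{1} with n _{0}. The tie handling, the requirement that the partner value be known, and the names `association` and `adfis_fill` are assumptions made for the example.

```python
# Incomplete Boolean table of Example 2 (Table 2); None marks an unknown entry.
table = [
    [0, 1, 1, 1, 0, 0],        # s1
    [0, 1, 0, 0, 0, 1],        # s2
    [1, 0, 0, 1, 0, 0],        # s3
    [1, 0, None, 0, None, 1],  # s4
    [0, 1, 1, 0, 0, 1],        # s5
    [1, 0, 0, None, 0, 0],     # s6
    [None, 1, 1, 1, 0, 0],     # s7
    [0, 0, 1, 0, 0, 1],        # s8
]


def association(table, i, j):
    """Signed degree: positive if consistent, negative if inconsistent."""
    pairs = [(row[i], row[j]) for row in table
             if row[i] is not None and row[j] is not None]
    if not pairs:
        return 0.0
    cd = sum(a == b for a, b in pairs) / len(pairs)
    return cd if cd >= 1 - cd else -(1 - cd)


def adfis_fill(table, lam=0.85):
    """Fill unknowns, recomputing associations after every insertion round."""
    filled = [row[:] for row in table]
    n = len(filled[0])
    while True:
        # Candidate pairs (i, j): e_i has an unknown whose partner e_j is known.
        candidates = [
            (abs(association(filled, i, j)), i, j)
            for i in range(n) for j in range(n)
            if i != j and any(r[i] is None and r[j] is not None for r in filled)
        ]
        if not candidates:
            break
        strength, i, j = max(candidates, key=lambda c: c[0])
        if strength < lam:
            break  # no remaining association satisfies the threshold
        d = association(filled, i, j)
        for r in filled:
            if r[i] is None and r[j] is not None:
                r[i] = r[j] if d > 0 else 1 - r[j]
    # Weak association: compare n1 and n0 of the known values in the column.
    for i in range(n):
        known = [r[i] for r in filled if r[i] is not None]
        n1 = sum(known)
        n0 = len(known) - n1
        for r in filled:
            if r[i] is None:
                r[i] = 1 if n1 > n0 else 0
    return filled


completed = adfis_fill(table)
for row in completed:
    print(row)
```

Each round fills at least one unknown, so the loop terminates. On the table of Example 2 with λ = 0.85, this sketch fills the unknown of e _{4} first, then e _{1}, then e _{3}, and leaves e _{5} to the n _{1} versus n _{0} comparison.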
ADFIS is further explained for understanding and comparison with DFIS in Example 4 with same incomplete case of Example 2.
Example 4
Prediction of unknowns for incomplete soft set case Example 2 through ADFIS. Consider Example 2 and Table 2, for same case and same threshold value (λ = 0.85).
max{CD _{ ij }, ID _{ ij }} − 1
\(E^{*} /E\) | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} |
---|---|---|---|---|---|---|
e _{1} | – | −0.86 | −0.83 | 0.67 | 0.67 | −0.71 |
e _{3} | −0.83 | 0.71 | – | ±0.5 | −0.57 | 0.57 |
e _{4} | 0.67 | 0.57 | ±0.5 | – | ±0.5 | −1 |
e _{5} | 0.67 | −0.57 | −0.57 | ±0.5 | – | 0.57 |
From Table 8, according to Eq. (7) SA _{46} = 1, for parameter 4 with parameter 6.
Incomplete case after inserting first calculated unknown (\(*_{3}\)) through strongest association
U/E | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} |
---|---|---|---|---|---|---|
s _{1} | 0 | 1 | 1 | 1 | 0 | 0 |
s _{2} | 0 | 1 | 0 | 0 | 0 | 1 |
s _{3} | 1 | 0 | 0 | 1 | 0 | 0 |
s _{4} | 1 | 0 | \(\mathop *\nolimits_{1}\) | 0 | \(\mathop *\nolimits_{2}\) | 1 |
s _{5} | 0 | 1 | 1 | 0 | 0 | 1 |
s _{6} | 1 | 0 | 0 | 1 | 0 | 0 |
s _{7} | \(\mathop *\nolimits_{4}\) | 1 | 1 | 1 | 0 | 0 |
s _{8} | 0 | 0 | 1 | 0 | 0 | 1 |
max{CD _{ ij }, ID _{ ij }} − 2 for updated Table 9
D _{ ij } | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} |
---|---|---|---|---|---|---|
e _{1} | – | −0.86 | −0.83 | 0.71 | 0.67 | −0.71 |
e _{3} | −0.83 | 0.71 | – | −0.57 | −0.57 | 0.57 |
e _{5} | 0.67 | −0.57 | −0.57 | −0.57 | – | 0.57 |
Incomplete case after putting values of 1st and 2nd unknowns \(*_{3}\) and \(*_{4}\)
U/E | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} |
---|---|---|---|---|---|---|
s _{1} | 0 | 1 | 1 | 1 | 0 | 0 |
s _{2} | 0 | 1 | 0 | 0 | 0 | 1 |
s _{3} | 1 | 0 | 0 | 1 | 0 | 0 |
s _{4} | 1 | 0 | \(\mathop *\nolimits_{1}\) | 0 | \(\mathop *\nolimits_{2}\) | 1 |
s _{5} | 0 | 1 | 1 | 0 | 0 | 1 |
s _{6} | 1 | 0 | 0 | 1 | 0 | 0 |
s _{7} | 0 | 1 | 1 | 1 | 0 | 0 |
s _{8} | 0 | 0 | 1 | 0 | 0 | 1 |
Calculation of max{CD _{ ij }, ID _{ ij }} − 3 for updated Table 11
\(E^{*} /E\) | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} |
---|---|---|---|---|---|---|
e _{3} | −0.86 | 0.71 | – | −0.57 | −0.57 | 0.57 |
e _{5} | 0.71 | −0.57 | −0.57 | −0.57 | – | 0.57 |
After putting value of \(*_{1} ,*_{3}\) and \(*_{4}\)
U/E | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} |
---|---|---|---|---|---|---|
s _{1} | 0 | 1 | 1 | 1 | 0 | 0 |
s _{2} | 0 | 1 | 0 | 0 | 0 | 1 |
s _{3} | 1 | 0 | 0 | 1 | 0 | 0 |
s _{4} | 1 | 0 | 0 | 0 | \(\mathop *\nolimits_{2}\) | 1 |
s _{5} | 0 | 1 | 1 | 0 | 0 | 1 |
s _{6} | 1 | 0 | 0 | 1 | 0 | 0 |
s _{7} | 0 | 1 | 1 | 1 | 0 | 0 |
s _{8} | 0 | 0 | 1 | 0 | 0 | 1 |
Calculation of max{CD _{ ij }, ID _{ ij }} − 4 for updated Incomplete Table 13
\(E^{*} /E\) | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} |
---|---|---|---|---|---|---|
e _{5} | 0.71 | −0.57 | −0.57 | −0.57 | – | 0.57 |
Completed soft set using ADFIS
U/E | e _{1} | e _{2} | e _{3} | e _{4} | e _{5} | e _{6} |
---|---|---|---|---|---|---|
s _{1} | 0 | 1 | 1 | 1 | 0 | 0 |
s _{2} | 0 | 1 | 0 | 0 | 0 | 1 |
s _{3} | 1 | 0 | 0 | 1 | 0 | 0 |
s _{4} | 1 | 0 | 0 | 0 | 0 | 1 |
s _{5} | 0 | 1 | 1 | 0 | 0 | 1 |
s _{6} | 1 | 0 | 0 | 1 | 0 | 0 |
s _{7} | 0 | 1 | 1 | 1 | 0 | 0 |
s _{8} | 0 | 0 | 1 | 0 | 0 | 1 |
Results and discussion
In this section we discuss the improvement in accuracy achieved by ADFIS. Firstly, we discuss our incomplete case in Example 2 with the prediction results of DFIS and ADFIS from Table 5 and Table 15, respectively. Then, we present the results obtained from DFIS and ADFIS for four UCI benchmark data sets and the Causality workbench LUCAP2 data set. Some important discussions are provided after the results, and the shortcomings of ADFIS are discussed at the end of this section.
Incomplete soft set of Example 2
Comparison of DFIS and ADFIS predicted values for incomplete case of Example 2
Unknown | DFIS value | DFIS using | ADFIS value | ADFIS using |
---|---|---|---|---|
\(\mathop *\nolimits_{1}\) | 1 | Probability | 0 | Association |
\(\mathop *\nolimits_{2}\) | 0 | Probability | 0 | Probability |
\(\mathop *\nolimits_{3}\) | 1 | Association | 1 | Association |
\(\mathop *\nolimits_{4}\) | 0 | Association | 0 | Association |
UCI benchmark data sets
Similar to DFIS (Qin et al. 2012), we tested DFIS and ADFIS for four data sets from UCI benchmark database (UCI Machine Learning Repository 2013).
Zoo data set
The average accuracy of DFIS is 81.26 %, while that of ADFIS is 84.67 %, i.e. ADFIS is 3.41 % more accurate than DFIS for the Zoo data set.
Flags data set
SPECT hearts data set
The average accuracy of DFIS is 76.41 %, while that of ADFIS is 78.20 %. Hence ADFIS performs 1.79 % better than DFIS for the SPECT hearts data set.
Congressional votes data set
This data set contains the voting records of US Congress members from 1984. 435 members cast yes or no votes on 16 issues, of which the votes of only 230 members are complete. We selected only these complete records for testing and randomly deleted 161, 435, 122, 98, 263, 239, 205, 291, 424 and 136 values from this data set. After recalculating the deleted values through both approaches, we found that the average accuracy of DFIS is 65.50 %, while ADFIS has 72.98 % accuracy.
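The test protocol described above (delete a number of known values at random, refill them, compare with the originals) can be sketched as a small harness. Everything in it is illustrative: the toy table is synthetic rather than the votes data, `majority_fill` is a stand-in filler that is neither DFIS nor ADFIS, and accuracy is assumed to be the share of deleted cells that are restored correctly.

```python
import random


def delete_random(table, n_missing, seed=0):
    """Return a copy with n_missing random cells set to None, plus their positions."""
    rng = random.Random(seed)
    cells = [(r, c) for r in range(len(table)) for c in range(len(table[0]))]
    removed = rng.sample(cells, n_missing)
    incomplete = [row[:] for row in table]
    for r, c in removed:
        incomplete[r][c] = None
    return incomplete, removed


def majority_fill(table):
    """Stand-in filler: each unknown gets the majority value of its column."""
    filled = [row[:] for row in table]
    for c in range(len(table[0])):
        known = [row[c] for row in table if row[c] is not None]
        n1 = sum(known)
        guess = 1 if n1 > len(known) - n1 else 0
        for row in filled:
            if row[c] is None:
                row[c] = guess
    return filled


def accuracy(original, completed, removed):
    """Percentage of deleted cells whose predicted value equals the original."""
    hits = sum(original[r][c] == completed[r][c] for r, c in removed)
    return 100.0 * hits / len(removed)


# Synthetic complete Boolean table used only to exercise the harness.
toy = [[(r + c) % 2 for c in range(4)] for r in range(6)]
incomplete, removed = delete_random(toy, n_missing=5, seed=1)
completed = majority_fill(incomplete)
score = accuracy(toy, completed, removed)
print(len(removed), score)
```

Any filling function with the same signature as `majority_fill` can be passed through the same three steps, and averaging `score` over several deletion counts would give figures of the kind reported in this section.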
Causality workbench LUCAP2 data set
Overall accuracy comparison
Data sets | DFIS (%) | ADFIS (%) | Improvement (%) |
---|---|---|---|
Example 2 | 75.00 | 83.00 | 8.00 |
Zoo data set | 81.26 | 84.67 | 3.41 |
Flags data set | 74.02 | 78.10 | 4.08 |
SPECT hearts data set | 76.41 | 78.20 | 1.79 |
Congressional votes data set | 65.50 | 72.98 | 7.48 |
LUCAP2 data set | 71.61 | 73.49 | 1.89 |
Average | | | 4.44 |
From Table 17, we can conclude that ADFIS performs on average 4.44 % better than DFIS.
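The improvement column and its average can be recomputed from the two accuracy columns of Table 17. The snippet below is a small arithmetic check, assuming that the improvement is the plain difference ADFIS minus DFIS; a recomputed difference may deviate from the listed one in the last digit because the listed accuracies are already rounded.

```python
# (DFIS %, ADFIS %) accuracies as listed in Table 17.
results = {
    "Example 2": (75.00, 83.00),
    "Zoo": (81.26, 84.67),
    "Flags": (74.02, 78.10),
    "SPECT hearts": (76.41, 78.20),
    "Congressional votes": (65.50, 72.98),
    "LUCAP2": (71.61, 73.49),
}

# Improvement assumed to be the difference in accuracy (percentage points).
improvement = {name: round(adfis - dfis, 2)
               for name, (dfis, adfis) in results.items()}
average = round(sum(improvement.values()) / len(improvement), 2)

print(improvement)
print(average)
```

The average is taken over all six rows of the table, including Example 2.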
Discussions
In this sub-section we discuss some important queries that are raised regarding the threshold (λ), its function, range and suitable values. We also discuss the precise theoretical difference between DFIS and ADFIS, validation of proposed method and performance evaluation.
The threshold lambda (λ) is a filter that can be set according to individual requirements for accepting weak or strong associations. The closer the value of λ is to 1, the more reliable the selected associations; the closer it is to zero, the weaker the associations that may be selected. To select associations with more than 50 % agreement, lambda must be fixed at 0.5 or above. In our incomplete case of Example 2, we kept the threshold at λ = 0.85 to select only parameter associations with at least 85 % similarity; the unknowns of parameters with less than 85 % similarity are calculated through probability in DFIS, while one of them (\(\mathop *\nolimits_{1}\)) enters the threshold range in the ADFIS case. This reveals the core difference between DFIS and ADFIS. DFIS calculates all associations once for the whole data set and assigns missing values accordingly. We notice that the parameters satisfying the threshold can be further categorized into weaker and stronger associations in the range between the threshold and 1. Two parameters might have a marginal similarity of 85 %, while another pair may have a stronger similarity of 90 % or even 100 %. DFIS treats them all alike when finding missing values, while we calculate the unknown first through the strongest of them and use it in the subsequent calculations. In this way, some of the unknowns that would otherwise be calculated through probability enter the association range and are likely to be predicted more accurately, as calculating unknowns through association is more reliable than through probability (Qin et al. 2012). The results of DFIS were validated by calculating its decision values and comparing its MAPE with that of the Zou et al. approach. As the Zou et al. approach does not calculate missing values, DFIS used an indirect method of validation. In our case, both DFIS and ADFIS calculate the actual missing values, so we do not need to validate them through indirect decision values. Instead, we use the direct method of comparing the actual results of both techniques with the original values, and the higher accuracy of ADFIS validates its better performance.
Weaknesses of the ADFIS
Apart from its improved accuracy, ADFIS has two main limitations compared with DFIS.
Incorrect results in rare cases
Sometimes the strongest association is false because too many values are missing or no real association exists. In this case, if the missing values calculated in the first step of ADFIS are incorrect, they also affect the values calculated in subsequent steps. This can be seen in the 2nd and 9th test results of the SPECT Hearts data set graph, where DFIS has higher accuracy than ADFIS.
High computational complexity
The computational complexity of ADFIS is clearly higher than that of DFIS. DFIS accesses a data set of size m × n once to find associations, while ADFIS may access it up to (m × n)^{2} times during its execution. The complexity of ADFIS is therefore a multiple of that of DFIS.
Conclusion
In this paper, we have discussed three previous approaches for predicting missing data in incomplete soft sets and identified DFIS as the most suitable among them. We have presented an alternative data filling approach for incomplete soft sets (ADFIS) to improve accuracy. We have re-arranged the process of DFIS so that the maximum possible number of unknowns in an incomplete soft set can be predicted through association between parameters. We have presented a modified algorithm and explained ADFIS with the help of an example as a proof of concept. We have also compared the results of ADFIS with the existing DFIS approach after implementing both in Matlab for four UCI benchmark data sets and the Causality workbench lung cancer data set (LUCAP2), and shared the average results of both approaches in the form of graphs. ADFIS improved the accuracy of predicted unknowns by 4.44 % on average compared with DFIS, across Example 2 and the five data sets. We mentioned two main drawbacks of ADFIS, i.e. wrong value predictions in rare cases and high computational complexity, which can be addressed in future work.
Declarations
Authors’ contributions
MSK, MAA, AWAW and TH designed experiments and analyzed results. MSK and MAA performed experiments, prepared figures and wrote manuscript. All authors read and approved the final manuscript.
Acknowledgements
This work is supported by University of Malaya Research Grant No. RP03615AET.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
- Cagman N, Enginoglu S (2012) Fuzzy soft matrix theory and its application in decision making. Iran J Fuzzy Syst 9(1):109–119
- Cagman N, Enginoglu S, Citak F (2011) Fuzzy soft set theory and its applications. Iran J Fuzzy Syst 8(3):137–147
- Causality Workbench (2013) http://www.causality.inf.ethz.ch/challenge.php?page=datasets. Accessed 5 Dec 2015
- Çelik Y, Yamak S (2013) Fuzzy soft set theory applied to medical diagnosis using fuzzy arithmetic operations. J Inequal Appl 2013(1):1–9
- Chen D, Tsang E, Yeung DS, Wang X (2005) The parameterization reduction of soft sets and its applications. Comput Math Appl 49(5):757–763
- Herawan T, Deris MM (2011) A soft set approach for association rules mining. Knowl Based Syst 24(1):186–195
- Jun YB, Park CH (2008) Applications of soft sets in ideal theory of BCK/BCI-algebras. Inf Sci 178(11):2466–2475
- Jun YB, Lee KJ, Park CH (2009) Soft set theory applied to ideals in d-algebras. Comput Math Appl 57(3):367–378
- Kalaichelvi A, Malini PH (2011) Application of fuzzy soft sets to investment decision making problem. Intern J Math Sci Appl 1(3):1583–1586
- Kalayathankal SJ, Singh GS (2010) A fuzzy soft flood alarm model. Math Comput Simul 80(5):887–893
- Kong Z, Gao L, Wang L, Li S (2008) The normal parameter reduction of soft sets and its algorithm. Comput Math Appl 56(12):3029–3037
- Kong Z, Zhang G, Wang L, Wu Z, Qi S, Wang H (2014) An efficient decision making approach in incomplete soft set. Appl Math Model 38(7):2141–2150
- Ma X, Sulaiman N, Qin H, Herawan T, Zain JM (2011) A new efficient normal parameter reduction algorithm of soft sets. Comput Math Appl 62(2):588–598
- Maji P, Roy AR, Biswas R (2002) An application of soft sets in a decision making problem. Comput Math Appl 44(8):1077–1083
- Mohd Rose AN, Hassan H, Awang MI, Mahiddin NA, Mohd Amin H, Deris MM (2011) Solving incomplete datasets in soft set using supported sets and aggregate values. Procedia Comput Sci 5:354–361
- Molodtsov D (1999) Soft set theory—first results. Comput Math Appl 37(4):19–31
- Qin H, Ma X, Herawan T, Zain JM (2012) DFIS: a novel data filling approach for an incomplete soft set. Int J Appl Math Comput Sci 22(4):817–828
- Rose ANM, Hassan H, Awang MI, Herawan T, Deris MM (2011) Solving incomplete datasets in soft set using parity bits of supported sets. In: Ubiquitous computing and multimedia applications. Springer, Berlin, pp 33–43
- Tanay B, Kandemir MB (2011) Topological structure of fuzzy soft sets. Comput Math Appl 61(10):2952–2957
- UCI Machine Learning Repository (2013) https://archive.ics.uci.edu/ml/datasets.html. Accessed 5 Dec 2015
- Xiao Z, Gong K, Zou Y (2009) A combined forecasting approach based on fuzzy soft sets. J Comput Appl Math 228(1):326–333
- Yuksel S, Dizman T, Yildizdan G, Sert U (2013) Application of soft sets to diagnose the prostate cancer risk. J Inequal Appl 2013(1):1–11
- Zou Y, Xiao Z (2008) Data analysis approaches of soft sets under incomplete information. Knowl Based Syst 21(8):941–945