An alternative data filling approach for prediction of missing data in soft sets (ADFIS)

Soft set theory is a mathematical approach for dealing with uncertain data. A standard soft set can be represented as a Boolean-valued information system, and hence it has been used in hundreds of useful applications. However, these applications become worthless if the Boolean information system contains missing data due to error, security restrictions or mishandling. Few studies have focused on handling partially incomplete soft sets, and none of them achieves a high accuracy rate when predicting missing data. It has been shown that the data filling approach for incomplete soft sets (DFIS) has the best performance among all previous approaches. However, on review, accuracy is still the main problem of DFIS. In this paper, we propose an alternative data filling approach for prediction of missing data in soft sets, namely ADFIS. The novelty of ADFIS is that, unlike the previous approach, which used probability, we focus more on the reliability of association among parameters in the soft set. Experimental results on a small example, four UCI benchmark data sets and the Causality Workbench lung cancer (LUCAP2) data set show that ADFIS achieves better accuracy than DFIS.

2012; Cagman et al. 2011; Çelik and Yamak 2013; Herawan and Deris 2011; Jun et al. 2009; Jun and Park 2008; Kalaichelvi and Malini 2011; Kalayathankal and Singh 2010; Tanay and Kandemir 2011; Xiao et al. 2009; Yuksel et al. 2013). In some applications, however, researchers have faced incomplete soft sets with partially missing values. Data in soft sets and related sets can go missing for many reasons, such as improper entry, viral attack, security restrictions and errors during data transfer. An incomplete soft set can no longer be applied in an application or, if still applied, may yield extremely large, very small, unexpected or misleading results. Such results, especially a wrong decision, can cause a huge loss to an individual or an organization. To cope with this situation, Zou et al. presented a weighted-average technique for calculating decision values in incomplete soft sets and an average-probability technique for predicting missing values in incomplete fuzzy soft sets (Zou and Xiao 2008). Qin et al. proposed DFIS, which showed that prediction in an incomplete soft set is more reliable and accurate when missing values are recalculated through the association between parameters; they used simple probability for cases with zero or weak association (Qin et al. 2012). Rose et al. also contributed to the completion of incomplete soft sets using parity bits and aggregate values (Mohd Rose et al. 2011; Rose et al. 2011). Subsequently, Kong et al. (2014) improved the approach of Zou et al. (Zou and Xiao 2008) by presenting an equivalent probability technique that has lower complexity and determines the actual missing data rather than only the decision values. However, on review, the approach of Kong et al. still suffers from inherited shortcomings and has low accuracy compared to DFIS.
In this paper, we compare all existing approaches in terms of accuracy and computational complexity and find DFIS to be the most suitable among them for predicting missing values in an incomplete soft set. We then propose an alternative data filling approach for prediction of missing data in soft sets. In summary, the contributions of this work are as follows: (a) We propose an alternative data filling approach for prediction of missing data in soft sets (ADFIS). The novelty of ADFIS is that, unlike the previous approach, which used probability, we focus more on the reliability of association between parameters. (b) In contrast to DFIS, we revise the association calculation procedure so that the maximum possible number of unknowns is predicted through association. (c) To validate our work, we perform extensive experiments on four UCI benchmark data sets and the Causality Workbench lung cancer (LUCAP2) data set to show the performance of ADFIS. (d) We compare the results with the baseline approaches mentioned in the literature.

Soft set
Let U be an initial non-empty universal set and E be a set of parameters related to U. According to Molodtsov (1999), a pair (F, E) is called a soft set over U if and only if F is a mapping from E into the set of all subsets of U. The following example illustrates a soft set.
Example 1 Suppose $U = \{h_1, h_2, h_3, h_4, h_5\}$ is a set of houses and $E = \{e_1, e_2, e_3, e_4, e_5, e_6\}$ is the set of parameters describing each house, where the members of E stand for cheap, new, wooden, expensive, old and beautiful houses, respectively. Let the cheap houses be $h_1, h_3, h_5$, the new houses be $h_1, h_2, h_3, h_4$, the wooden houses be $h_2, h_3, h_4$, the expensive houses be $h_2, h_4$, the old house be $h_5$ and the beautiful houses be $h_1, h_2, h_4, h_5$. Here, the pair (F, E) describing the attractiveness of the houses is a soft set given by

$$F(e_1) = \{h_1, h_3, h_5\},\quad F(e_2) = \{h_1, h_2, h_3, h_4\},\quad F(e_3) = \{h_2, h_3, h_4\},$$
$$F(e_4) = \{h_2, h_4\},\quad F(e_5) = \{h_5\},\quad F(e_6) = \{h_1, h_2, h_4, h_5\}.$$

Representation of soft set in tabular form
If U is a finite non-empty set of objects, AT is a non-empty finite set of attributes, $V = \bigcup_{r \in AT} V_r$ where $V_r$ is the value domain of attribute r, and $f : U \times AT \to V$ is an information function, then the quadruple $S = (U, AT, V, f)$ is called an information system (Ma et al. 2011). The soft set (F, E) in Example 1 is represented in Table 1, i.e. as a Boolean information system.
In Table 1, the objects are represented by rows and the parameters by columns. An entry is 1 if the parameter applies to the object and 0 otherwise. In soft set-based decision making, the decision value (choice value) of each house for a decision maker, Mr. Gul, is given by

$$d_i = \sum_{j=1}^{m} h_{ij},$$

where $h_{ij}$ is the entry for object i and parameter j, m is the number of parameters, and the optimal choice is the object with $\max(d_i)$.
From Table 1, the maximum decision value is 4, attained by both houses $h_2$ and $h_4$. Hence, either $h_2$ or $h_4$ can be his optimal choice, while the other houses are sub-optimal options. In the following section, we discuss the incomplete soft set.
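The tabular representation and the decision values of Example 1 can be reproduced with a few lines of Python. This is only an illustrative sketch: the names `to_boolean_table` and `decision_values` are hypothetical helpers introduced here, not part of the original work.

```python
# Soft set of Example 1: each parameter maps to the set of houses it applies to.
U = ["h1", "h2", "h3", "h4", "h5"]
F = {
    "cheap": {"h1", "h3", "h5"},
    "new": {"h1", "h2", "h3", "h4"},
    "wooden": {"h2", "h3", "h4"},
    "expensive": {"h2", "h4"},
    "old": {"h5"},
    "beautiful": {"h1", "h2", "h4", "h5"},
}


def to_boolean_table(universe, soft_set):
    """Return {object: [0/1 per parameter]} in the parameter order of soft_set."""
    return {x: [int(x in members) for members in soft_set.values()] for x in universe}


def decision_values(table):
    """Decision value d_i: the sum of the Boolean entries in row i."""
    return {x: sum(row) for x, row in table.items()}


table = to_boolean_table(U, F)
d = decision_values(table)
best = max(d.values())
optimal = sorted(x for x, v in d.items() if v == best)
print(d)        # decision value of every house
print(optimal)  # houses attaining the maximum decision value
```

The maximum decision value is shared by two houses, which matches the statement in the text that either of them is an optimal choice.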

Incomplete soft set
is not known, where $U = \{x_1, x_2, \ldots, x_n\}$, $AT = \{a_1, a_2, \ldots, a_m\}$, $x_i \in U$ for $i = 1, 2, \ldots, n$ and $a_j \in AT$ for $j = 1, 2, \ldots, m$. Unknown entries in the table are represented by the symbol "*". The following example illustrates an incomplete information system representing an incomplete soft set.
Example 2 Suppose $U = \{s_1, s_2, s_3, \ldots, s_8\}$ is a set of applicants and $E = \{e_1, e_2, e_3, e_4, e_5, e_6\}$ is a set of parameters representing "young age", "experienced", "married", "the highest academic degree is Master", "studied abroad" and "the highest academic degree is Doctor", respectively. The corresponding soft set is presented as a Boolean-valued information system in Table 2.
From the incomplete Boolean Table 2, we know that candidate 4 is young, inexperienced and holds a Ph.D. as his highest degree, but it is unknown whether he is married and whether he studied abroad. Similarly, the values of "the highest academic degree is Master" for candidate 6 and "young age" for candidate 7 are unknown. Hence it is an incomplete soft set with unknown values represented by $*_1$, $*_2$, $*_3$ and $*_4$.

Related works
In this section, we discuss three previous soft set-based approaches for handling incomplete data. We first review these techniques one by one and then compare them to identify the most appropriate one for predicting missing data in soft sets.

Zou et al. approach
The approach of Zou et al. (Zou and Xiao 2008) uses a weighted-average technique to calculate the decision values of an incomplete soft set, while the missing data of an incomplete fuzzy soft set is predicted through average probability. Here, in relation to our work, we discuss the soft set case only. According to this approach,

$$d_i = \sum_{i=1}^{m} k_i c_i,$$

where $d_i$ is the required decision value, $c_i$ is a choice value, m is the maximum number of choice values for the object having missing values, and $k_i$ is the weight of choice value $c_i$. For one missing value, the object has only two choice values (the missing entry is either 0 or 1), hence the respective weights are

$$k_1 = \frac{n_0}{n_0 + n_1} = q_{e_i}, \qquad k_2 = \frac{n_1}{n_1 + n_0} = p_{e_i},$$

where $n_1$ and $n_0$ are the numbers of 1s and 0s of the parameter with the missing value. For more than one missing value (t missing values) in the same object, the number of choice values increases and the respective weights are calculated as described in Zou and Xiao (2008), where x is the number of 1s in the row, while $E^*_1$ and $E^*_0$ are its parameter sets for the values 1 and 0, respectively. Using this approach, the decision values, in terms of the candidates' eligibility, for the incomplete Table 2 are calculated as explained in Zou and Xiao (2008) and given in Table 3.

Qin et al. approach
The approach proposed by Qin et al. (Qin et al. 2012) prefers to predict missing values through the association between parameters. This association is considered the first case of their approach. For instance, in Example 1, an old house cannot be new and a cheap house cannot be expensive; these are inconsistent associations. Similarly, in the same example, a beautiful house is most probably expensive, which is a consistent association. In Example 2, the highest degree can be either master's or doctoral but not both, indicating an inconsistent association.
The technique is described mathematically as follows. The consistent association between two parameters is found by

$$CN_{ij} = \left|\{x \mid F_{e_i}(x) = F_{e_j}(x),\ x \in U_{ij}\}\right|, \tag{1}$$

where $CN_{ij}$ is the number of objects for which parameter (column) i has the same value as parameter (column) j. The consistent association degree is calculated by

$$CD_{ij} = \frac{CN_{ij}}{|U_{ij}|}, \tag{2}$$

where $|U_{ij}|$ is the cardinality of $U_{ij}$, the set of objects whose values are known for both parameters i and j; i.e. $CD_{ij}$ is the ratio of the consistency count to the total number of known pairs in columns i and j. Similarly, the inconsistent association is found as

$$IN_{ij} = \left|\{x \mid F_{e_i}(x) \neq F_{e_j}(x),\ x \in U_{ij}\}\right|, \tag{3}$$

and the inconsistent association degree is calculated by

$$ID_{ij} = \frac{IN_{ij}}{|U_{ij}|}. \tag{4}$$

To determine whether the association is consistent or inconsistent, the net association degree $D_{ij}$ is obtained from $CD_{ij}$ and $ID_{ij}$ by Eq. (5). To find the two parameters having maximum association with each other, the maximal association degree is obtained among the set of all association degrees by

$$|D_i| = \max_{j} |D_{ij}|. \tag{6}$$

As a result, the unknown value $F_{e_i}(x)$ is predicted to be the same as the corresponding element of parameter j (0 for 0 and 1 for 1) if the association is consistent; otherwise, for an inconsistent association, it is predicted as the complement of the corresponding element of parameter j.
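The counts and degrees of Eqs. (1)-(4) can be sketched in Python as below. The function names are hypothetical, `None` is an assumed encoding for an unknown entry, and the two columns are hypothetical values chosen only so that they reproduce the counts quoted for $e_1$ and $e_2$ in Example 3 (one matching object, six differing objects, one missing entry); they are not copied from Table 2.

```python
def association_counts(col_i, col_j):
    """Return (CN, IN, |U|) for two Boolean columns; None marks an unknown entry."""
    # Only objects whose values are known in BOTH columns form a usable pair.
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a is not None and b is not None]
    cn = sum(1 for a, b in pairs if a == b)
    return cn, len(pairs) - cn, len(pairs)


def association_degrees(col_i, col_j):
    """Return (CD, ID), the consistent and inconsistent association degrees."""
    cn, inc, u = association_counts(col_i, col_j)
    if u == 0:
        return 0.0, 0.0  # assumption: no known pairs means no association
    return cn / u, inc / u


# Hypothetical columns for e1 and e2 (objects s1..s8), s7 unknown in e1.
e1 = [1, 1, 1, 1, 1, 1, None, 0]
e2 = [0, 0, 0, 0, 0, 0, 1, 0]

cn, inc, u = association_counts(e1, e2)
cd, idg = association_degrees(e1, e2)
print(cn, inc, u)
print(round(cd, 2), round(idg, 2))
```

With these columns the inconsistent degree is 6/7, which rounds to the magnitude 0.86 quoted for $e_1$ in the worked example.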
The second case applies when there is only a weak association between parameters, i.e. $|D_i| < \lambda$, where λ is a pre-set threshold value. The probabilities of one and zero are then calculated as

$$p_1 = \frac{n_1}{n_1 + n_0}, \qquad p_0 = \frac{n_0}{n_1 + n_0},$$

where $n_1$ and $n_0$ are the numbers of 1s and 0s, respectively, for the parameter having missing data. As a result, the missing value is set to 1 if $p_1 > p_0$, to 0 if $p_1 < p_0$, and to either 1 or 0 if $p_1 = p_0$. The following example explains the DFIS approach step by step.
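A minimal sketch of this weak-association rule follows. The function name is hypothetical and `None` is an assumed encoding for an unknown entry. The demo column mirrors what the text later states for $e_5$ (seven known values, all 0, with the entry of $s_4$ unknown). For a tie the rule allows either value; this sketch arbitrarily picks 0.

```python
def fill_weak_association(column):
    """Fill the None entries of one Boolean column by comparing p1 and p0."""
    known = [v for v in column if v is not None]
    n1 = sum(known)
    n0 = len(known) - n1
    if n1 + n0 == 0:
        raise ValueError("no known values in this column")
    p1 = n1 / (n1 + n0)
    p0 = n0 / (n1 + n0)
    guess = 1 if p1 > p0 else 0  # tie: either value is allowed; 0 is chosen here
    return [guess if v is None else v for v in column], (p1, p0)


e5 = [0, 0, 0, None, 0, 0, 0, 0]
filled, (p1, p0) = fill_weak_association(e5)
print(filled, p1, p0)
```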
Example 3 Predicting the missing values of Example 2 through DFIS. Here the parameters $e_1$, $e_3$, $e_4$ and $e_5$ have missing data.
Step 1 Finding the consistency $CN_{ij}$ and inconsistency $IN_{ij}$.
First we consider parameter 1 with parameter 2. Only $s_8$ has the same value (0) for both $e_1$ and $e_2$; therefore $CN_{12} = 1$. The values differ for the other six objects, excluding $s_7$, whose $e_1$ value is missing; therefore $IN_{12} = 6$. The remaining pairs are treated similarly.
Step 2 Calculating the ratio of consistency $CD_{ij}$ and the ratio of inconsistency $ID_{ij}$.
First we need the cardinality $|U_{ij}|$ for calculating $CD_{ij}$ and $ID_{ij}$. As parameters 1 and 2 have seven complete pairs, i.e. all objects except $s_7$, we have $|U_{12}| = 7$. Similarly, $|U_{13}| = |U_{14}| = |U_{15}| = 6$ and $|U_{16}| = 7$.
Step 3 Deciding whether the association is consistent or inconsistent.
Step 4 Calculating the maximal degree of association.
Step 5 Putting values according to association.
We set the threshold λ = 0.85. Only $e_1$ and $e_4$ satisfy the condition to be calculated by association, because $|D_1| = |-0.86| > \lambda$ and $|D_4| = |-1| > \lambda$. From Table 4, $e_1$ has an inconsistent association with $e_2$, and the element ($u_{72}$) corresponding to its missing element ($*_4 = u_{71}$) has the value 1 in Table 2. As the complement is assigned in the case of an inconsistent association, we put $*_4 = 0$. Similarly, we calculate $*_3 = 1$.
Step 6 Calculating probabilities for weak association.

Kong et al. approach
The approach proposed by Kong et al. (2014) is equivalent to the approach of Zou et al. (Zou and Xiao 2008) in its results but is simpler with respect to complexity. Instead of the heavy weighted-average computations, it uses the simple probability $p'_{e_j} = \frac{n_1}{n_1 + n_0}$ for calculating an unknown value, where $n_1$ and $n_0$ are the numbers of 1s and 0s, respectively, for the same parameter. After inserting this value in place of the unknown, the decision value is calculated by $d_i = \sum_{j=1}^{m} h_{ij}$. Using this technique, the incomplete soft set of Example 2 is completed as given in Table 6 along with the decision values $d_i$.
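The following sketch illustrates this simple-probability filling on a small hypothetical table (not Table 2). The function names are hypothetical and `None` is an assumed encoding for an unknown entry. The helper `zou_one_missing` reflects our reading of the single-missing-value case of the weighted average described above (weight q for the choice value with the entry set to 0, weight p for the choice value with the entry set to 1); it is included only to show why the two decision values coincide in that case.

```python
def kong_fill(table):
    """Replace each None by p' = n1 / (n1 + n0) of its column; return table and d_i."""
    n_cols = len(table[0])
    probs = []
    for j in range(n_cols):
        known = [row[j] for row in table if row[j] is not None]
        if not known:
            raise ValueError("column %d has no known values" % j)
        probs.append(sum(known) / len(known))
    filled = [[probs[j] if v is None else v for j, v in enumerate(row)] for row in table]
    return filled, [sum(row) for row in filled]


def zou_one_missing(known_sum, p):
    """Weighted average of the two choice values when exactly one entry is missing."""
    q = 1 - p
    return q * known_sum + p * (known_sum + 1)


# Hypothetical 3 x 3 incomplete Boolean table.
toy = [
    [1, 0, 1],
    [None, 1, 0],
    [0, 1, None],
]
filled, d = kong_fill(toy)
print(filled)
print(d)
print(zou_one_missing(1, 0.5))  # second row: known entries sum to 1, p = 0.5
```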

Comparison of previous approaches
The approaches of Zou et al. and Kong et al. give approximately the same results, and the approach of Zou et al. has already been compared with DFIS in detail (Kong et al. 2014). To conclude, we adopt the associative way below for comparing all three previous techniques. The operations of DFIS are:
1. Access the whole data set of size m × n once to get the number of missing values
2. Compute the degrees of consistency and inconsistency, with complexity n
3. Compute the probability, with complexity n, when the association is weak
4. Access the m × n table once again to insert the computed values
Combining all of these results in $m \times n + n + n + m \times n = 2mn + 2n$ operations. Supposing m = n, this equals $2n^2 + 2n$, which is dominated by the $n^2$ term for larger values of n. Hence, the complexity of DFIS is $O(n^2)$, which is equal to the complexity of the approach of Kong et al. Therefore, DFIS is the most appropriate for missing data prediction in soft sets among all three previous approaches. This comparison is summarized in Table 7 as follows:

Zou et al. versus Kong et al
Hence, from the associative comparison summarized in Table 7, we conclude that DFIS is more suitable than the approaches of Zou et al. and Kong et al. for the prediction of missing values in soft sets. However, on review, accuracy is still the main problem of DFIS. Therefore, the following section presents an alternative data filling approach for prediction of missing data in soft sets, namely ADFIS.

Alternative approach for data filling of incomplete soft sets
In this section, an alternative approach for data filling of incomplete soft sets (ADFIS) is presented. The previous approach, DFIS, prefers association between parameters over probability for predicting missing values, and we have discussed that association yields more accurate values than probability. However, DFIS itself does not consider all possible associations precisely enough to obtain the most accurate results. In contrast to DFIS, we revise the association calculation method to consider all possible associations precisely and to predict the maximum possible number of unknowns through association. The novelty of ADFIS is that it focuses more on the reliability of association than DFIS does.
For ADFIS, we use Eqs. (1)-(4) to calculate the consistent and inconsistent associations and their degrees, as in DFIS. In DFIS, for n parameters containing missing values, Eq. (5) gives n values of $D_{ij}$, and Eq. (6) is applied separately to each parameter i to find its maximal degree with a parameter j. Eqs. (5) and (6) are therefore not applied directly in ADFIS. To select a single value as the strongest association among all parameters, we use the relation

$$SA_{ij} = \max_{i,j}\left\{\max\{CD_{ij}, ID_{ij}\}\right\}, \tag{7}$$

where $CD_{ij}$ and $ID_{ij}$ are the degrees of consistency and inconsistency of each parameter i containing missing values with all other parameters j, and $SA_{ij}$ is the strongest association among all parameters, attained between parameter i (containing an unknown) and the corresponding parameter j. The following definition presents the notion of consistency between two parameters.
Definition 1 Two parameters $e_i$ and $e_j$ are said to be consistent with each other, written $e_i \Leftrightarrow e_j$, if the strongest association holds between them and it is consistent, i.e. $SA_{ij} \ge \lambda$ and $\max\{CD_{ij}, ID_{ij}\} = CD_{ij}$, where λ is a pre-set threshold value (for more details, see "Discussions").
From Definition 1, it can be seen that if two parameters are consistent with each other, then their corresponding elements are also consistent with each other: if $e_i \Leftrightarrow e_j$ then $F(e)_{ni} \Leftrightarrow F(e)_{nj}$, and if $F(e)_{ni} = *$ then

$$F(e)_{ni} = F(e)_{nj},$$

where * is an unknown and n is the object position (row) of the parameter value F(e). The following definition presents the notion of inconsistency between two parameters.
Definition 2 Two parameters $e_i$ and $e_j$ are said to be inconsistent with each other, written $e_i \Rrightarrow e_j$, if the strongest association holds between them and it is inconsistent, i.e. $SA_{ij} \ge \lambda$ and $\max\{CD_{ij}, ID_{ij}\} = ID_{ij}$.
From Definition 2, it can be seen that if two parameters are inconsistent with each other, then their corresponding elements are also inconsistent with each other: if $e_i \Rrightarrow e_j$ then $F(e)_{ni} \Rrightarrow F(e)_{nj}$, and if $F(e)_{ni} = *$ then

$$F(e)_{ni} = 1 - F(e)_{nj},$$

i.e. the complement of the corresponding element, where * is an unknown and n is the object position (row) of the parameter value F(e). The following definition presents the notion of non-association between two parameters.

Definition 3
Two parameters $e_i$ and $e_j$ are said to be non-associated if no sufficiently strong association exists between them, i.e. $SA_{ij} < \lambda$.
Khan et al. SpringerPlus (2016) 5:1348
From Definitions 1-3, we derive our proposed ADFIS algorithm as described below.
According to the algorithm, ADFIS first calculates the unknown(s) of the column that has the strongest association among all columns of the whole table. Before proceeding to further predictions, it inserts the newly calculated value(s), obtained through the strongest association, into the incomplete table. In the next step, it recalculates the associations among the parameters of the whole table, now taking into account the weight of the newly inserted (most reliable) value(s), and finds the strongest association again. The process of finding the strongest association and predicting unknowns is repeated until all unknown data are filled or no association satisfies the threshold. In the case of weak association, ADFIS uses a simple comparison of $n_1$ and $n_0$ instead of calculating $p_1$ and $p_0$.
The main difference between DFIS and ADFIS is that DFIS calculates the associations among all parameters only once and decides on that basis, whereas ADFIS recalculates them each time after inserting the unknown values of the column that was filled through the strongest association.
To aid understanding and comparison with DFIS, ADFIS is explained further in Example 4, using the same incomplete case as Example 2.
Example 4 Prediction of the unknowns of the incomplete soft set of Example 2 through ADFIS. Consider Example 2 and Table 2, with the same case and the same threshold value (λ = 0.85).
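The iterative procedure described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' Matlab implementation: `None` encodes an unknown entry, the function names are hypothetical, ties between equally strong associations are broken by scan order, a tie between $n_1$ and $n_0$ is resolved to 0, and candidate partner columns that cannot fill any unknown of the column under consideration are skipped. The demo table is hypothetical.

```python
def degrees(col_i, col_j):
    """Return (CD, ID) over the rows where both entries are known."""
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a is not None and b is not None]
    if not pairs:
        return 0.0, 0.0
    cn = sum(1 for a, b in pairs if a == b)
    return cn / len(pairs), (len(pairs) - cn) / len(pairs)


def adfis_fill(table, lam=0.85):
    """Fill None entries of a Boolean table (list of rows) and return a copy."""
    t = [list(row) for row in table]
    n_cols = len(t[0])

    def column(j):
        return [row[j] for row in t]

    while True:
        best = None  # (strength, i, j, is_consistent)
        for i in range(n_cols):
            ci = column(i)
            if None not in ci:
                continue
            for j in range(n_cols):
                if j == i:
                    continue
                cj = column(j)
                # Assumption: skip partners that cannot fill any unknown of column i.
                if not any(a is None and b is not None for a, b in zip(ci, cj)):
                    continue
                cd, idg = degrees(ci, cj)
                strength = max(cd, idg)
                if best is None or strength > best[0]:
                    best = (strength, i, j, cd >= idg)
        if best is None or best[0] < lam:
            break  # no remaining association reaches the threshold
        _, i, j, is_consistent = best
        # Insert the values obtained through the strongest association,
        # then loop so that all associations are recalculated.
        for row in t:
            if row[i] is None and row[j] is not None:
                row[i] = row[j] if is_consistent else 1 - row[j]

    # Weak association: compare n1 and n0 directly (a tie goes to 0 here).
    for i in range(n_cols):
        known = [row[i] for row in t if row[i] is not None]
        n1 = sum(known)
        n0 = len(known) - n1
        guess = 1 if n1 > n0 else 0
        for row in t:
            if row[i] is None:
                row[i] = guess
    return t


demo = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, None],
    [None, 0, 1],
]
filled = adfis_fill(demo)
print(filled[4], filled[5])
```

In the demo, the first column is filled from the second through a consistent association, and the third column is then filled through an inconsistent association that already includes the newly inserted value. Each pass of the loop fills at least one unknown, so the loop terminates.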
Step 1 We construct Table 8 containing the values of $\max\{CD_{ij}, ID_{ij}\}$.
Step 2 Including the weight of the recently calculated $*_3$ in Table 9, we calculate Table 10 containing the new values of $\max\{CD_{ij}, ID_{ij}\}$.
In Table 10, the strongest association is that of $e_1$ with $e_2$, $SA_{12} = |-0.86| > \lambda$. Similar to Step 1, we put $*_4 = 0$ and obtain the updated Table 11.
Step 3 Based on the updated Table 11, we recalculate $\max\{CD_{ij}, ID_{ij}\}$ in Table 12 as follows.
It can be observed from Table 12 that $SA_{31} = |-0.86| > \lambda$ has now entered the defined threshold range of association, so we put $*_1 = 0$ and obtain the updated incomplete case in Table 13.
Step 4 The value of $\max\{CD_{ij}, ID_{ij}\}$ for Table 13 is recalculated in Table 14. As $SA_{51} = 0.71 < \lambda$ in Table 14, $e_5$ and $e_1$ are non-associated; therefore $*_2$ cannot be calculated through association for λ = 0.85. This case falls under Definition 3 and we use probability for it. We see from Table 13 that, for $e_5$, $n_1 = 0$ and $n_0 = 7$. As $n_0 > n_1$, we put $*_2 = 0$. Hence, using ADFIS, we obtain all missing values, giving the complete Table 15.

Results and discussion
In this section we discuss the improvement in accuracy achieved by ADFIS. First, we discuss the incomplete case of Example 2 with the prediction results of DFIS and ADFIS from Table 5 and Table 15, respectively. Then, we present the results obtained from DFIS and ADFIS for four UCI benchmark data sets and the Causality Workbench LUCAP2 data set. Some important points are discussed after the results, and the shortcomings of ADFIS are discussed at the end of this section.

Incomplete soft set of Example 2
Referring to the comparison in Table 16, all values predicted through DFIS are the same as those of ADFIS except $*_1$, although the threshold is the same for both approaches. $*_1$ not only received complementary values from the two techniques but was also calculated in different ways, i.e. through association in ADFIS and through probability in DFIS. DFIS itself shows that association is more reliable than probability; therefore we claim that the value of $*_1$ calculated as 0 using association by ADFIS is more accurate than the value 1 predicted by DFIS using probability. Suppose an unknown predicted through association has 90 % accuracy and one predicted through probability has 60 %. DFIS predicts two of the four unknowns through association and two through probability, so its average accuracy is (2 × 90 + 2 × 60)/4 = 75 %, while ADFIS predicts three through association and one through probability, giving (3 × 90 + 60)/4 = 82.5 %, i.e. about 83 %, for this case, as shown in Fig. 1.

UCI benchmark data sets
Following the evaluation of DFIS (Qin et al. 2012), we tested DFIS and ADFIS on four data sets from the UCI benchmark database (UCI Machine Learning Repository 2013).
We randomly deleted between 30 and 600 entries, ten times, from the Zoo, Flags, Congressional votes and SPECT hearts data sets and re-calculated the deleted entries by implementing both approaches (Fig. 2). We now discuss the experimental results for each data set one by one.
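The evaluation protocol (delete entries at random, refill them, compare with the original) can be sketched as below. This is an assumed reconstruction of the procedure, not the authors' code: the helper names are hypothetical, `majority_fill` is only a stand-in for DFIS or ADFIS, and the all-ones table is toy data chosen so that the outcome does not depend on the random seed.

```python
import random


def delete_random(table, k, rng):
    """Blank out k randomly chosen cells; return the damaged copy and the cells."""
    cells = [(r, c) for r in range(len(table)) for c in range(len(table[0]))]
    removed = rng.sample(cells, k)
    damaged = [list(row) for row in table]
    for r, c in removed:
        damaged[r][c] = None
    return damaged, removed


def majority_fill(table):
    """Stand-in filler: put the majority value of each column into its gaps."""
    filled = [list(row) for row in table]
    for c in range(len(filled[0])):
        known = [row[c] for row in filled if row[c] is not None]
        guess = 1 if 2 * sum(known) > len(known) else 0
        for row in filled:
            if row[c] is None:
                row[c] = guess
    return filled


def percent_accuracy(original, filled, removed):
    """Share of deleted cells whose refilled value equals the original value."""
    hits = sum(1 for r, c in removed if filled[r][c] == original[r][c])
    return 100.0 * hits / len(removed)


rng = random.Random(0)
original = [[1] * 5 for _ in range(10)]  # toy 10 x 5 Boolean table
damaged, removed = delete_random(original, 7, rng)
score = percent_accuracy(original, majority_fill(damaged), removed)
print(len(removed), score)
```

To compare two filling methods in this setup, both would be applied to the same `damaged` table and scored against the same `removed` cells, and the run would be repeated for each deletion count.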

Zoo data set
The Zoo data set contains 101 different animals with 18 features, such as the presence of feathers, teeth, backbone and hair. We selected only the 15 parameters having Boolean values and randomly deleted values ten times, removing 91, 87, 107, 91, 97, 98, 79, 82, 93 and 88 values, respectively. All deleted values were recalculated using both approaches (DFIS and ADFIS). The percent accuracy graph of these results is given in Fig. 3. The average accuracy of DFIS is 81.26 % while that of ADFIS is 84.67 %, i.e. ADFIS is 3.41 % more accurate than DFIS for the Zoo data set.

Flags data set
The Flags data set contains descriptions of the national flags of 128 countries with 28 parameters. Of these, only the 13 Boolean parameters are selected for our testing purpose. The accuracy graph for the randomly deleted 110, 43, 151, 92, 84, 151, 200, 538, 189 and 49 values is given in Fig. 4 for the Flags data set. The performance of ADFIS is 4.08 % better than that of DFIS, as the average accuracy of DFIS is 74.02 % while that of ADFIS is 78.10 %.
…, 98, 450, 182, 230, 62, 161, 47, 290 and 102. The percent performance graph is shown in Fig. 5.
The average accuracy of DFIS is 76.41 % while that of ADFIS is 78.20 %. Hence ADFIS performs 1.80 % better than DFIS for the SPECT hearts data set.

Congressional votes data set
This data set contains the 1984 voting records of members of the US Congress. 435 members cast yes or no votes on 16 issues, of which only 230 members have complete voting records. We selected only these complete records for testing and randomly deleted 161, 435, 122, 98, 263, 239, 205, 291, 424 and 136 values from this data set. After recalculating them through both approaches, we found that the average accuracy of DFIS is 65.50 % while ADFIS has 72.98 % accuracy.
The average performance of ADFIS is 7.84 % better than that of DFIS for this data set. The performance graph of ADFIS versus DFIS is plotted in Fig. 6.

Causality workbench LUCAP2 data set
Lung Cancer set with Probes (LUCAP) (Causality Workbench 2013) is an online data set containing Boolean-valued data generated artificially by causal Bayesian networks. There are ten thousand imaginary objects (patients) with 143 features (symptoms) such as Coughing, Fatigue, Yellow Fingers, Anxiety, Allergy, Attention Disorder and Smoking. Out of the 10,000 objects we selected only the first 1000, with all 143 parameters, for our testing purpose. We randomly deleted 322, 2354, 1190, 2083, 1432, 1158, 5413, 2457, 899 and 760 values and recalculated them through DFIS and ADFIS. We found that, for an average of 1807 unknowns, DFIS calculated 1294 values accurately, while ADFIS calculated 1328. Hence the average performance of ADFIS is 1.89 % better than that of DFIS for this data set. The percent accuracy graph of DFIS versus ADFIS for the LUCAP2 data set is given in Fig. 7.
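As a rough check of the figures quoted for LUCAP2, the average counts can be converted into accuracies directly. The worked arithmetic below uses only the numbers stated in the text; because it works from the averaged counts rather than per-run accuracies, it is an approximation and can differ slightly from the reported 1.89 %.

```latex
\[
\frac{322 + 2354 + 1190 + 2083 + 1432 + 1158 + 5413 + 2457 + 899 + 760}{10}
  = \frac{18068}{10} \approx 1807
\]
\[
\text{DFIS: } \frac{1294}{1807} \approx 71.6\,\%, \qquad
\text{ADFIS: } \frac{1328}{1807} \approx 73.5\,\%, \qquad
\frac{1328 - 1294}{1807} = \frac{34}{1807} \approx 1.9\,\%
\]
```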
In summary, the overall comparison results are given in the following Table 17.
From Table 17, we can conclude that ADFIS performs about 4.4 % better on average than DFIS.

Discussions
In this sub-section we discuss some important questions regarding the threshold (λ): its function, its range and suitable values. We also discuss the precise theoretical difference between DFIS and ADFIS, the validation of the proposed method and its performance evaluation.
The threshold λ is a filter that can be set according to the user's requirements to admit weaker associations or only stronger ones. The closer λ is to 1, the more reliable the selected associations; the closer it is to zero, the more likely it is that weaker associations are selected. To select only associations with more than 50 % similarity, λ must be set to 0.5 or above. In the incomplete case of Example 2 we set λ = 0.85 to select only parameter associations having at least 85 % similarity; the unknowns of parameters with less than 85 % similarity are calculated through probability in DFIS, while one of them ($*_1$) enters the threshold range in ADFIS. This reveals the core difference between DFIS and ADFIS. DFIS calculates all associations once for the whole data set and assigns missing values accordingly. We notice that the parameters satisfying the threshold can be further categorized into weaker and stronger associations within the range between the threshold and 1. One pair of parameters might have a marginal similarity of 85 %, while another pair may have a stronger similarity of 90 % or even 100 %. DFIS treats them all alike when finding missing values, whereas we calculate the unknowns through the strongest association first and use the filled values in the subsequent calculations. In this way, some of the unknowns that would otherwise be calculated through probability enter the association range and are likely to be predicted more accurately, since calculating unknowns through association is more reliable than through probability (Qin et al. 2012). The results of DFIS were validated by calculating decision values and comparing their MAPE with that of the approach of Zou et al. As the approach of Zou et al. does not calculate missing values, DFIS used an indirect method of validation. In our case, both DFIS and ADFIS calculate the actual missing values, so we do not need to validate them indirectly through decision values. Instead, we directly compare the actual results of both techniques with the original data, and the higher accuracy of ADFIS validates its better performance.

Weaknesses of the ADFIS
Apart from its improved accuracy, ADFIS has two main limitations compared to DFIS.

Incorrect results in rare cases
Sometimes the strongest association is spurious, either because too many values are missing or because no real association exists. In this case, if the missing values calculated in the first step of ADFIS are incorrect, they also affect the values calculated in the following steps. This can be seen in the 2nd and 9th test results of the SPECT hearts data set graph, where DFIS has higher accuracy than ADFIS.

High computational complexity
The computational complexity of ADFIS is clearly higher than that of DFIS. DFIS accesses a data set of size m × n once to find the associations, while ADFIS requires on the order of $(m \times n)^2$ accesses during its execution, i.e. the complexity of ADFIS is that of DFIS multiplied by itself.

Conclusion
In this paper, we have discussed three previous approaches for predicting missing data in incomplete soft sets and identified DFIS as the most suitable among them. We have presented an alternative data filling approach for incomplete soft sets (ADFIS) with the aim of improving accuracy. We have re-arranged the process of DFIS so that the maximum possible number of unknowns in an incomplete soft set can be predicted through the association between parameters. We have presented a modified algorithm and explained ADFIS with the help of an example as a proof of concept. We have also compared the results of ADFIS with those of the existing DFIS approach after implementing both in Matlab for four UCI benchmark data sets and the Causality Workbench lung cancer data set (LUCAP2), and presented the average results of both approaches in the form of graphs. ADFIS improved the accuracy of the predicted unknowns by 4.44 % on average compared to DFIS over all five data sets. We also noted two main limitations of ADFIS, namely wrong predictions in rare cases and high computational complexity, which can be addressed in future work.