 Research
 Open Access
Evolutionary approach to violating group anonymity using third-party data
 Dan Tavrov^{1} and
 Oleg Chertov^{1}
Received: 4 September 2015
Accepted: 8 January 2016
Published: 26 January 2016
Abstract
In the era of Big Data, it is almost impossible to completely restrict access to primary non-aggregated statistical data. However, the risk of violating the privacy of individual respondents and groups of respondents by analyzing primary data has not been reduced. There is a need to develop subtler methods of data protection to come to grips with these challenges. In some cases, individual and group privacy can be easily violated, because the primary data contain attributes that uniquely identify individuals and groups thereof. Removing such attributes from the dataset is a crude solution and does not guarantee complete privacy. In the field of providing individual data anonymity, this problem has been widely recognized, and various methods have been proposed to solve it. In the current work, we demonstrate that it is possible to violate group anonymity as well, even if those attributes that uniquely identify the group are removed. As it turns out, it is possible to use third-party data to build a fuzzy model of a group. Typically, such a model comes in the form of a set of fuzzy rules, which can be used to determine membership grades of respondents in the group with a level of certainty sufficient to violate group anonymity. In this work, we introduce a method based on evolutionary computing to build such a model. We also discuss a memetic approach to protecting the data from group anonymity violation in this case.
Keywords
 Group anonymity
 Privacy-preserving data publishing
 Fuzzy inference
 Memetic algorithm
 Microfile
 Subgroup discovery
Background
A son of Dmitrii Mendeleyev, the world-renowned chemist and creator of the periodic table, recalls (Tishchenko and Mladientsev 1993, pp. 353–354) an interesting fact. In 1890, his father came up with a formula of the smokeless pirokollody gunpowder (Gordin 2003), which at the time was thoroughly protected by French manufacturers. As it turned out, Mendeleyev’s findings were based on analyzing public statistical data from the railroad company annual report on freight traffic. A separate branch line supplied the gunpowder factory. Annual statistics provided all the necessary information to easily retrieve the gunpowder composition ratios.
One hundred and twenty five years later, in the era of Big Data, various statistical data are publicly available. The task of ensuring that security-sensitive information does not leak out becomes much more challenging. A modern person lives and works in a society oriented toward collecting and storing data on each and every individual. Statistics services do that (census forms), taxing services do that (tax declarations), medical facilities do that (patients’ medical records), law enforcement agencies do that (personal IDs), employers do that (CVs), retail stores do that (personal discount cards), security services do that (security camera footage), and so on and so forth. Problems of preserving privacy in such data are widely discussed within the field of privacy-preserving data publishing (Fung et al. 2010; Wong and Fu 2010). To a great extent, appropriate protection implies removing identifiers (passport data, full name etc.), and distorting the data (e.g., values of certain characteristics are swapped between respondents or perturbed with noise) or suppressing them (e.g., data on elderly people are grouped in a category of senior citizens).
At the same time, problems of protecting group distributions for certain categories of respondents remain unsolved. Let us consider a case when an abnormal concentration of nuclear physicists in a specific territory reveals the site of a secret nuclear research facility. Of course, removing such attributes as Occupation or Industry seems to be a first choice. However, the risk of privacy violation remains high if there is information about where respondents pursued their higher education (e.g., the National Institute for Nuclear Science and Technology for academic training in atomic energetics is situated in Saclay commune, France), or about where they lived (for instance, Dubna, Russian Federation, is home to the Joint Institute for Nuclear Research). Therefore, the task of protecting distributions for a certain group of respondents (which can be persons, households, enterprises etc.) with minimal distortion of primary statistical data is a pressing one.
There are numerous practical cases when we do not have attributes at our disposal that classify a respondent as belonging to a certain group (either because they were deliberately removed by the data publisher, or because they were not present in the first place). However, we can try to restore group distributions by analyzing publicly available data such as statistical surveys, polls etc. (Chertov and Tavrov 2015). Using expert judgments about these data, we can build a fuzzy model of a group in the form of a fuzzy inference system (FIS) that, for each respondent, gives her membership grade in the group under consideration. A distribution constructed this way can violate group anonymity as discussed above.
Expert judgments are often not a reliable source of the fuzzy rules that constitute the main part of any FIS. Sometimes, it is hard even to properly identify the attributes necessary to include in a model of a group, let alone determine particular fuzzy rules. In this work, we propose an evolutionary method of building the fuzzy model using third-party data. We also describe a memetic algorithm for solving the task of anonymizing the obtained distribution. This algorithm seeks minimal distortion in the microfile, and at the same time ensures that group anonymity cannot be violated.
Related work
Data anonymity

Two kinds of anonymity can be distinguished for a dataset:

individual anonymity means that a single respondent is unidentifiable within a given dataset;

group anonymity means that information about a group of respondents cannot be used to violate sensitive features of appropriate distributions.
Methods for providing individual anonymity are discussed in the field of privacy-preserving data publishing (Fung et al. 2010; Wong and Fu 2010). Plenty of methods have been proposed over the years, including randomization (Evfimievski 2002), microaggregation (Domingo-Ferrer and Mateo-Sanz 2002), data swapping (Fienberg and McIntyre 2005), differential privacy (Dwork 2006), etc. A comprehensive overview of recent developments in the field can be found in Sowmyarani and Srinivasan (2012) and Rashid and Yasin (2015).
The problem of violating data group anonymity, i.e., anonymity not of individual respondents, but of groups thereof, was first introduced in the context of providing group anonymity in Chertov and Tavrov (2010). It was shown that group anonymity can be violated by analyzing outliers of a so-called quantity signal \({\mathbf {q}} = \left( q_1, q_2, \ldots , q_{l_p}\right)\), where each \(q_k\), \(k = 1,2,\ldots , l_p\), stands for the number of respondents belonging to a given group (e.g., group of military personnel, or group of nuclear scientists) in a given submicrofile, and \(l_p\) is the total number of submicrofiles. A submicrofile is a subset of microfile records sharing the same property, such as region of work. In Chertov and Tavrov (2010), it was argued that outliers in a quantity signal that corresponds to the regional distribution of military personnel can be used to disclose locations of (potentially classified) military bases.
In Chertov and Tavrov (2012), the concept of a quantity signal has been taken further by introducing a concentration signal \({\mathbf {c}} = \left( c_1, c_2, \ldots , c_{l_p}\right)\), where each \(c_k\), \(k = 1,2,\ldots , l_p\), is obtained by dividing the corresponding \(q_k\) by a total number of records in a corresponding submicrofile. The concentration signal can be used to violate anonymity of groups when absolute numbers of respondents are not sufficient. For instance, as was argued in Chertov and Tavrov (2012) using scientists as an example, extreme ratios of scientists working in a given region could potentially give away the location of a classified research center.
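The relation between the two signals can be illustrated with a minimal sketch. The function name `concentration_signal` and the sample counts are assumptions made for illustration only; they do not come from the cited works.

```python
from typing import List, Sequence


def concentration_signal(q: Sequence[float], rho: Sequence[int]) -> List[float]:
    """Divide each q_k by the number of records rho_k in the k-th submicrofile."""
    if len(q) != len(rho):
        raise ValueError("q and rho must have the same length")
    if any(r <= 0 for r in rho):
        raise ValueError("every submicrofile must contain at least one record")
    return [q_k / rho_k for q_k, rho_k in zip(q, rho)]


# Hypothetical data: the third region holds few group members in absolute terms,
# yet they make up an unusually large share of its records.
q = [120, 95, 40, 110]
rho = [12000, 10000, 400, 11000]
c = concentration_signal(q, rho)
```

In this made-up example, the third element is the smallest value of the quantity signal but by far the largest value of the concentration signal, which is the situation in which absolute numbers alone would not give the group away.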
In general, group anonymity can be violated by analyzing such sensitive properties of quantity and concentration signals as (Chertov 2010, p. 77) outliers (almost always a sensitive feature of any distribution), certain statistical features and trends (especially in the case when the quantity signal represents an ordered sequence of numbers), cycles or periods (especially when the quantity signal represents a time series), or frequency spectrum.
In certain practical applications, when the groups are defined in terms of specific attributes (such as a group of military personnel, which is defined by a special attribute uniquely identifying a respondent as a military enlisted), it is possible to protect group anonymity by removing this attribute from the original dataset before publishing. Being a crude solution by itself, it is still not applicable in a number of cases, when it is possible to build an approximation of a group, i.e., define a set of records in the dataset such that its quantity or concentration signal is sufficiently similar to the original one so that it is possible to violate anonymity of the group in question.
Taking into consideration uncertain and imprecise nature of statistical datasets, it was proposed in Chertov and Tavrov (2015) to violate group anonymity with the help of a fuzzy model of a group.
In Chertov and Tavrov (2014), a method for providing group anonymity based on memetic computing was proposed. This method enables us to modify the quantity (or concentration) signal in order to mask its outliers, and at the same time tries to minimize distortion introduced in the dataset. In Tavrov (2015), this algorithm was adapted to work with the fuzzy models proposed in Chertov and Tavrov (2015).
In the next subsection, we will briefly review the concept of fuzzy inference, which is necessary for discussing fuzzy models of groups of respondents.
Fuzzy inference
The concept of a fuzzy set was first introduced in Zadeh (1965). A fuzzy set A in a universal set X is a class, in which a point \(x\in X\) may have a grade of membership in the interval \(\left[ 0, 1\right]\). Each fuzzy set A is characterized by a membership function \(\mu _A: X \rightarrow \left[ 0, 1\right]\), which associates with each \(x\in X\) a real number in the interval \(\left[ 0, 1\right]\) considered as the “grade of membership” of x in A.
Fuzzy sets constitute a core of linguistic variables (Zadeh 1975). An ordinary variable is characterized by a triple \(\left( X, U, R\left( X, u\right) \right)\), in which X is the name of the variable, U is the universe of discourse, u is a generic name for the elements of U, and \(R\left( X, u\right)\) is a subset of U, which represents a restriction on the values of u imposed by X. A fuzzy variable differs from the ordinary one in that R is a fuzzy subset of U, which represents a fuzzy restriction on the values of u imposed by X.
A linguistic variable differs from an ordinary numerical variable in that its values are not numbers but words or sentences in a natural or artificial language. It is formally characterized by a quintuple \(\left( {\mathcal {X}}, T\left( {\mathcal {X}}\right) , U, G, M\right)\), in which \({\mathcal {X}}\) is the name of the variable; \(T\left( {\mathcal {X}}\right)\) denotes the term-set of \({\mathcal {X}}\), that is, the set of names of linguistic values of \({\mathcal {X}}\), with each value being a fuzzy variable denoted generically by X and ranging over a universe of discourse U, which is associated with the base variable u; G is a syntactic rule for generating the names, X, of values of \({\mathcal {X}}\); and M is a semantic rule for associating with each X its meaning, \(M\left( X\right)\), which is a fuzzy subset of U. The meaning, \(M\left( X\right)\), of a term X is defined to be the restriction, \(R\left( X\right)\), on the base variable u, which is imposed by the fuzzy variable named X. For example, we can consider a linguistic variable named Number, which is associated with the finite term-set \(T\left( \text {Number}\right) = \text {few} + \text {several} + \text {many}\), where \(+\) denotes union, and in which each term represents a restriction on the values of u in the universe of discourse \(U = 1 + 2 + \cdots + 10\).
In Chertov and Tavrov (2015), an expert-based procedure was proposed for building a fuzzy model of a given group to be protected in the form of a fuzzy inference system (Klir and Yuan 1995), i.e., a system which employs expert knowledge in the form of fuzzy rules for making inferences. Such a fuzzy model can then be thought of as a fuzzy classifier that assigns to a given respondent a certain grade of membership in the group.
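As a minimal sketch, the linguistic variable Number can be represented as a mapping from each term to a fuzzy subset of U. The membership grades below are assumed purely for illustration, and the helper names `grade` and `fuzzify` are hypothetical.

```python
# Universe of discourse U = {1, ..., 10} for the base variable u.
U = range(1, 11)

# Meaning M(X) of each term X: a fuzzy subset of U stored as {u: grade}.
# Elements that are not listed have grade 0. All grades are assumed values.
NUMBER_TERMS = {
    "few": {1: 0.4, 2: 0.8, 3: 1.0, 4: 0.4},
    "several": {3: 0.5, 4: 0.8, 5: 1.0, 6: 1.0, 7: 0.8, 8: 0.5},
    "many": {6: 0.4, 7: 0.6, 8: 0.8, 9: 0.9, 10: 1.0},
}


def grade(term: str, u: int) -> float:
    """Grade of membership of u in the fuzzy set named by `term`."""
    if u not in U:
        raise ValueError("u lies outside the universe of discourse")
    return NUMBER_TERMS[term].get(u, 0.0)


def fuzzify(u: int) -> dict:
    """Grades of u in every term of the term-set T(Number)."""
    return {term: grade(term, u) for term in NUMBER_TERMS}
```

With these assumed grades, `fuzzify(4)` reports that 4 is "several" to a larger degree than it is "few", and not "many" at all, which shows how a single base value can belong to several terms at once.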
One of the biggest challenges in creating a fuzzy model of a group is coming up with a comprehensive and complete set of rules. When the number of input variables is relatively big, the total number of consistent fuzzy rules can grow beyond a point when it is all but impossible to use subjective expert knowledge to formalize them.
In some cases, the problem is not only that of defining proper fuzzy rules, but also of deciding which variables to account for in the antecedents. For instance, in the case of building a fuzzy model of a group of military personnel, a choice needs to be made as to which microfile attributes need to be considered to accurately classify a given respondent as a military person. In many practical tasks, there is no way of knowing this beforehand, so appropriate efficient search algorithms, such as evolutionary algorithms, should be applied.
Evolutionary approach to building fuzzy rules
Evolutionary algorithms are heuristic generate-and-test algorithms that mimic biological evolution by natural selection (Eiben and Smith 2015, p. 5). The task of creating a fuzzy rule set that enables us to violate group anonymity is a complex one, so evolutionary algorithms are a suitable approach to solving it.
Historically, application of evolutionary and, in particular, genetic algorithms to evolving rule-based systems was first proposed in Holland (1976) in the context of learning classifier systems. Such systems were described (Eiben and Smith 2015, p. 108) as a framework for studying learning in condition:action rule-based systems, using genetic algorithms as the method for the discovery of new rules.
Over the years, evolutionary algorithms have been proposed for evolving fuzzy rules as well. For instance, in Ishibuchi et al. (1995, 1999), an evolutionary algorithm was proposed for evolving fuzzy classifiers, i.e., rule-based systems with fuzzy rules for solving classification tasks. In such systems, consequents (right parts) of the rules in the form (3) are labels of classes of interest rather than linguistic variables.
The task of evolving fuzzy rules for violating group anonymity can be viewed as a task of subgroup discovery, which is defined (Wrobel 1997) as the task of finding interesting subgroups in a population of individuals, where interestingness is defined as distributional unusualness with respect to a certain property of interest. Subgroup discovery represents (Jesus et al. 2007) a form of supervised inductive learning, in which, given a set of data and a property of interest to the user, an attempt is made to locate subgroups that are statistically most interesting for the user.
Since the subgroups discovered in data need to be of a more explanatory nature (interpretability of the extracted knowledge for the final user is a crucial aspect), a fuzzy approach (Jesus et al. 2007) for a subgroup discovery process, which considers linguistic variables in descriptive fuzzy rules, is a good approach to take.
It is important to make a distinction between subgroup discovery and the task of classification, because (Carmona et al. 2014) subgroup discovery attempts to describe knowledge by data while a classifier attempts to predict the target value for new data to incorporate in the model. In the context of a fuzzy model of a group of respondents, whose anonymity needs to be violated, we are more interested in the classification side. However, many ideas from the field of subgroup discovery can provide useful insight, as will be shown in the paper. An overview of recent developments in the field of subgroup discovery can be found in Atzmueller (2015). Evolutionary algorithms for subgroup discovery are discussed in Carmona et al. (2014).
In general, two approaches to evolving rule-based systems can be distinguished: the Michigan approach (Valenzuela-Rendón 1991) and the Pittsburgh approach (Smith 1980). In the first case, each individual in the evolutionary algorithm population corresponds to a single rule. In the second case, each individual is a complete model, i.e., the whole set of rules.
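The difference between the two encodings can be sketched with two small data structures. The tuple encoding of a rule and all names below are hypothetical and serve only to illustrate what one individual holds in each approach.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed encoding: a rule is a tuple of term indices, one per linguistic variable.
Rule = Tuple[int, ...]


@dataclass
class MichiganIndividual:
    rule: Rule  # one individual = one rule


@dataclass
class PittsburghIndividual:
    rules: List[Rule]  # one individual = a complete rule set


def michigan_model(population: List[MichiganIndividual]) -> List[Rule]:
    """With Michigan-style encoding, a rule set is assembled from many individuals."""
    return [individual.rule for individual in population]


rules = [(1, 0, 2), (0, 3, 1), (2, 2, 0)]
michigan_population = [MichiganIndividual(r) for r in rules]
pittsburgh_individual = PittsburghIndividual(list(rules))
```

In this sketch the same three rules occupy three individuals under the Michigan encoding and a single individual under the Pittsburgh encoding, so a Pittsburgh population of equal size carries many more rules to evaluate.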
In the extraction of rules for the subgroup discovery task, the Michigan approach is more suited because (Jesus et al. 2007) the objective is to find a reduced set of rules, in which the quality of each rule is evaluated independently of the rest, and it is not necessary to evaluate jointly the set of rules. Moreover, the computation load of the Pittsburgh approach is typically much higher (Ishibuchi et al. 1999, p. 616).
Rules used for describing a subgroup differ in their ability to describe an interesting subgroup, which is measured by a certain quality measure. In general, quality measures can be grouped (Freitas 1999) into objective and subjective measures. Since subjective measures involve experts for evaluating rules, we will focus only on objective measures that are data-driven and do not involve expert judgment. A comprehensive overview of quality measures can be found in Lavrač et al. (1999).
However, for the task of violating anonymity of a group of respondents with the help of fuzzy rules in terms of disclosing outliers in the quantity signal, quality measures described in the literature are not suitable. We are interested in cumulative classification properties of fuzzy rules. In other words, we tolerate a certain degree of misclassification, as long as outliers in the quantity signal obtained with the help of the fuzzy rules correspond to the ones in the original quantity signal. In this work, we propose a novel quality measure that takes this into account.
We also propose a version of an evolutionary algorithm for building a fuzzy model of a group as a set of fuzzy rules, which differs from the ones described in the literature in the quality measure used for evaluating fuzzy rules. The fuzzy model evolved using such an algorithm can be used for violating group anonymity in terms of disclosing outliers in the quantity signal.
Group anonymity basics
To set the stage for discussing the fuzzy model of a group, we will first introduce some basic notation.
General group anonymity definitions
Let us define microdata as the data about certain respondents presented in a form of a depersonalized microfile \({\mathbf {M}}\) (i.e., a microfile without identifiers). Each record \({\mathbf {r}}^{\left( i\right) }\), \(i = 1,2,\ldots , \rho\), in this microfile contains values of several attributes \(w_j\), \(j = 1,2,\ldots , \eta\). Let us denote by \({\mathbf {w}}_j\) the set of all the values of \(w_j\).
There are two types of attributes of the microfile necessary to define a group. Let \(w_{v_j}\), \(j = 1,2,\ldots , l\), denote vital microfile attributes. These attributes represent those characteristics of records that enable us to determine whether they belong to a group or not. Let us define a vital value combination V as an element of the Cartesian product \({\mathbf {w}}_{v_1} \times {\mathbf {w}}_{v_2} \times \cdots \times {\mathbf {w}}_{v_l}\). Let us denote a set of vital value combinations by \({\mathbf {V}} = \left\{ V_1, \ldots , V_{l_v}\right\}\). We will call records whose attribute values belong to \({\mathbf {V}}\) vital records. We will denote vital records by \({\mathbf {r}}_v^{\left( i\right) }\), \(i = 1,2,\ldots , \rho _v\).
Let \(w_p\), \(p \ne v_j \forall j\) denote a parameter microfile attribute. This attribute determines values, over which we should take the distribution of a group defined by the vital attributes. A parameter value P can be defined as a value of the parameter attribute, i.e., \(P \in {\mathbf {w}}_p\). Let us denote a set of parameter values by \({\mathbf {P}} = \left\{ P_1, \ldots , P_{l_p}\right\}\). By their nature, parameter values enable us to divide \({\mathbf {M}}\) into several submicrofiles \({\mathbf {M}}_1, \ldots , {\mathbf {M}}_{l_p}\). Each submicrofile \({\mathbf {M}}_k\) contains \(\rho _k\) records, \(k = 1,2,\ldots , l_p\), \(\sum _k \rho _k = \rho\). All the records in a certain submicrofile \({\mathbf {M}}_k\) share the same parameter value \(P_k\).
A word of caution is in order. Throughout this paper, we will assume that if \({\mathbf {M}}\) contains several attributes that can be concatenated to form a single parameter attribute, they will be concatenated.
We will call all the other attributes \(w_{b_j}\), \(j = 1,2,\ldots , t\), \(b_j \ne p\), \(b_j \ne v_i \forall i, j\), basic attributes. Obviously, \(t = \eta - l - 1\).
The group of records \(G\left( {\mathbf {V}}, {\mathbf {P}}\right)\), whose distribution needs to be masked when providing group anonymity, can be determined by the values of the vital and parameter attributes. We will denote the distribution of G, whose sensitive features need to be protected, by \(\Omega \left( {\mathbf {M}}, G\right)\). In consistency with existing literature, we will call this distribution the goal representation of a group. Throughout this paper, we will limit ourselves to a particular goal representation most widely used in practice called the quantity signal. This signal is denoted by \({\mathbf {q}} = \left( q_1, q_2, \ldots , q_{l_p}\right)\), where each \(q_k\), \(k = 1,2,\ldots , l_p\), stands for a number of records in \({\mathbf {M}}_k\) that belong to G, i.e., whose vital attribute values belong to \({\mathbf {V}}\).
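The definition of the quantity signal can be sketched as follows. The representation of a microfile as a list of dictionaries, the function name `quantity_signal`, and the attribute names in the example are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Set, Tuple

Record = Dict[str, Hashable]


def quantity_signal(
    microfile: Sequence[Record],
    vital_attrs: Sequence[str],
    vital_combinations: Set[Tuple[Hashable, ...]],
    param_attr: str,
    param_values: Sequence[Hashable],
) -> List[int]:
    """Count, for each parameter value P_k, the vital records in submicrofile M_k."""
    counts = Counter(
        record[param_attr]
        for record in microfile
        # A record is vital if its vital attribute values form a combination in V.
        if tuple(record[a] for a in vital_attrs) in vital_combinations
    )
    return [counts[p] for p in param_values]


# Hypothetical microfile with one vital attribute and one parameter attribute.
M = [
    {"occupation": "physicist", "region": "A"},
    {"occupation": "teacher", "region": "A"},
    {"occupation": "physicist", "region": "B"},
    {"occupation": "physicist", "region": "B"},
    {"occupation": "physicist", "region": "B"},
    {"occupation": "driver", "region": "B"},
    {"occupation": "teacher", "region": "C"},
]
q = quantity_signal(M, ["occupation"], {("physicist",)}, "region", ["A", "B", "C"])
```

Here `q` has one element per parameter value, in the order in which the parameter values are listed, and a submicrofile without vital records contributes a zero.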
Quantity signal and its sensitive features
As pointed out before, when providing group anonymity, it is necessary to protect sensitive features of the goal representation under consideration. In this work, we will consider such sensitive features of a quantity signal as its outliers. Outliers of a quantity signal might attract attention to parameter submicrofiles that are supposed to be indistinguishable (sites of military bases, classified research centers etc.).
By outliers of a quantity signal, we will understand its values that are statistically inconsistent with the rest of the signal. Several approaches to determining outliers in a given dataset have been proposed. According to the American National Standard of the American Society of Mechanical Engineers ASME PTC 19.1 (ASME 2013, p. 78), two tests are in common usage, the Thompson \(\tau\) Technique (Thompson 1935) and the Grubbs Method (Grubbs 1969). In this work, we propose to use the Modified Thompson \(\tau\) Technique (MTTT) as the method recommended by ASME (2013, p. 79) for identifying suspected outliers. This method is based on the Student’s t-distribution (Student 1908), which is most applicable in situations when the sample size is small, which is typically the case with the quantity signals.
The MTTT consists of the following steps:
 1. Calculate the sample mean and the sample standard deviation:
$$\begin{aligned} \overline{{\mathbf {q}}} = \frac{1}{m_{{\mathbf {q}}}}\sum _{i = 1}^{m_{{\mathbf {q}}}}q_i, \quad \sigma _{{\mathbf {q}}} = \sqrt{\frac{\sum _{i=1}^{m_{{\mathbf {q}}}}\left( q_i - \overline{{\mathbf {q}}}\right) ^2}{m_{{\mathbf {q}}} - 1}}, \end{aligned}\tag{4}$$
where \(m_{{\mathbf {q}}}\) is the number of elements in \({\mathbf {q}}\).
 2. For each signal value \(q_i\), \(i = 1,2,\ldots , m_{{\mathbf {q}}}\), calculate the absolute value of its deviation from the sample mean \(\overline{{\mathbf {q}}}\):
$$\begin{aligned} d_i = \left| q_i - \overline{{\mathbf {q}}} \right| . \end{aligned}\tag{5}$$
 3. Calculate \(\tau\) according to
$$\begin{aligned} \tau = \frac{t_{\alpha /2}\cdot \left( m_{{\mathbf {q}}} - 1\right) }{\sqrt{m_{{\mathbf {q}}}}\sqrt{m_{{\mathbf {q}}} - 2 + t_{\alpha /2}^2}}, \end{aligned}\tag{6}$$
where \(t_{\alpha /2}\) is the critical Student’s t value (Student 1908) based on significance level \(\alpha\) and \(m_{{\mathbf {q}}} - 2\) degrees of freedom.
 4. If there is an i such that \(d_i > \tau \sigma _{{\mathbf {q}}}\), then \(q_i\) is an outlier. In this case, we need to remove \(q_i\) from the signal and return to step 1. If \(d_i \le \tau \sigma _{{\mathbf {q}}}\) for all i, the algorithm stops.
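The four steps can be sketched in Python using only the standard library. Two choices below are assumptions rather than part of the described procedure: the critical value \(t_{\alpha /2}\) is approximated numerically (Simpson's rule plus bisection) to avoid external dependencies, and on each pass only the element with the largest deviation is tested and removed.

```python
import math
from statistics import mean, stdev
from typing import List, Sequence


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    return math.exp(log_c) / math.sqrt(df * math.pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x: float, df: int, steps: int = 2000) -> float:
    """CDF for x >= 0, using composite Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(alpha: float, df: int) -> float:
    """Approximate t_{alpha/2}: the point with upper-tail probability alpha/2."""
    lo, hi, target = 0.0, 1000.0, 1 - alpha / 2
    for _ in range(60):  # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def mttt_outliers(q: Sequence[float], alpha: float = 0.05) -> List[int]:
    """Return the 0-based indices of the elements of q flagged as outliers."""
    remaining = list(range(len(q)))
    outliers = []
    while len(remaining) >= 3:  # m - 2 degrees of freedom must be positive
        values = [q[i] for i in remaining]
        m = len(values)
        q_bar, sigma = mean(values), stdev(values)  # step 1, Eq. (4)
        t = t_critical(alpha, m - 2)
        tau = t * (m - 1) / (math.sqrt(m) * math.sqrt(m - 2 + t * t))  # step 3, Eq. (6)
        # Step 2, Eq. (5): only the largest deviation is tested (assumed convention).
        worst = max(remaining, key=lambda i: abs(q[i] - q_bar))
        if abs(q[worst] - q_bar) > tau * sigma:  # step 4
            outliers.append(worst)
            remaining.remove(worst)
        else:
            break
    return sorted(outliers)


signal = [10, 12, 11, 13, 12, 11, 50, 12, 10, 11]  # hypothetical quantity signal
flagged = mttt_outliers(signal)
```

For this made-up signal, the value 50 is removed on the first pass, and the remaining values pass the test on the second pass, so the procedure stops.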

The sample mean and the sample standard deviation (4) are themselves sensitive to outliers. More robust estimates are:

the median, which can be interpreted as the “middle” value of a signal and is estimated (for the elements of \({\mathbf {q}}\) sorted in ascending order) by
$$\begin{aligned} M_{{\mathbf {q}}} = \left\{ \begin{array}{ll} q_{\left( m_{{\mathbf {q}}} + 1\right) /2}, &{}\quad m_{{\mathbf {q}}} \text { is odd}\\ \frac{q_{m_{{\mathbf {q}}}/2} + q_{m_{{\mathbf {q}}}/2 + 1}}{2}, &{}\quad m_{{\mathbf {q}}} \text { is even} \end{array}\right. \end{aligned}\tag{7}$$

the pseudo-standard deviation, which can be defined based on the interquartile range (IQR):
$$\begin{aligned} s_{ps{\mathbf {q}}} = \frac{q_{0.75} - q_{0.25}}{1.349}, \end{aligned}\tag{8}$$
where \(q_{0.75}\) (\(q_{0.25}\)) is the upper (lower) quartile. If \(m_{{\mathbf {q}}}\) is even, the upper (lower) quartile is the median of the largest (smallest) \(\frac{m_{{\mathbf {q}}}}{2}\) observations. If \(m_{{\mathbf {q}}}\) is odd, the upper (lower) quartile is the median of the largest (smallest) \(\frac{m_{{\mathbf {q}}} + 1}{2}\) observations.
In this work, we will use the MTTT as described above, where estimates (7) and (8) are used in place of estimates (4).
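A minimal sketch of the estimates (7) and (8) follows, with the quartile convention taken from the description above. The function names and the sample signal are hypothetical.

```python
from statistics import median
from typing import Sequence, Tuple


def quartiles(q: Sequence[float]) -> Tuple[float, float]:
    """Lower and upper quartiles as medians of the smallest and largest halves."""
    s = sorted(q)
    m = len(s)
    # m/2 observations for even m, (m + 1)/2 observations for odd m.
    half = m // 2 if m % 2 == 0 else (m + 1) // 2
    return median(s[:half]), median(s[-half:])


def pseudo_std(q: Sequence[float]) -> float:
    """Pseudo-standard deviation, Eq. (8): IQR divided by 1.349."""
    q25, q75 = quartiles(q)
    return (q75 - q25) / 1.349


signal = [10, 12, 11, 13, 12, 11, 50, 12, 10, 11]  # hypothetical quantity signal
center = median(signal)      # estimate (7)
spread = pseudo_std(signal)  # estimate (8)
```

For this made-up signal, the single extreme value barely affects the median and the pseudo-standard deviation, whereas it pulls the sample mean and the sample standard deviation far away from the bulk of the values.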
Typically, a set of outliers yielded by MTTT contains signal elements that would not be considered outliers by an expert. Moreover, in some practical cases, not all outliers need to be masked. For example, when there is a well-known military base associated with a particular signal element, masking the corresponding outlier will distort the data and make it obvious that the primary data have been tampered with. Therefore, in the context of providing group anonymity, it is necessary for an expert to revise the set of outliers as determined by MTTT.
Let us denote by \(OUT\left( {\mathbf {q}}\right)\) the set of indexes of \({\mathbf {q}}\) that correspond to outliers yielded by MTTT. Let us denote by \(OUT_e\left( {\mathbf {q}}\right) \subseteq OUT\left( {\mathbf {q}}\right)\) the subset of indexes of \({\mathbf {q}}\) obtained by excluding from \(OUT\left( {\mathbf {q}}\right)\) those indexes, which an expert considers as not important for the task at hand. For brevity, we will also denote by \(OUT_e^{\prime }\left( {\mathbf {q}}\right)\) the relative complement of \(OUT_e\left( {\mathbf {q}}\right)\) with respect to \(\left\{ 1, 2, \ldots , l_p\right\}\).
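These three index sets map directly onto set operations. In the sketch below, the flagged indexes and the expert's exclusions are made-up values, and the variable names are hypothetical.

```python
l_p = 10                   # number of submicrofiles (assumed)
out = {2, 6, 9}            # OUT(q): indexes flagged by MTTT (assumed)
dismissed = {9}            # indexes the expert considers unimportant (assumed)

out_e = out - dismissed                             # OUT_e(q), a subset of OUT(q)
out_e_complement = set(range(1, l_p + 1)) - out_e   # OUT_e'(q)
```

Note that an index dismissed by the expert ends up in the complement together with the indexes that MTTT never flagged.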
The task of providing group anonymity

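A solution to the task of providing group anonymity (TPGA) should have the following three properties:

disclosure risk is low or at least adequate to importance of information being protected;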

both original and protected microfile data, when analyzed, yield sufficiently similar results;

the cost of transforming the data is acceptable.
In this paper, by the TPGA, we will understand the task of modifying the microfile in such a way that it is no longer possible to determine outliers in the quantity signal, and at the same time introduce as little distortion as possible in the process.
The easiest “solution” to the TPGA is to recode vital values or remove some of the vital attributes, so that it is impossible to restore the original quantity signal. However, this approach satisfies only one out of three properties stated above, namely, it is easy to carry out. At the same time, this simplistic approach only gives an impression of reducing the disclosure risk. As we will demonstrate later, if an adversary has access to appropriate third-party data, sensitive features of the group distribution can be violated under several conditions.
Therefore, even if we choose to remove the vital attributes (or otherwise modify them), we will still need to perform additional microfile modifications in order to properly protect anonymity of a given group.
Auxiliary microfiles

Harmonizing a microfile may involve the following transformations:

attributes \(w_{j_1}, \ldots , w_{j_n}\) are replaced by a single harmonized attribute \(w^H_{j_1}\);

several values of the \(j{\rm {th}}\) attribute \(w_j^{(i_1)}, \ldots , w_j^{(i_n)} \in {\mathbf {w}}_j\), \(j \in \left. \left\{ v_k\right\} \right| _{k = 1,2,\ldots , l} \cup \left. \left\{ b_k\right\} \right| _{k = 1,2,\ldots , t}\), are replaced by a single value \(w_j^{H\left( i_1\right) }\) of the \(j{\rm {th}}\) harmonized attribute, which may or may not be equal to any of the values in \({\mathbf {w}}_j\).

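Group anonymity can be violated with the help of an auxiliary microfile \(\tilde{{\mathbf {M}}}\) when the following conditions hold:

records in \(\tilde{{\mathbf {M}}}\) and in \({\mathbf {M}}\) are drawn from sufficiently similar distributions;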

\(\tilde{{\mathbf {M}}}\) contains auxiliary vital attributes that have the same values and interpretation as the vital attributes in \({\mathbf {M}}\). Auxiliary vital attributes can be used to determine auxiliary vital records, whose total number is \(\tilde{\rho }_v\). In addition, vital and auxiliary vital records (as well as the records that are not vital or auxiliary vital, respectively) are drawn from sufficiently similar distributions;

\({\mathbf {M}}\) and \(\tilde{{\mathbf {M}}}\) can be transformed into their harmonized versions \({\mathbf {M}}^H\) and \(\tilde{{\mathbf {M}}}^H\), so that their basic attributes are identical both in terms of values and their interpretation. More precisely, \({\mathbf {M}}^H\) and \(\tilde{{\mathbf {M}}}^H\) contain harmonized basic attributes \(w^H_{b_j}\), \(j = 1,2,\ldots ,t^H\);

value combinations of attributes \(w^H_{b_j}\), \(j = 1,2,\ldots ,t^H\), can be used to determine membership grades \(\mu _{G}\left( {\mathbf {r}}^{H\left( i\right) }\right)\) of each record \({\mathbf {r}}^{H\left( i\right) } \in {\mathbf {M}}^H\), \(i = 1,2,\ldots , \rho\), in a group G, whose anonymity needs to be violated;

an adversary has access to \(\tilde{{\mathbf {M}}}\).
It is worth noting that it is not required to harmonize the parameter attribute \(w_p\) in the original microfile or its analog \(\tilde{w}_p\) in the auxiliary one. Throughout this paper, we will without loss of generality assume that \(w_p\) and \(\tilde{w}_p\) remain intact during the harmonization process.
The auxiliary quantity signal \({\mathbf {q}}^{aux}\) does not have to be close in a numerical sense to the original quantity signal \({\mathbf {q}}\); it is only required that outliers in \({\mathbf {q}}^{aux}\) correspond to those in \({\mathbf {q}}\).
Fuzzy rules in a fuzzy model of a group
In order to construct the auxiliary quantity signal as defined by (9), we need to calculate membership grades \(\mu _{G}\left( {\mathbf {r}}^{H\left( i\right) }\right)\) of each microfile record \({\mathbf {r}}^{H\left( i\right) } \in {\mathbf {M}}^H\), \(i = 1,2,\ldots ,\rho\). In general, this can be done using appropriate fuzzy rules.
Each linguistic variable \(L_j\) in the fuzzy rules, \(j = 1, 2, \ldots , t^H\), corresponds to the attribute \(w^H_{b_j}\), \(j = 1,2,\ldots , t^H\), in the harmonized microfile (\(\tilde{{\mathbf {M}}}^H\) or \({\mathbf {M}}^H\)). It has several values \(LL_j^k\), \(k = 1, 2, \ldots , l_{L_j}\), with their membership functions denoted by \(\mu _{LL_j^k}\). In addition, each linguistic variable by default has a value \(LL_j^0\) with the membership function \(\mu _{LL_j^0} \equiv 1\). If \(A_{ij} = LL_j^0\) is present in a fuzzy rule \(R_i\), it means that the actual value of attribute \(w^H_{b_j}\) is discarded. As pointed out in Ishibuchi et al. (1999), in this way we can obtain fuzzy rules of different generalization capacity.
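The effect of the default value \(LL_j^0\) can be illustrated with a short sketch. Everything below is an assumption made for illustration: the triangular membership functions, the attribute meanings, and the use of the product to combine grades. In particular, `antecedent_degree` is a hypothetical helper and not the compatibility measure used in this paper.

```python
from typing import Callable, Sequence

Membership = Callable[[float], float]


def dont_care(_x: float) -> float:
    """Membership function of the default value LL_j^0, identically equal to 1."""
    return 1.0


def triangular(a: float, b: float, c: float) -> Membership:
    """Triangular membership function with support (a, c) and peak at b."""
    def mu(x: float) -> float:
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu


def antecedent_degree(record: Sequence[float], antecedent: Sequence[Membership]) -> float:
    """Combine per-attribute grades with the product (an assumed choice)."""
    degree = 1.0
    for value, mu in zip(record, antecedent):
        degree *= mu(value)
    return degree


young = triangular(18, 25, 35)       # assumed term of a hypothetical attribute "age"
high = triangular(30, 60, 90)        # assumed term of a hypothetical attribute "income"

general_rule = [young, dont_care]    # the second attribute is discarded
specific_rule = [young, high]        # both attributes are taken into account

record = (30, 45)
general = antecedent_degree(record, general_rule)
specific = antecedent_degree(record, specific_rule)
```

Because the don't-care term contributes a factor of 1, the general rule matches every record at least as strongly as the specific rule does, which is one way to read the remark about different generalization capacity.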
For each linguistic variable, we can define a range \(\left[ l\left( L_j\right) , u\left( L_j\right) \right]\) of acceptable values of a corresponding base variable. All the records from \({\mathbf {M}}^H\) and \(\tilde{{\mathbf {M}}}^H\), whose values of attributes \(w^H_{b_j}\) lie outside the specified ranges, \(j = 1, 2, \ldots , t^H\), need to be removed. In order not to complicate the notation, we will further on assume that \({\mathbf {M}}^H\) and \(\tilde{{\mathbf {M}}}^H\) denote microfiles that contain only those records, whose attribute values lie inside corresponding ranges, unless specified otherwise. Similarly, we will further on assume that values \(\rho\) and \(\tilde{\rho }\) denote the total number of records in \({\mathbf {M}}^H\) and \(\tilde{{\mathbf {M}}}^H\), respectively, where \({\mathbf {M}}^H\) and \(\tilde{{\mathbf {M}}}^H\) denote either original microfiles or microfiles with records whose attribute values belong to specified ranges, depending on the context.
We say that a record \({\mathbf {r}}\) verifies the antecedent part of \(R_i\) if \(APC^{\alpha }\left( {\mathbf {r}}, R_i\right) > 0\), and that it is covered by \(R_i\) if additionally \({\mathbf {r}} \in G\).
In the context of violating group anonymity in terms of disclosing outliers in the auxiliary quantity signal, we are interested in cumulative classification properties of the fuzzy rules. In other words, we allow ourselves a certain degree of misclassification, as long as outliers in the auxiliary quantity signal obtained with the help of the fuzzy rules correspond to the ones in the original quantity signal.

a fuzzy rule should have reasonable discriminative capability:
$$\begin{aligned} \frac{\sum _{\tilde{{\mathbf {r}}}\in G} APC^{\alpha }\left( \tilde{{\mathbf {r}}}, R_i\right) }{\tilde{\rho }_v} > \frac{\sum _{\tilde{{\mathbf {r}}}\in \tilde{{\mathbf {M}}}^H} APC^{\alpha }\left( \tilde{{\mathbf {r}}}, R_i\right) }{\tilde{\rho }}, \end{aligned}$$(13)
which means that rule \(R_i\) classifies as belonging to the group G a disproportionally bigger number of auxiliary vital records than auxiliary records in general. We will introduce a discriminative factor defined by
$$\begin{aligned} DF\left( R_i\right) = \frac{\sum _{\tilde{{\mathbf {r}}}\in G} APC^{\alpha }\left( \tilde{{\mathbf {r}}}, R_i\right) }{\tilde{\rho }_v} - \frac{\sum _{\tilde{{\mathbf {r}}}\in \tilde{{\mathbf {M}}}^H} APC^{\alpha }\left( \tilde{{\mathbf {r}}}, R_i\right) }{\tilde{\rho }}, \end{aligned}$$(14)

a fuzzy rule should have reasonable relative confidence:
$$\begin{aligned} \frac{\sum _{\tilde{{\mathbf {r}}}\in G} APC^{\alpha }\left( \tilde{{\mathbf {r}}}, R_i\right) }{\sum _{\tilde{{\mathbf {r}}}\in \overline{G}} APC^{\alpha }\left( \tilde{{\mathbf {r}}}, R_i\right) } \ge \gamma , \end{aligned}$$(15)
which means that \(R_i\) incorrectly classifies no more than \(\frac{\sum _{\tilde{{\mathbf {r}}}\in G} APC^{\alpha }\left( \tilde{{\mathbf {r}}}, R_i\right) }{\gamma }\) records as belonging to G, where \(\gamma\) will be called the relative confidence threshold. We will introduce the relative confidence factor defined by
$$\begin{aligned} RCF\left( R_i\right) = \frac{\sum _{\tilde{{\mathbf {r}}}\in G} APC^{\alpha }\left( \tilde{{\mathbf {r}}}, R_i\right) }{\sum _{\tilde{{\mathbf {r}}}\in \overline{G}} APC^{\alpha }\left( \tilde{{\mathbf {r}}}, R_i\right) }. \end{aligned}$$(16)
It can be recognized that the minuend from (14) is a fuzzy version of a well-known quality measure called support, and the subtrahend is a fuzzy version of another quality measure called coverage (Lavrač et al. 2004). Support considers the number of examples satisfying both the antecedent and the consequent parts of the rule, whereas coverage measures the percentage of examples covered on average by one rule.
It can also be recognized that (16) resembles the quality measure called confidence introduced in Jesus et al. (2007). However, our version differs in the denominator. Classically, the division is performed over the sum of the degree of membership of all the records that verify the antecedent part of this rule, whereas in our version we consider only those records that verify the antecedent part of the rule and don’t belong to G. In our view, this makes interpretation of this quality measure more tractable, because it can be easily assessed how many respondents the rule classifies incorrectly, in relative terms.
In a fuzzy model of a group, each rule \(R_i\) needs to have quality measures with the following properties: \(DF\left( R_i\right) > 0\), \(RCF\left( R_i\right) \ge \gamma\). In this case, we will reduce misclassifications, and thereby obtain a more suitable auxiliary quantity signal.
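The two quality measures can be computed directly from the per-record values of \(APC^{\alpha }\). The following is a minimal sketch, assuming these values are already available as a list; the function names and the handling of a zero denominator in (16) are illustrative assumptions, not part of the method as published.

```python
def quality_measures(apc, in_group):
    """Return (DF, RCF) of one rule, following (14) and (16).

    apc      -- APC values of the rule for every auxiliary record
    in_group -- parallel booleans, True if the record belongs to G
    The group is assumed to be non-empty.
    """
    n_total = len(apc)
    n_group = sum(in_group)
    apc_group = sum(a for a, g in zip(apc, in_group) if g)
    apc_rest = sum(a for a, g in zip(apc, in_group) if not g)

    support = apc_group / n_group                 # minuend of (14)
    coverage = (apc_group + apc_rest) / n_total   # subtrahend of (14)
    df = support - coverage
    # Assumption: a rule that fires on no record outside G gets RCF = +inf.
    rcf = apc_group / apc_rest if apc_rest > 0 else float("inf")
    return df, rcf


def is_acceptable(apc, in_group, gamma):
    """Check the two conditions DF > 0 and RCF >= gamma."""
    df, rcf = quality_measures(apc, in_group)
    return df > 0 and rcf >= gamma
```

For example, with four auxiliary records of which the first two belong to G, `quality_measures([0.5, 0.5, 0.0, 0.25], [True, True, False, False])` gives a support of 0.5, a coverage of 0.3125, and hence a positive discriminative factor.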
As mentioned earlier, due to complicated interrelations between different rules in the rule base, it is virtually impossible to construct the rule base from scratch using only expert knowledge. In the sections that follow, we will present an appropriately tailored evolutionary algorithm for solving this task.
Adequacy of the fuzzy model of a group
In this section, we will briefly discuss possible tests for evaluating adequacy of the fuzzy model of the group described above. By adequacy of the fuzzy model we will understand its ability to correctly determine outliers in the quantity signal, i.e., how similar the outliers in the original and auxiliary quantity signals are. It therefore seems natural to evaluate model adequacy using tests designed to evaluate accuracy of classifiers.
Let \(X = {\mathbb {R}}^n\) be the multidimensional pattern space under investigation, each element \({\mathbf {x}}\in X\) of which belongs to one of the two classes from the set \(Y = \left\{ C_1, C_2\right\}\). Let \(P_{XY}\) be the unknown joint distribution over \(X \times Y\). Let us be given a classifier \(f: X \rightarrow Y\) that maps each pattern \({\mathbf {x}}\in X\) to a certain class. Let \(\epsilon = E_{XY}\left[ f\left( x\right) \ne y\right]\) be the classifier error, where E is the expectation operator.
The sum of values of (18) is m. Let us denote by e the number of incorrectly classified patterns. Then, \(\sum _i Z_{ii} = m - e\).
Guidelines for the interpretation of MB in terms of the strength of evidence in favor of \(H_1\) against \(H_0\)
MB  \(<\)0  0–1  1–3  3–5  \(>\)5 

\(H_1\) strength  Negative  Bare mention  Positive  Strong  Decisive 
In the context of evaluating the adequacy of the fuzzy model of a given group, the pattern space has to be taken as a set of parameter values: \(X = {\mathbf {P}}\). Class \(C_1\) contains those parameter values that correspond to outliers in \({\mathbf {q}}\), \(C_2\) contains all the other parameter values.

some of the outliers in \({\mathbf {q}}\) don’t have a correspondence in \({\mathbf {q}}^{aux}\), i.e., we cannot violate anonymity of some of the outliers (type II errors). We will call such outliers undisclosed outliers;

some of the outliers in \({\mathbf {q}}^{aux}\) don’t have a correspondence in \({\mathbf {q}}\), i.e., the fuzzy rules introduce additional outliers not supported by real data (type I errors). We will call such outliers false outliers.
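The two kinds of errors listed above reduce to set differences between the outlier positions of the two signals. A minimal sketch, assuming outliers have already been detected and are given as collections of parameter values (the function name and the dictionary keys are illustrative):

```python
def compare_outliers(out_q, out_aux):
    """Split outlier positions into disclosed, undisclosed and false ones.

    out_q   -- parameter values that are outliers in the quantity signal
    out_aux -- parameter values that are outliers in the auxiliary signal
    """
    out_q, out_aux = set(out_q), set(out_aux)
    return {
        "disclosed": out_q & out_aux,
        "undisclosed": out_q - out_aux,  # type II errors
        "false": out_aux - out_q,        # type I errors
    }
```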

\(TP = \left| OUT_e\left( {\mathbf {q}}\right) \cap OUT_e\left( {\mathbf {q}}^{aux}\right) \right|\);

\(FP = \left| OUT_e\left( {\mathbf {q}}\right) \cap OUT_e^{\prime }\left( {\mathbf {q}}^{aux}\right) \right|\);

\(FN = \left| OUT_e^{\prime }\left( {\mathbf {q}}\right) \cap OUT_e\left( {\mathbf {q}}^{aux}\right) \right|\);

\(TN = \left| OUT_e^{\prime }\left( {\mathbf {q}}\right) \cap OUT_e^{\prime }\left( {\mathbf {q}}^{aux}\right) \right|\).
General approach to applying fuzzy rules to violating group anonymity
 1.
Harmonization Choose a microfile \({\mathbf {M}}\) and determine a group G of records, whose distribution should be disclosed. Choose an auxiliary microfile \(\tilde{{\mathbf {M}}}\) that satisfies all the conditions given earlier. Perform harmonization of \({\mathbf {M}}\) and \(\tilde{{\mathbf {M}}}\) and obtain harmonized microfiles \({\mathbf {M}}^H\) and \(\tilde{{\mathbf {M}}}^H\) that have identical attributes with two exceptions: parameter attributes in both harmonized microfiles may not be identical, and \(\tilde{{\mathbf {M}}}^H\) contains auxiliary vital attributes, whereas \({\mathbf {M}}^H\) has vital attributes removed.
 2.
Input Variables Identification For each linguistic variable \(L_j\) corresponding to a basic harmonized attribute \(w^H_{b_j}\), \(j = 1,2,\ldots , t^H\), define a range of values of its base variable \(\left[ l\left( L_j\right) , u\left( L_j\right) \right]\). Remove from \({\mathbf {M}}^H\) and \(\tilde{{\mathbf {M}}}^H\) records whose values of attributes \(w^H_{b_j}\) lie outside the specified ranges, \(j = 1,2, \ldots , t^H\). Use expert judgment to determine the fuzzy values \(LL_j^k\) for each linguistic variable \(L_j\), \(j = 1,2,\ldots , t^H\), \(k = 1,2, \ldots , l_{L_j}\), defined by appropriate membership functions denoted by \(\mu _{LL_j^k}\).
 3.
Evolution Use the evolutionary algorithm to evolve fuzzy rules for violating anonymity of G in \({\mathbf {M}}^H\) based on the data from \(\tilde{{\mathbf {M}}}^H\). To reduce the number of undisclosed and false outliers, select only those rules R, for which \(DF\left( R\right) > 0\) and \(RCF\left( R\right) \ge \gamma\), and whose support is greater than a predefined value \(\kappa\). To reduce computational overhead, remove rules that are more specific versions of other rules in the set, i.e., for each pair of rules \(R_i\) and \(R_j\), if \(\forall k \quad A_{ik} \ne A_{jk} \rightarrow A_{ik} = LL_k^0\), remove \(R_j\). Using the fuzzy rules obtained, assign membership grades to all the records in \({\mathbf {M}}^H\), uniting the results in the fuzzy sense.
 4.
Disclosing Outliers Construct the auxiliary quantity signal (9) and determine outliers in it.
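The pruning condition in the Evolution step can be sketched as follows. Rules are assumed to be encoded as integer tuples in which 0 stands for the value \(LL_k^0\); the function names are illustrative.

```python
def generalizes(general, specific):
    """True if, wherever the two rules differ, `general` holds the value 0."""
    return all(g == s or g == 0 for g, s in zip(general, specific))


def prune_specific_rules(rules):
    """Drop every rule that is a more specific version of another rule."""
    unique = list(dict.fromkeys(rules))  # remove exact duplicates, keep order
    return [
        r for r in unique
        if not any(other != r and generalizes(other, r) for other in unique)
    ]
```

For instance, `prune_specific_rules([(1, 0, 3), (1, 2, 3), (0, 0, 3), (2, 1, 1)])` keeps only `(0, 0, 3)` and `(2, 1, 1)`, because the first two rules differ from `(0, 0, 3)` only where the latter ignores the attribute.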
Evolutionary algorithm for building the fuzzy model of a group
Outline of the evolutionary algorithm
In the proposed algorithm, whose outline corresponds to the outline presented in Ishibuchi et al. (1995), we perform evolution only at the level of fuzzy rules. This means that we do not perform any fine-tuning of membership functions of input variables. We choose this approach to preserve comprehensibility for humans of the fuzzy rules in the system.
 1.
Randomly generate initial population \({\mathbf {R}} = \left\{ R_i \right\}\) of \(\mu\) individuals, \(i = 1,2,\ldots , \mu\).
 2.
Calculate values of the fitness function for each individual: \(f\left( R_i\right)\), \(i = 1,2,\ldots , \mu\).
 3.
Check termination condition: if it is satisfied, stop; continue otherwise.
 4.
Select \(\lambda\) pairs of individuals and put them into set \({\mathbf {R}}^{\prime }\).
 5.
Recombine pairs of individuals from \({\mathbf {R}}^{\prime }\) with a recombination operator \(REC\left( R_i, R_j\right)\), \(i = 1,2,\ldots , \lambda\), \(j = \lambda + 1, \ldots , 2\cdot \lambda\). Put the offspring into set \({\mathbf {R}}^{\prime \prime }\).
 6.
Mutate individuals from \({\mathbf {R}}^{\prime \prime }\) with a mutation operator \(MUT\left( R_j\right)\), \(j = 1,2,\ldots , \lambda\).
 7.
Replace \(\lambda\) individuals from \({\mathbf {R}}\) that have the lowest fitness values with the mutated offspring.
 8.
Go to step 3.
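The eight steps above can be sketched as a steady-state loop. In this sketch every operator is passed in as a callable with a hypothetical signature, and termination is a fixed number of generations. Since the outline mutates \(\lambda\) offspring while recombination may return more than one child per pair, the sketch keeps only the first \(\lambda\) children, which is one possible reading.

```python
import random


def evolve(init, fitness, select_pairs, recombine, mutate,
           mu, lam, generations, rng=None):
    """Steady-state evolutionary loop (steps 1-8 of the outline).

    Hypothetical operator signatures:
      init(rng) -> individual
      fitness(individual) -> float
      select_pairs(population, scores, lam, rng) -> list of (parent, parent)
      recombine(a, b, rng) -> sequence of offspring
      mutate(individual, rng) -> individual
    """
    rng = rng or random.Random()
    population = [init(rng) for _ in range(mu)]               # step 1
    for _ in range(generations):                              # steps 3 and 8
        scores = [fitness(r) for r in population]             # step 2
        pairs = select_pairs(population, scores, lam, rng)    # step 4
        offspring = [c for a, b in pairs
                     for c in recombine(a, b, rng)][:lam]     # step 5
        offspring = [mutate(c, rng) for c in offspring]       # step 6
        # step 7: overwrite the least fit individuals with the offspring
        worst = sorted(range(mu), key=scores.__getitem__)[:lam]
        for idx, child in zip(worst, offspring):
            population[idx] = child
    return population
```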
Representation and fitness function
In this work, we treat each individual \(R_i \in {\mathbf {R}}\), \(i = 1,2,\ldots , \mu\), as a single rule in the fuzzy rule set being evolved. That is, the whole population constitutes the whole fuzzy rule set, in full accordance with the Michigan approach.
Availability of values \(LL_j^0\), \(j = 1,2,\ldots , t^H\), in \(R_i\) enables us to evolve rules that don’t take into account values of the attribute \(w^H_{b_j}\). In other words, the evolutionary process can lead to obtaining more generalized rules.
Other algorithm parameters
Operator \(REC\left( R_{i_1}, R_{i_2}\right)\) should be a proper recombination operator for integer representation applied with a high probability \(p_c\) to two individuals \(R_{i_1}\) and \(R_{i_2}\) that yields two offspring individuals \(R_{j_1}\) and \(R_{j_2}\). Operator \(MUT\left( R\right)\) should be a proper mutation operator for integer representation applied with a low probability \(p_m\) to a single individual R that yields the mutated one \(R^{\prime }\).

we will choose tournament selection (Brindle 1981) as an efficient and easy to implement selection operator, with the tournament size 10;

we will create initial populations by randomly generating values of each fuzzy rule element \(R_{ij}\), \(i = 1,2,\ldots , \mu\), \(j = 1,2,\ldots , t^H\), from a uniform distribution on \(\left[ 0, l_{L_j}\right]\);

we will choose the number of generations N as a termination condition, i.e., we will terminate the algorithm after having obtained N consecutive populations.
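The choices above can be illustrated with a short sketch for the integer representation. Tournament selection and uniform initialization follow the text; one-point crossover and per-gene random resetting are assumptions made for the sake of the example, since the text only requires "proper" operators for integer representation. All function names are illustrative.

```python
import random


def random_rule(n_values, rng):
    """Draw each gene uniformly from {0, ..., l_j}; 0 encodes LL_j^0."""
    return tuple(rng.randint(0, l_j) for l_j in n_values)


def tournament(population, scores, size, rng):
    """Return the fittest of `size` individuals drawn with replacement."""
    contenders = [rng.randrange(len(population)) for _ in range(size)]
    return population[max(contenders, key=scores.__getitem__)]


def one_point_crossover(a, b, p_c, rng):
    """Exchange the tails of two rules with probability p_c (assumed operator)."""
    if rng.random() >= p_c:
        return a, b
    k = rng.randint(1, len(a) - 1)  # requires rules with at least two genes
    return a[:k] + b[k:], b[:k] + a[k:]


def random_reset(rule, n_values, p_m, rng):
    """Redraw each gene with probability p_m (assumed per-gene application)."""
    return tuple(
        rng.randint(0, l_j) if rng.random() < p_m else gene
        for gene, l_j in zip(rule, n_values)
    )
```

Here `n_values` lists the number of fuzzy values \(l_{L_j}\) of each linguistic variable, so a rule over three variables with 5, 2, and 2 values would be drawn with `random_rule((5, 2, 2), random.Random())`.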
Memetic algorithm for protecting group distributions
General information
In previous sections, we have shown that the TPGA is a pressing task, and that group distributions need to be protected even when vital attributes are removed from the microfile. In this section, we will discuss the memetic algorithm (MA) for solving the task of providing group anonymity. This algorithm was introduced in Chertov and Tavrov (2014), and we will heavily rely on that publication when presenting the algorithm here.
We will assume that the data publisher decides to remove vital attributes from the microfile. As pointed out before, to provide group anonymity, we need to mask outliers in an auxiliary quantity signal obtained using appropriate fuzzy rules.
 1.
Prepare a (depersonalized) microfile \({\mathbf {M}}\) representing data to be anonymized.
 2.
Define groups of respondents \(G_i\left( {\mathbf {V}}_i, {\mathbf {P}}_i\right)\), whose quantity signals need to be masked, \(i = 1,2,\ldots , k\).
 3.For each i from 1 to k:
 (a)
Build the quantity signal \({\mathbf {q}}_i\) for \(G_i\).
 (b)
Obtain fuzzy models of \(G_i\) using the evolutionary algorithm.
 (c)
Build the auxiliary quantity signal \({\mathbf {q}}^{aux}_i\) for \(G_i\) using the obtained fuzzy models, and the corresponding crisp auxiliary quantity signal \({\mathbf {q}}^{aux}_{\text {crisp}_i}\).
 (d)
Compare two signals and determine whether there is risk of violating group anonymity in terms of disclosing their outliers.
 (e)
If there is such risk, define the modifying transformation \(A: {\mathbf {q}}^{aux}_{\text {crisp}_i} \left( {\mathbf {M}}, G_i\right) \rightarrow {\mathbf {q}}^{aux*}_{\text {crisp}_i}\left( {\mathbf {M}}^{*}, G_i\right)\), obtain the modified crisp auxiliary quantity signal \({\mathbf {q}}^{aux*}_{\text {crisp}_i}\), and hence the modified microfile \({\mathbf {M}}^{*}\).
 4.
Prepare the modified microfile \({\mathbf {M}}^{*}\) for publishing.
In order to modify the auxiliary quantity signal for a given group in a given microfile, we need to physically alter some of the values in the microfile, more precisely, alter parameter values for certain records. To preserve the number of records with a particular parameter value, the records have to be altered in pairs, which can be interpreted as swapping the records between submicrofiles. One record needs to belong to the fuzzy model of a group, and another needs not to.
As mentioned before, to solve the TPGA means not only to modify the auxiliary quantity signal, but also to introduce as little distortion into the microfile as possible. To this end, the records being swapped have to be close to each other in some sense. In this work, we will apply the influential metric (Chertov 2010) to determine the degree of similarity between two microfile records. This metric is defined in terms of so-called influential attributes, i.e., those whose distribution is important for further research using microfile data. In this work, we will assume that influential attributes are the same as the basic harmonized attributes.
Preserving data utility from the minimal data distortion point of view is a task of high complexity and dimensionality. It is therefore reasonable to use MAs (Moscato 1989) to solve the TPGA. MAs are typically implemented as evolutionary algorithms with local search procedures (Eiben and Smith 2015, p. 173). New applications of MAs to solving complex optimization tasks can be found in Kumar et al. (2014).
Outline of the algorithm
 1.
Create population P of \(\mu\) individuals, apply to them local search operator S.
 2.
Calculate fitness function \(f\left( {\mathbf {x}}\right)\) for each individual \({\mathbf {x}} \in P\).
 3.
Check termination condition. If it holds, stop; otherwise, go to 4.
 4.
Select \(\lambda\) pairs of parents.
 5.
Apply recombination operator R to each parent pair.
 6.
Apply mutation operator M to each offspring. Put the offspring into \(P^{\prime }\).
 7.
Apply local search operator S to each individual \({\mathbf {x}} \in P^{\prime }\).
 8.
Calculate fitness function \(f\left( {\mathbf {x}}\right)\) for each individual \({\mathbf {x}} \in P^{\prime }\).
 9.
Select \(\mu\) individuals from \(P \cup P^{\prime }\), put them into P in place of current ones.
 10.
Go to 3.
In the algorithm outline above, several symbols introduced earlier (such as \(\mu\), R, and M) are reused with a different meaning. Their meaning in each particular case should be clear from the context.
 1.
The first column contains indexes \(u_{i1}\; \forall i = 1,2,\ldots , Q\) of submicrofiles to remove vital records from. The user has to define the set of such submicrofiles.
 2.
The third column contains indexes \(u_{i3} \; \forall i = 1,2,\ldots , Q\) of submicrofiles to add vital records to. The user has to define the set of such submicrofiles.
 3.
The second column contains indexes \(u_{i2} \; \forall i = 1,2,\ldots , Q\) of the records from \({\mathbf {M}}_{u_{i1}}\) to be removed.
 4.
The fourth column contains indexes \(u_{i4} \; \forall i = 1,2,\ldots , Q\) of the records from \({\mathbf {M}}_{u_{i3}}\) to be swapped with the ones defined by \(u_{i2}\).
By its nature, each individual U uniquely defines the modified quantity signal \({\mathbf {q}}^{*}\), and also determines the particular way of obtaining it, because each row in U defines a particular pair of respondents to be swapped. Thereby, each U defines a complete solution to the TPGA at hand.

a submicrofile index i can occur in the first column of U not more than \(q_i\) times;

each pair \(\langle u_{i1}, u_{i2}\rangle\) or \(\langle u_{i3}, u_{i4}\rangle \; \forall i=1,2,\ldots , Q\) cannot occur in U more than once.
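The two constraints above can be checked mechanically. A minimal sketch, assuming an individual is stored as a list of four-element rows and the limits \(q_i\) are given as a dictionary; treating removal pairs and insertion pairs as one pool is one reading of the second constraint, not something the text states explicitly.

```python
from collections import Counter


def is_valid(U, q):
    """Check an individual against the two constraints.

    U -- list of rows (u1, u2, u3, u4)
    q -- dict mapping a submicrofile index i to its limit q_i
    """
    # Constraint 1: index i occurs in the first column at most q_i times.
    first_column = Counter(row[0] for row in U)
    if any(count > q.get(i, 0) for i, count in first_column.items()):
        return False
    # Constraint 2 (assumed pooled): no (submicrofile, record) pair repeats.
    pairs = [(r[0], r[1]) for r in U] + [(r[2], r[3]) for r in U]
    return len(set(pairs)) == len(pairs)
```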
Other terms of the fitness function can be chosen depending on the TPGA at hand.
In this work, we use the following recombination operator \(R\left( U_{i_1}, U_{i_2}\right)\). It generates two random crossover points \(k_1\in \left[ 0, Q_{i_1}\right]\) and \(k_2\in \left[ 0, Q_{i_2}\right]\), splits each parent at appropriate points, exchanges the tails between them, and thus creates the offspring. This operator has to be applied with a high probability \(p_c\).
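The tail exchange can be sketched as follows for individuals stored as lists of rows. The function name is illustrative, and the offspring are not checked against the validity constraints here; such a check or a repair step is assumed to happen elsewhere.

```python
import random


def recombine(u1, u2, p_c, rng):
    """Exchange tails of two variable-length individuals with probability p_c."""
    if rng.random() >= p_c:
        return list(u1), list(u2)
    k1 = rng.randint(0, len(u1))  # crossover point in the first parent
    k2 = rng.randint(0, len(u2))  # crossover point in the second parent
    return u1[:k1] + u2[k2:], u2[:k2] + u1[k1:]
```

Because the two crossover points are drawn independently, the offspring may have lengths different from those of their parents, while the total number of rows is preserved.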
 1.
\(M_1\) is a swap mutation operator (Syswerda 1991) applied with a small probability \(p_{m_1}\) to the first column of U. Each pair \(\left\langle u_{i1}, u_{i2} \right\rangle\) needs to be preserved \(\forall i = 1,2,\ldots , Q\).
 2.
\(M_2\) is also a swap mutation operator applied with a small probability \(p_{m_2}\) to the third column of U. Each pair \(\left\langle u_{i3}, u_{i4} \right\rangle\) needs to be preserved \(\forall i = 1,2,\ldots , Q\).
 3.
\(M_3\) is a random resetting mutation operator (Eiben and Smith 2015, p. 43) applied with a small probability \(p_{m_3}\) to the second column of U.
 4.
\(M_4\) is a random resetting mutation operator applied with a small probability \(p_{m_4}\) to the fourth column of U.
 1.
Carry out steps 2–4 \(\forall i = 1,2,\ldots , Q\).
 2.
Generate a uniformly distributed number \(r\in \left[ 0, 1\right]\).
 3.
If \(r \le p_{mem}\), assign to \(u_{i4}\) the index of a record from \({\mathbf {M}}_{u_{i3}}\) closest to the record defined by \(u_{i2}\) from \({\mathbf {M}}_{u_{i1}}\) in terms of (25). Otherwise, assign to \(u_{i2}\) the index of a record from \({\mathbf {M}}_{u_{i1}}\) closest to the record defined by \(u_{i4}\) from \({\mathbf {M}}_{u_{i3}}\) in terms of (25).
 4.
Go to step 2.
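The local search procedure above can be sketched as follows. The metric (25) is passed in as a hypothetical callable `distance`, each submicrofile is assumed to be a list already restricted to records eligible for swapping, and uniqueness of the resulting pairs is not enforced in this sketch.

```python
import random


def local_search(U, submicrofiles, distance, p_mem, rng):
    """Replace one record of each swap pair by the closest available one.

    U             -- list of rows (u1, u2, u3, u4)
    submicrofiles -- dict: submicrofile index -> list of records
    distance      -- callable(record, record) -> float, standing in for (25)
    """
    result = []
    for u1, u2, u3, u4 in U:
        if rng.random() <= p_mem:
            # Keep the removed record, pick its nearest partner in M_{u3}.
            target, pool = submicrofiles[u1][u2], submicrofiles[u3]
            u4 = min(range(len(pool)), key=lambda k: distance(target, pool[k]))
        else:
            # Keep the partner, pick the nearest record to remove from M_{u1}.
            target, pool = submicrofiles[u3][u4], submicrofiles[u1]
            u2 = min(range(len(pool)), key=lambda k: distance(target, pool[k]))
        result.append((u1, u2, u3, u4))
    return result
```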
Other MA components, such as selection, initialization, termination, population size etc. should be chosen individually for each TPGA to be solved.
Results
Problem definition and microfile harmonization
To illustrate ideas developed in this work, we decided to set a task of violating anonymity of a group of regionally distributed military personnel in the U.S. Outliers in quantity signals representing such a distribution might point to sites of military facilities, some of which might potentially be classified.
We decided to choose the 1 % sample microfile of the American Community Survey (ACS) conducted in 2013 available from the IPUMS-International Project (Ruggles et al. 2010) as the microfile \({\mathbf {M}}\) we would like to violate group anonymity in. This microfile contains \(\rho = 1{,}380{,}924\) records.
The microfile contains attributes Place of work: state, 1980 onward and Place of work: PUMA, 2000 onward (where PUMA stands for Public Use Microdata Area), that, if concatenated, give a unique code of a PUMA where a respondent works. We decided to replace these attributes with a single one called Place of work by concatenating the values of the attributes for each microfile record. The newly obtained attribute plays the role of the parameter attribute for our task.
The microfile also contains \(l = 1\) vital attribute Occupation, SOC classification (where SOC stands for the 2010 Standard Occupational Classification system), which enables us to uniquely identify all the military personnel \({\mathbf {M}}_v\) in the microfile, \(\rho _v = 5{,}519\).
We decided to choose the 5 % sample microfile of the 2000 U.S. Census also available from the IPUMS-International Project (Ruggles et al. 2010) as the auxiliary microfile \(\tilde{{\mathbf {M}}}\). This microfile contains \(\tilde{\rho } = 6{,}309{,}848\) records. Since this microfile also contains attributes Place of work: state, 1980 onward and Place of work: PUMA, 2000 onward, we decided to replace them with the Place of work attribute in the same way as described above.

records in \(\tilde{{\mathbf {M}}}\) and in \({\mathbf {M}}\) are drawn from sufficiently similar distributions under the assumption that demographics of respondents in both microfiles haven’t changed much over 13 years;

\(\tilde{{\mathbf {M}}}\) contains an auxiliary vital attribute Occupation, SOC classification, identical to the vital attribute in \({\mathbf {M}}\) in terms of military occupations. Vital records in \({\mathbf {M}}\) and auxiliary vital ones in \(\tilde{{\mathbf {M}}}\) are drawn from sufficiently similar distributions under the assumption that demographics of military personnel haven’t changed much over 13 years. There are \(\tilde{\rho }_v = 19{,}149\) auxiliary vital records in \(\tilde{{\mathbf {M}}}_v\);

\({\mathbf {M}}\) and \(\tilde{{\mathbf {M}}}\) contain almost identical attributes, with the exception of several technical ones. In our example, we performed the following harmonization:

we replaced the Occupation, SOC classification attribute in both microfiles with a new one Military Personnel, which has only two values, 0 and 1. The value 1 was assigned only to those records that had one of the values of attribute Occupation, SOC classification presented in Table 2;

we removed all attributes from both microfiles except for Military Personnel, Place of work, and \(t^H = 13\) basic harmonized attributes, which we consider to be useful for building a fuzzy model of a group.

Values of the Occupation, SOC classification attribute that correspond to the value 1 of the harmonized attribute Military Personnel
Attribute value  Interpretation 

551,010  Military Officer Special and Tactical Operations Leaders 
552,010  First-Line Enlisted Military Supervisors 
553,010  Military Enlisted Tactical Operations and Air/Weapons Specialists and Crew Members 
559,830  Military, Rank Not Specified 
Basic harmonized attributes used in the practical example
Index  Name  Type  Values 

\(b_1\)  Age  O  000—Less than 1 year old, \(1\dots 130\)—1 to 130 years, 135—135 
\(b_2\)  Educational attainment [general version]  C  00—N/A or no schooling, 01—Nursery school to grade 4, 02—Grade 5, 6, 7, or 8, 03—Grade 9, 04—Grade 10, 05—Grade 11, 06—Grade 12, 07—1 year of college, 08—2 years of college, 09—3 years of college, 10—4 years of college, 11—5+ years of college 
\(b_3\)  Sex  C  1—Male, 2—Female 
\(b_4\)  Race [general version]  C  1—White, 2—Black/Negro, 3—American Indian or Alaska Native, 4—Chinese, 5—Japanese, 6—Other Asian or Pacific Islander, 7—Other race, nec, 8—Two major races, 9—Three or more major races 
\(b_5\)  Usual hours worked per week  O  00—N/A, \(01\ldots 98\)—1 to 98 h worked per week, 99—99 (Topcode) 
\(b_6\)  Hispanic origin [general version]  C  0—Not Hispanic, 1—Mexican, 2—Puerto Rican, 3 —Cuban, 4—Other, 9—Not Reported 
\(b_7\)  Marital status  C  1—Married, spouse present, 2—Married, spouse absent, 3—Separated, 4—Divorced, 5—Widowed, 6—Never married/single 
\(b_8\)  Means of transportation to work  C  00—N/A, 10—Auto, truck, or van, 11—Auto, 12—Driver, 13—Passenger, 14—Truck, 15—Van, 20—Motorcycle, 30—Bus or streetcar, 31—Bus or trolley bus, 32—Streetcar or trolley car, 33—Subway or elevated, 34—Railroad, 35—Taxicab, 36—Ferryboat, 40—Bicycle, 50—Walked only, 60—Other, 70—Worked at home 
\(b_9\)  Time of departure for work  O  0000—N/A, other values report the time usually leaving for work last week (12:01 a.m. is coded as 0001, and 11:59 p.m. is coded as 2359) 
\(b_{10}\)  Travel time to work  O  000—N/A, other values are amounts of time, in minutes, it took to get to work last week 
\(b_{11}\)  Weeks worked last year, intervalled  C  0—N/A, 1—1–13 weeks, 2—14–26 weeks, 3—27–39 weeks, 4—40–47 weeks, 5—48–49 weeks, 6—50–52 weeks 
\(b_{12}\)  Total personal income  O  A 7-digit numeric code reporting each respondent’s total pre-tax personal income or losses from all sources for the previous year 
\(b_{13}\)  Speaks English  C  0—N/A (Blank), 1 —Does not speak English, 2—Yes, speaks English..., 3—Yes, speaks only English, 4—Yes, speaks very well, 5—Yes, speaks well, 6 —Yes, but not well, 7—Unknown, 8—Illegible 
Input variables identification
Ranges of acceptable values for each linguistic variable in the practical example
Name of \(L_j\)  \(l\left( L_j\right)\)  \(u\left( L_j\right)\) 

Age  18  45 
Educational attainment [general version]  1  11 
Sex  1  2 
Race [general version]  1  2 
Usual hours worked per week  0  100 
Hispanic origin [general version]  0  9 
Marital status  1  6 
Means of transportation to work  0  70 
Time of departure for work  1  2359 
Travel time to work  1  119 
Weeks worked last year, intervalled  1  6 
Total personal income  0  200,000 
Speaks English  2  5 
After having removed all the records, whose basic harmonized attribute values don’t belong to the specified ranges, we obtained the microfiles \({\mathbf {M}}^H\) with \(\rho = 565,243\), \(\rho _v = 3,992\), and \(\tilde{{\mathbf {M}}}^H\) with \(\tilde{\rho } = 3,205,478\), \(\tilde{\rho }_v = 14,263\).
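The filtering step amounts to keeping only the records whose attribute values fall inside every range. A minimal sketch, assuming records are dictionaries and the bounds are inclusive; the field names are hypothetical and only three of the thirteen ranges from the table are shown.

```python
# Excerpt of the ranges table; keys are hypothetical field names.
RANGES = {
    "age": (18, 45),
    "travel_time": (1, 119),
    "speaks_english": (2, 5),
}


def within_ranges(record, ranges):
    """True if every listed attribute lies inside its range (bounds assumed inclusive)."""
    return all(low <= record[name] <= high
               for name, (low, high) in ranges.items())


def filter_records(records, ranges):
    """Keep only the records that satisfy all ranges."""
    return [r for r in records if within_ranges(r, ranges)]
```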
 1.Variable \(L_1\) has 5 fuzzy values:

Young, with the membership function$$\begin{aligned} \mu _{A_{1,1}}\left( x\right) = PIMF\left( x, 7.05, 15.40, 22.50, 27.18\right) ; \end{aligned}$$

Middle Aged 1, with \(\mu _{A_{1,2}}\left( x\right) = GAUSSMF\left( x, 2.0, 27.5\right) ;\)

Middle Aged 2, with \(\mu _{A_{1,3}}\left( x\right) = GAUSSMF\left( x, 2.0, 32.5\right) ;\)

Middle Aged 3, with \(\mu _{A_{1,4}}\left( x\right) = GAUSSMF\left( x, 2.0, 37.5\right) ;\)

Old, with \(\mu _{A_{1,5}}\left( x\right) = PIMF\left( x, 37.85, 42.50, 47.51, 54.84\right)\).

 2.Variable \(L_2\) has 2 fuzzy values:

Low, with \(\mu _{A_{2,1}}\left( x\right) = TRAPMF\left( x, 1, 1, 8, 10\right) ;\)

High, with \(\mu _{A_{2,2}}\left( x\right) = TRAPMF\left( x, 8, 10, 11, 11\right)\).

 3.Variable \(L_3\) has 2 fuzzy values:

Male, with \(\mu _{A_{3,1}}\left( x\right) = \left\{ \begin{array}{ll} 1, &{}\quad x = 1\\ 0, &{}\quad \text {otherwise} \end{array} \right.\)

Female, with \(\mu _{A_{3,2}}\left( x\right) = \left\{ \begin{array}{ll} 1, &{}\quad x = 2\\ 0, &{}\quad \text {otherwise} \end{array} \right.\)

 4.Variable \(L_4\) has 2 fuzzy values:

White, with \(\mu _{A_{4,1}}\left( x\right) = \left\{ \begin{array}{ll} 1, &{}\quad x = 1\\ 0, &{}\quad \text {otherwise} \end{array} \right.\)

Black, with \(\mu _{A_{4,2}}\left( x\right) = \left\{ \begin{array}{ll} 1, &{}\quad x = 2\\ 0, &{}\quad \text {otherwise} \end{array} \right.\)

 5.Variable \(L_5\) has 3 fuzzy values:

Low, with \(\mu _{A_{5,1}}\left( x\right) = PIMF\left( x, 0.0, 0.0, 29.9, 40.3\right) ;\)

Medium, with \(\mu _{A_{5,2}}\left( x\right) = GAUSSMF\left( x, 2.5, 40.0\right) ;\)

High, with \(\mu _{A_{5,3}}\left( x\right) = PIMF\left( x, 40.2, 50.1, 100.0, 100.0\right)\).

 6.Variable \(L_6\) has 2 fuzzy values:

No, with \(\mu _{A_{6,1}}\left( x\right) = \left\{ \begin{array}{ll} 1, &{}\quad x = 0\\ 0, &{}\quad \text {otherwise} \end{array} \right.\)

Yes, with \(\mu _{A_{6,2}}\left( x\right) = \left\{ \begin{array}{ll} 1, &{}\quad 1 \le x \le 9\\ 0, &{}\quad \text {otherwise} \end{array} \right.\)

 7.Variable \(L_7\) has 2 fuzzy values:

Married, with \(\mu _{A_{7,1}}\left( x\right) = \left\{ \begin{array}{ll} 1, &{}\quad 1 \le x \le 2\\ 0, &{}\quad \text {otherwise} \end{array} \right.\)

Not married, with \(\mu _{A_{7,2}}\left( x\right) = \left\{ \begin{array}{ll} 1, &{}\quad 3 \le x \le 6\\ 0, &{}\quad \text {otherwise} \end{array} \right.\)

 8.Variable \(L_8\) has 3 fuzzy values:

Car, with \(\mu _{A_{8,1}}\left( x\right) = \left\{ \begin{array}{ll} 1, &{}\quad 0 \le x \le 20\\ 0, &{}\quad \text {otherwise} \end{array} \right.\)

Public, with \(\mu _{A_{8,2}}\left( x\right) = \left\{ \begin{array}{ll} 1, &{}\quad 30 \le x \le 36\\ 0, &{}\quad \text {otherwise} \end{array} \right.\)

Walked, with \(\mu _{A_{8,3}}\left( x\right) = \left\{ \begin{array}{ll} 1, &{}\quad 40 \le x \le 50\\ 0, &{}\quad \text {otherwise} \end{array} \right.\)

 9.Variable \(L_9\) has 3 fuzzy values:

Night, with \(\mu _{A_{9,1}}\left( x\right) = PIMF\left( x, 1, 1, 530, 630\right) ;\)

Morning, with \(\mu _{A_{9,2}}\left( x\right) = PIMF\left( x, 530, 630, 800, 900\right) ;\)

Day, with \(\mu _{A_{9,3}}\left( x\right) = PIMF\left( x, 800, 900, 2359, 2359\right)\).

 10.Variable \(L_{10}\) has 3 fuzzy values:

Little, with \(\mu _{A_{10,1}}\left( x\right) = PIMF\left( x, 1, 1, 10, 15\right) ;\)

Medium, with \(\mu _{A_{10,2}}\left( x\right) = PIMF\left( x, 10, 15, 30, 45\right) ;\)

Much, with \(\mu _{A_{10,3}}\left( x\right) = PIMF\left( x, 35, 45, 120, 120\right)\).

 11.Variable \(L_{11}\) has 2 fuzzy values:

Abnormal, with \(\mu _{A_{11,1}}\left( x\right) = TRAPMF\left( x, 1, 1, 5, 6\right) ;\)

Normal, with \(\mu _{A_{11,2}}\left( x\right) = TRAPMF\left( x, 5, 6, 6, 6\right)\).

 12.Variable \(L_{12}\) has 3 fuzzy values:

Low, with \(\mu _{A_{12,1}}\left( x\right) = PIMF\left( x, 0, 0, 9000, 12000\right) ;\)

Medium, with \(\mu _{A_{12,2}}\left( x\right) = PIMF\left( x, 9000, 12000, 70000, 90000\right) ;\)

High, with \(\mu _{A_{12,3}}\left( x\right) = PIMF\left( x, 70000, 90000, 200000, 200000\right)\).

We decided not to define values for variable \(L_{13}\). Its range of acceptable values was used to remove unacceptable records from the microfiles, but the attribute itself was not involved in the fuzzy rules evolved using the evolutionary algorithm.
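The functions PIMF, GAUSSMF, and TRAPMF are not defined in this section. The sketch below assumes they follow the commonly used pi-shaped, Gaussian (parameter order \(\sigma\), c), and trapezoidal membership functions; the handling of coinciding breakpoints is likewise an assumption.

```python
import math


def gaussmf(x, sigma, c):
    """Gaussian membership function centred at c."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def trapmf(x, a, b, c, d):
    """Trapezoid rising on [a, b], flat on [b, c], falling on [c, d]."""
    if x < a or x > d:
        return 0.0
    if x < b:
        return (x - a) / (b - a)
    if x > c:
        return (d - x) / (d - c)
    return 1.0


def smf(x, a, b):
    """S-shaped spline rising from 0 at a to 1 at b (step if a >= b)."""
    if a >= b:
        return 1.0 if x >= (a + b) / 2.0 else 0.0
    if x <= a:
        return 0.0
    if x >= b:
        return 1.0
    if x <= (a + b) / 2.0:
        return 2.0 * ((x - a) / (b - a)) ** 2
    return 1.0 - 2.0 * ((x - b) / (b - a)) ** 2


def zmf(x, c, d):
    """Z-shaped spline falling from 1 at c to 0 at d (step if c >= d)."""
    if c >= d:
        return 1.0 if x <= (c + d) / 2.0 else 0.0
    return 1.0 - smf(x, c, d)


def pimf(x, a, b, c, d):
    """Pi-shaped function: S-shaped rise on [a, b] times Z-shaped fall on [c, d]."""
    return smf(x, a, b) * zmf(x, c, d)
```

Under these assumptions, an age of 20 has full membership in Young, since `pimf(20, 7.05, 15.40, 22.50, 27.18)` lies on the plateau between 15.40 and 22.50.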
Generating fuzzy rules by the evolutionary algorithm

the population size \(\mu\) was fixed at 100;

on each iteration, we replaced the \(\lambda = 40\) least fit individuals with new ones obtained by applying recombination and mutation operators;

we applied recombination operator with the probability \(p_c = 1.00\), and mutation operator with probability \(p_m = 0.05\);

we performed 10 separate runs of the evolutionary algorithm, each of which lasted for \(N = 100\) generations.
Fuzzy rules used in the example
R  DF  RCF  Support 

\(\left( 1, 0, 0, 0, 0, 0, 2, 3, 1, 1, 0, 2\right)\)  0.032  0.755  0.032 
\(\left( 1, 0, 0, 0, 3, 0, 0, 3, 1, 0, 0, 0\right)\)  0.031  0.787  0.031 
\(\left( 1, 0, 0, 0, 3, 1, 0, 3, 0, 0, 2, 1\right)\)  0.012  0.801  0.012 
\(\left( 1, 0, 0, 1, 0, 1, 0, 3, 1, 1, 1, 2\right)\)  0.010  0.781  0.010 
\(\left( 1, 0, 0, 1, 3, 0, 0, 3, 0, 0, 2, 1\right)\)  0.012  0.851  0.012 
\(\left( 1, 0, 1, 0, 0, 0, 0, 3, 1, 1, 0, 2\right)\)  0.034  0.840  0.034 
\(\left( 1, 0, 1, 0, 0, 0, 2, 3, 1, 1, 2, 0\right)\)  0.025  0.765  0.025 
\(\left( 1, 0, 1, 0, 3, 0, 2, 3, 2, 0, 0, 1\right)\)  0.018  0.931  0.018 
\(\left( 1, 0, 1, 0, 3, 1, 0, 3, 2, 0, 0, 1\right)\)  0.017  0.915  0.018 
\(\left( 1, 0, 1, 1, 0, 0, 0, 3, 1, 1, 2, 0\right)\)  0.025  0.754  0.026 
\(\left( 1, 0, 1, 1, 0, 0, 2, 3, 1, 0, 0, 2\right)\)  0.032  0.751  0.032 
\(\left( 1, 1, 0, 0, 3, 0, 2, 3, 2, 1, 0, 1\right)\)  0.018  0.951  0.018 
\(\left( 1, 1, 0, 0, 3, 1, 0, 3, 2, 0, 0, 1\right)\)  0.019  0.767  0.019 
\(\left( 1, 1, 1, 0, 3, 0, 0, 3, 2, 0, 2, 1\right)\)  0.009  1.876  0.009 
\(\left( 1, 1, 1, 0, 3, 0, 0, 3, 2, 1, 1, 1\right)\)  0.008  0.761  0.009 
\(\left( 1, 1, 1, 0, 3, 0, 2, 3, 0, 1, 2, 1\right)\)  0.010  1.325  0.010 
\(\left( 1, 1, 1, 0, 3, 0, 2, 3, 2, 1, 2, 0\right)\)  0.026  0.767  0.026 
\(\left( 1, 2, 1, 0, 0, 0, 0, 3, 1, 0, 0, 2\right)\)  0.002  0.914  0.002 
As we can see, all of these rules share one common characteristic: their value of variable \(L_8\) is Walked. In other words, all the respondents that the fuzzy rules classify as military personnel walked to work rather than using a car or other means of transportation. Judging from the values of the other variables, we can draw the general conclusion that these respondents are typically young males with medium yearly income.
Disclosing outliers in the group distribution using evolved fuzzy rules
To demonstrate how the evolved fuzzy rules can be used to disclose outliers in the quantity signal, we will first apply them to the auxiliary microfile, and then proceed to disclosing outliers in the quantity signals obtained for the main microfile.

in case of the quantity signal, the number of military personnel working in a corresponding PUMA;

in case of the auxiliary quantity signal, the sum of all membership grades assigned to the respondents in a corresponding PUMA by the evolved fuzzy rules.
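The two definitions above differ only in what is accumulated per PUMA: a count of actual group members versus a sum of membership grades. A minimal sketch follows, assuming a hypothetical record layout `(puma, is_military, attributes)` and a hypothetical `membership_grade` function standing in for the evolved fuzzy rules.

```python
from collections import defaultdict


def quantity_signals(records, membership_grade):
    """Return (quantity signal, auxiliary quantity signal), ordered by PUMA."""
    q = defaultdict(int)        # number of military personnel per PUMA
    q_aux = defaultdict(float)  # sum of membership grades per PUMA
    for puma, is_military, attributes in records:
        q[puma] += int(is_military)
        q_aux[puma] += membership_grade(attributes)
    pumas = sorted(q_aux)
    return [q[p] for p in pumas], [q_aux[p] for p in pumas]


# Hypothetical toy data: two PUMAs and a single attribute.
records = [
    (1, True, {"walked": True}),
    (1, True, {"walked": False}),
    (1, False, {"walked": True}),
    (2, False, {"walked": False}),
]


def toy_grade(attributes):
    # Stand-in for the grade assigned by the fuzzy rules.
    return 0.8 if attributes["walked"] else 0.1


q, q_aux = quantity_signals(records, toy_grade)
print(q, q_aux)
```

The auxiliary signal does not need the attribute that identifies the group, which is why it can still be computed after that attribute has been removed from the microfile.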
Results of applying the evolved fuzzy rules to the 2000 census microfile
State  Number of outliers in the quantity signal  Number of undisclosed outliers  Number of outliers in the auxiliary quantity signal  Number of false outliers 

Alabama  4  3  1  0 
Alaska  2  0  2  0 
Arizona  4  1  3  0 
California  4  0  4  0 
Colorado  2  0  2  0 
Connecticut  1  0  1  0 
Florida  7  4  4  1 
Georgia  5  1  5  1 
Hawaii  1  0  1  0 
Illinois  2  1  1  0 
Kansas  3  2  1  0 
Kentucky  2  0  2  0 
Louisiana  4  2  2  0 
Maryland  2  1  1  0 
Mississippi  2  1  1  0 
Missouri  2  1  1  0 
New Jersey  3  1  2  0 
New York  2  0  2  0 
North Carolina  4  2  2  0 
Ohio  4  3  1  0 
Oklahoma  3  2  1  0 
Pennsylvania  4  2  4  2 
Rhode Island  1  0  1  0 
South Carolina  6  1  5  0 
Tennessee  3  3  0  0 
Texas  7  2  5  0 
Virginia  9  3  6  0 
Washington  5  2  3  0 
Total  98  38  64  4 
Let us now discuss the results of applying the evolved fuzzy rules to the original microfile \({\mathbf {M}}^H\). In Fig. 2, we present both the quantity signal (solid line) and the auxiliary quantity signal (dashed line) for the state of New York. Values \(i = 1,2,\ldots , 38\) on the x axis stand for the \(i\hbox {th}\) PUMA of the state of New York. The list of PUMAs as of 2013 can be found on the IPUMS-International website (PUMAs 2010).
Results of applying the evolved fuzzy rules to the 2013 ACS microfile
State  Number of outliers in the quantity signal  Number of undisclosed outliers  Number of outliers in the auxiliary quantity signal  Number of false outliers 

Alabama  2  2  1  1 
Alaska  2  0  2  0 
Arizona  4  1  4  1 
California  3  1  2  0 
Colorado  2  0  2  0 
Connecticut  1  0  2  1 
Florida  7  5  3  1 
Georgia  7  3  4  0 
Hawaii  1  0  1  0 
Illinois  2  1  2  1 
Kansas  2  2  0  0 
Kentucky  2  1  1  0 
Louisiana  4  4  0  0 
Maryland  3  2  1  0 
Mississippi  1  0  1  0 
Missouri  2  2  0  0 
Nevada  1  0  1  0 
New Jersey  2  2  0  0 
New Mexico  2  2  0  0 
New York  2  0  2  0 
North Carolina  3  1  2  0 
Ohio  2  1  3  2 
Oklahoma  3  2  1  0 
South Carolina  4  1  3  0 
Texas  6  1  5  0 
Virginia  7  4  4  1 
Washington  4  1  3  0 
Total  81  39  50  8 
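The totals of the two tables above can be condensed into two simple rates. The sketch below is a supplementary illustration and not one of the tests (19), (20), or (22); it relies on the relation, satisfied by every row of both tables, that the outliers of the auxiliary signal are the disclosed outliers plus the false ones. The function name and the two rates are choices of this sketch.

```python
def disclosure_summary(outliers, undisclosed, aux_outliers, false_outliers):
    """Return (share of real outliers disclosed, share of false aux outliers)."""
    disclosed = outliers - undisclosed
    if aux_outliers != disclosed + false_outliers:
        raise ValueError("table row is inconsistent")
    return disclosed / outliers, false_outliers / aux_outliers


# "Total" rows of the two tables.
census_2000 = disclosure_summary(98, 38, 64, 4)
acs_2013 = disclosure_summary(81, 39, 50, 8)
print(census_2000, acs_2013)
```

On these totals, about 61% of the outliers are disclosed for the 2000 census data and about 52% for the 2013 ACS data, while the share of false outliers rises from about 6% to 16%.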
The values of the tests (19), (20), and (22) based on \({\mathbf {Z}}\) are as follows: \(PA = 0.930\), \(J = 0.775\), \(MB = 55.067\). All of them are lower than their counterparts calculated for the 2000 census data, because the fuzzy rules were evolved using the 2000 census data. Nevertheless, the presented values indicate that the evolved fuzzy rules are highly effective and generalize well.
Results of protecting group distributions using memetic algorithm
As discussed earlier, to mask the outliers in the signal, we need to reduce the values of the \(5\hbox {th}\) and the \(29\hbox {th}\) signal elements. We can achieve this by imposing fuzzy restrictions that steer the evolutionary process towards signals whose \(5\hbox {th}\) and \(29\hbox {th}\) values are not greater than 2.
To simplify matters, we considered all the basic attributes to be categorical, with the following parameters of (25): \(\gamma _k = 1 \; \forall k = 1,2,\ldots , 13\), \(\chi _1 = 1\), \(\chi _2 = 0\). The metric (25) defined this way gives the number of attribute values that need to be physically altered during one swap of records between the submicrofiles.
We decided to apply tournament selection (Brindle 1981) as an efficient and easy-to-implement selection operator, with tournament size 5. Other algorithm parameters were chosen as follows: \(\mu = 100\), \(\lambda = 40\), \(p_c = 1\), \(p_{m_1} = p_{m_2} = p_{m_3} = p_{m_4} = 0.001\), \(p_{mem} = 0.75\). We terminated the algorithm after 1000 consecutive populations had been obtained.
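Tournament selection with size 5 can be sketched in a few lines. The version below assumes that contestants are drawn without replacement and that a higher fitness value is better; neither detail is stated in the text, and the function name is hypothetical.

```python
import random


def tournament_select(population, fitness, size=5, rng=random):
    """Draw `size` distinct individuals and return the fittest of them."""
    contestants = rng.sample(population, size)
    return max(contestants, key=fitness)


# Toy usage: individuals are integers and fitness is the value itself.
population = list(range(100))
winner = tournament_select(population, fitness=lambda x: x,
                           rng=random.Random(1))
print(winner)
```

The operator needs only pairwise comparisons of fitness values, so it is insensitive to their absolute scale, which is one reason it is regarded as easy to implement.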
The population was initialized by randomly generating matrices with different numbers of rows. Elements of the first column were generated with probabilities proportional to the values of the corresponding elements of \({\mathbf {q}}\). Elements of the third column were generated with probabilities proportional to the total numbers of records in corresponding submicrofiles.
During the MA run, we applied linear fitness scaling in the form presented in Goldberg (1989, p. 79) to prevent premature convergence. We also multiplied the mutation probabilities by a factor of 10 whenever the standard deviation of the population fitness values dropped below 0.03.
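Both mechanisms described above are easy to sketch. The code below follows the linear scaling procedure as it is commonly given after Goldberg (1989): the scaled fitness is \(f' = af + b\), the mean is preserved, and the maximum is mapped to `c_mult` times the mean unless this would produce negative values. The value `c_mult = 2.0`, the use of the population (rather than sample) standard deviation, and the function names are assumptions of this sketch.

```python
import statistics


def linear_scaling(fitness, c_mult=2.0):
    """Scale raw fitness values so that the mean is preserved."""
    f_avg = statistics.fmean(fitness)
    f_max, f_min = max(fitness), min(fitness)
    if f_max == f_avg:                      # all values equal: nothing to scale
        return list(fitness)
    if f_min > (c_mult * f_avg - f_max) / (c_mult - 1.0):
        # Normal case: the maximum maps to c_mult * f_avg.
        delta = f_max - f_avg
        a = (c_mult - 1.0) * f_avg / delta
        b = f_avg * (f_max - c_mult * f_avg) / delta
    else:
        # Otherwise scale as much as possible while keeping the minimum at 0.
        delta = f_avg - f_min
        a = f_avg / delta
        b = -f_min * f_avg / delta
    return [a * f + b for f in fitness]


def mutation_probabilities(base, fitness, threshold=0.03, factor=10.0):
    """Boost mutation when the population fitness values have little spread."""
    if statistics.pstdev(fitness) < threshold:
        return [p * factor for p in base]
    return list(base)


scaled = linear_scaling([5.0, 5.0, 5.0, 6.0])
boosted = mutation_probabilities([0.001] * 4, [0.50, 0.51, 0.52, 0.50])
print(scaled, boosted)
```

Scaling keeps a few strong individuals from dominating selection early in the run, whereas the mutation boost reintroduces diversity once the population has become nearly uniform.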
We performed 10 runs of the MA. Among the 1000 individuals obtained in the last generation of each run, 983 correspond to valid solutions of the TPGA in terms of masking outliers in the auxiliary quantity signal. In Fig. 3 (dashed line), we present the solution with the lowest cumulative influential metric (25), namely, 53. This solution is valid because applying MTTT to it yields \(OUT\left( {\mathbf {q}}^{aux*}_{NY\ 2013}\right) = \left\{ 12, 35\right\}\). Since \(OUT_e\left( {\mathbf {q}}^{aux}_{NY\ 2013}\right) \cap OUT\left( {\mathbf {q}}^{aux*}_{NY\ 2013}\right) = \emptyset\), we can conclude that the memetic algorithm managed to successfully modify the auxiliary quantity signal by eliminating the real outliers and creating new ones in the \(12\hbox {th}\) and \(35\hbox {th}\) signal elements.
The mean cumulative metric (25) over all solutions that can similarly be viewed as valid is 62.518, i.e., we need to alter only \(\frac{62.518}{13\cdot 1,380,924}\cdot 100\,\% \approx 0.0003\,\%\) of the microfile attribute values in order to provide group anonymity.
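The validity criterion and the distortion estimate above can be reproduced directly from the reported figures. The sketch below takes the real outliers to be the 5th and 29th signal elements named earlier in this section; the function names are hypothetical.

```python
N_ATTRIBUTES = 13        # attributes per record
N_RECORDS = 1_380_924    # records in the microfile
MEAN_METRIC = 62.518     # mean cumulative metric (25) over valid solutions


def masks_outliers(real_outliers, outliers_after_masking):
    """A solution is valid if no real outlier is still detected."""
    return set(real_outliers).isdisjoint(outliers_after_masking)


def distortion_percent(metric, n_attributes=N_ATTRIBUTES, n_records=N_RECORDS):
    """Altered attribute values as a percentage of all attribute values."""
    return 100.0 * metric / (n_attributes * n_records)


valid = masks_outliers({5, 29}, {12, 35})
share_valid = 983 / 1000
percent = distortion_percent(MEAN_METRIC)
print(valid, share_valid, percent)
```

The microfile holds \(13 \cdot 1,380,924 = 17,952,012\) attribute values in total, so a mean of 62.518 altered values corresponds to roughly 0.00035% of them.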
Conclusions
In this work, we demonstrated that even if vital attributes are removed from the microfile, it does not necessarily follow that group anonymity is fully provided. Using an appropriately tailored evolutionary algorithm, it is possible to build a fuzzy model of a group in the form of fuzzy rules that can violate group anonymity. We also discussed how memetic algorithms can be used to actually provide group anonymity in a microfile at the cost of introducing only a small amount of distortion into the microdata.
Much work remains to be done. Directions for future research include enhancing the classification accuracy of the fuzzy rules and improving the efficiency of the memetic algorithm by choosing appropriate operators.
Declarations
Authors’ contributions
DT built the evolutionary and memetic algorithms. OC proposed the modification of the fuzzy classifier. Both authors performed analysis of existing methods for solving the task of providing group anonymity. Both authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
 ASME (2013) Test uncertainty: PTC 19.1-2013. ASME, New York
 Atzmueller M (2015) Subgroup discovery. WIREs Data Min Knowl Discov 5:35–49
 Brindle A (1981) Genetic algorithms for function optimization. PhD thesis, University of Alberta, Department of Computer Science
 Carmona CJ, González P, del Jesus MJ, Herrera F (2014) Overview on evolutionary subgroup discovery: analysis of the suitability and potential of the search performed by evolutionary algorithms. WIREs Data Min Knowl Discov 4:87–103
 Chertov O (2010) Group methods of data processing. Lulu.com, Raleigh
 Chertov O, Tavrov D (2010) Group anonymity. In: Huellermeier E, Kruse R, Hoffmann F (eds) Information processing and management of uncertainty in knowledge-based systems. Applications. Communications in computer and information science, vol 81. Springer, Berlin, pp 592–601
 Chertov O, Tavrov D (2015) Microfiles as a potential source of confidential information leakage. In: Yager RR, Reformat MZ, Alajlan N (eds) Intelligent methods for cyber warfare. Studies in computational intelligence, vol 563. Springer, Heidelberg, pp 87–114
 Chertov O, Tavrov D (2012) Providing group anonymity using wavelet transform. In: MacKinnon LM (ed) Data security and security data. Lecture notes in computer science, vol 6121. Springer, Berlin, pp 25–36
 Chertov O, Tavrov D (2014) Memetic algorithm for solving the task of providing group anonymity. In: Jamshidi M, Kreinovich V, Kacprzyk J (eds) Advanced trends in soft computing. Studies in fuzziness and soft computing, vol 312. Springer, Heidelberg, pp 281–292
 Chertov O, Pilipyuk A (2011) Statistical disclosure control methods for microdata. In: 2009 International symposium on computing, communication, and control. Proceedings of CSIT, vol 1. IACSIT Press, Singapore, pp 339–343
 Composition of PUMAs and SuperPUMAs in the 2000 Census and ACS/PRCS from 2005–2011. Minnesota Population Center. https://usa.ipums.org/usa/volii/2000pumas.shtml
 Composition of 2010 Based PUMAs Used in the ACS/PRCS Samples from 2012-present. Minnesota Population Center. https://usa.ipums.org/usa/volii/pumas10.shtml
 Del Jesus MJ, González P, Herrera F, Mesonero M (2007) Evolutionary fuzzy rule induction process for subgroup discovery: a case study in marketing. IEEE Trans Fuzzy Syst 15(4):578–592
 Domingo-Ferrer J, Mateo-Sanz JM (2002) Practical data-oriented microaggregation for statistical disclosure control. IEEE Trans Knowl Data Eng 14(1):189–201
 Dwork C (2006) Differential privacy. In: Bugliesi M, Preneel B, Sassone V, Wegener I (eds) Automata, languages and programming. Lecture notes in computer science, vol 4052. Springer, Berlin, pp 1–12
 Eiben AE, Smith JE (2015) Introduction to evolutionary computing. Springer, Berlin
 Evfimievski A (2002) Randomization in privacy preserving data mining. ACM SIGKDD Expl Newsl 4(2):43–48
 Fienberg S, McIntyre J (2005) Data swapping: variations on a theme by Dalenius and Reiss. J Off Stat 21(2):309–324
 Freitas AA (1999) On rule interestingness measures. In: Miles R, Moulton M, Bramer M (eds) Research and development in expert systems XV. Proceedings of ES98, the eighteenth annual international conference of the British Computer Society Specialist Group on Expert Systems, Cambridge, December 1998. Springer, Heidelberg, pp 147–158
 Fung BCM, Wang K, Chen R, Yu PS (2010) Privacy-preserving data publishing: a survey of recent developments. ACM Comput Surv (CSUR) 42(4):1–53
 Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison-Wesley, Crawfordsville
 Gordin MD (2003) A modernization of ‘peerless homogeneity’: the creation of Russian smokeless gunpowder. Technol Cult 44:677–702
 Grubbs FE (1969) Procedures for detecting outlying observations in samples. Technometrics 11:1–21
 Holland JH (1976) Adaptation. In: Rosen R, Snell FM (eds) Progress in theoretical biology. Plenum, New York, pp 263–293
 Ishibuchi H, Nozaki K, Yamamoto N, Tanaka H (1995) Selecting fuzzy if-then rules for classification problems using genetic algorithms. IEEE Trans Fuzzy Syst 3:260–270
 Ishibuchi H, Tomoharu N, Murata T (1999) Performance evaluation of fuzzy classifier systems for multidimensional pattern classification problems. IEEE Trans Syst Man Cybern 29(5):601–618
 Kass RE, Raftery AE (1995) Bayes factors. J Am Stat Assoc 90(430):773–795
 Klir GJ, Yuan B (1995) Fuzzy sets and fuzzy logic: theory and applications. Prentice Hall, Upper Saddle River
 Kumar S, Sharma VK, Kumari R (2014) Improved onlooker bee phase in artificial bee colony algorithm. Int J Comput Appl 90(6):31–39
 Lanzante JR (1996) Resistant, robust and nonparametric techniques for the analysis of climate data: theory and examples, including applications to historical radiosonde station data. Int J Climatol 16:1197–1226
 Lavrač N, Kavšek B, Flach P, Todorovski L (2004) Subgroup discovery with CN2-SD. J Mach Learn Res 5:153–188
 Lavrač N, Flach P, Zupan B (1999) Rule evaluation measures: a unifying view. In: Džeroski S, Flach P (eds) Inductive logic programming. Proceedings of the 9th international workshop, ILP-99 Bled, Slovenia, June 24–27, 1999. Lecture notes in computer science, vol 1634. Springer, Heidelberg, pp 174–185
 Moscato P (1989) On evolution, search, optimization, genetic algorithms and martial arts: toward memetic algorithms. Technical Report C3P Rep. 826, Caltech Concurrent Computation Program
 Office of the Deputy under Secretary of Defense (2000) Base structure report (a summary of DoD’s Real Property Inventory) Fiscal Year 2001 Baseline. Office of the Deputy under Secretary of Defense, Washington, DC
 Olivetti E, Greiner S, Avesani P (2015) Statistical independence for the evaluation of classifier-based diagnosis. Brain Inform 2:13–19
 Olivetti E, Greiner S, Avesani P (2012) Induction in neuroscience with classification: issues and solutions. In: Langs G, Rish I, Grosse-Wentrup M, Murphy B (eds) Machine learning and interpretation in neuroimaging. Lecture notes in computer science, vol 7263. Springer, Heidelberg, pp 42–50
 Pfitzmann A, Hansen M (2010) A terminology for talking about privacy by data minimization: anonymity, unlinkability, undetectability, unobservability, pseudonymity, and identity management, Version V0.34. http://dud.inf.tu-dresden.de/Anon_Terminology.shtml
 Rashid AH, Yasin NBM (2015) Privacy-preserving data publishing: review. Int J Phys Sci 10(7):239–247
 Ruggles S, Alexander JT, Genadek K, Goeken R, Schroeder MB, Sobek M (2010) Integrated public use microdata series: version 5.0 [Machine-readable database]. University of Minnesota, Minneapolis
 Smith SF (1980) A learning system based on genetic adaptive algorithms. PhD thesis, University of Pittsburgh
 Sowmyarani CN, Srinivasan GN (2012) Survey on recent developments in privacy preserving models. Int J Comput Appl 38(9):18–22
 Student (1908) The probable error of a mean. Biometrika 6(1):1–25
 Syswerda G (1991) Schedule optimization using genetic algorithms. In: Davis L (ed) Handbook of genetic algorithms. Van Nostrand Reinhold, New York, pp 332–349
 Syswerda G (1989) Uniform crossover in genetic algorithms. In: Schaffer JD (ed) Proceedings of the 3rd international conference on genetic algorithms. Morgan Kaufmann Publishers Inc., pp 2–9
 Tavrov D (2015) Memetic approach to anonymizing groups that can be approximated by a fuzzy inference system. In: 2015 Annual conference of the North American fuzzy information processing society (NAFIPS) held jointly with 2015 5th world conference on soft computing (WConSC), pp 1–6
 Thompson WR (1935) On a criterion for the rejection of observations and the distribution of the ratio of deviation to sample standard deviation. Ann Math Stat 6(4):214–219
 Tishchenko V, Mladientsev M (1993) Dmitrii Ivanovich Mendeleyev, Yego Zhizn i Deyatelnost. Universitetskii Period 1861–1890 Gg. (In Russian). Nauka, Moskva
 Valenzuela-Rendón M (1991) The fuzzy classifier system: motivations and first results. In: Proceedings of parallel solving from nature (PPSN II), pp 330–334
 Wong RCW, Fu AWC (2010) Privacy-preserving data publishing: an overview (Synthesis lectures on data management). Morgan and Claypool Publishers, San Rafael
 Wrobel S (1997) An algorithm for multi-relational discovery of subgroups. In: Komorowski J, Zytkow J (eds) Principles of data mining and knowledge discovery. Proceedings of the first European symposium, PKDD ’97 Trondheim, Norway, June 24–27, 1997. Lecture notes in computer science, vol 1263. Springer, Heidelberg, pp 78–87
 Youden WJ (1950) Index for rating diagnostic tests. Cancer 3(1):32–35
 Zadeh LA (1965) Fuzzy sets. Inf Control 8:338–353
 Zadeh LA (1975) The concept of a linguistic variable and its application to approximate reasoning-II. Inf Sci 8:301–357