Evolutionary approach to violating group anonymity using third-party data

In the era of Big Data, it is almost impossible to completely restrict access to primary non-aggregated statistical data. However, the risk of violating the privacy of individual respondents and groups of respondents by analyzing primary data has not been reduced. Subtler methods of data protection need to be developed to come to grips with these challenges. In some cases, individual and group privacy can be easily violated, because the primary data contain attributes that uniquely identify individuals and groups thereof. Removing such attributes from the dataset is a crude solution and does not guarantee complete privacy. In the field of providing individual data anonymity, this problem has been widely recognized, and various methods have been proposed to solve it. In the current work, we demonstrate that it is possible to violate group anonymity as well, even if those attributes that uniquely identify the group are removed. As it turns out, it is possible to use third-party data to build a fuzzy model of a group. Typically, such a model comes in the form of a set of fuzzy rules, which can be used to determine membership grades of respondents in the group with a level of certainty sufficient to violate group anonymity. In this work, we introduce a method based on evolutionary computing to build such a model. We also discuss a memetic approach to protecting the data from group anonymity violation in this case.

do that (patients' medical records), law enforcement agencies do that (personal IDs), employers do that (CVs), retail stores do that (personal discount cards), security services do that (security camera footage), and so on. Problems of preserving privacy in such data are widely discussed within the field of privacy-preserving data publishing (Fung et al. 2010; Wong and Fu 2010). To a great extent, appropriate protection implies removing identifiers (passport data, full name etc.), and distorting the data (e.g., values of certain characteristics are swapped between respondents or perturbed with noise) or suppressing them (e.g., data on elderly people are grouped in a category of senior citizens).
At the same time, problems of protecting group distributions for certain categories of respondents remain unsolved. Let us consider a case when abnormal concentration of nuclear physicists on a specific territory reveals the site of a secret nuclear research facility. Of course, removing such attributes as Occupation or Industry seems to be a first choice. However, the risk of privacy violation remains high if there is information about where respondents pursued their higher education (e.g., National Institute for Nuclear Science and Technology for academic training in atomic energetics is situated in Saclay commune, France), or about where they lived (for instance, Dubna, Russian Federation, is a home to Joint Institute for Nuclear Research). Therefore, the task of protecting distributions for a certain group of respondents (which can be persons, households, enterprises etc.) with minimal distortion of primary statistical data is a pressing one.
There are numerous practical cases when we do not have attributes at our disposal that classify a respondent as belonging to a certain group (either because they were deliberately removed by the data publisher, or because they were not present in the first place). However, we can try to restore group distributions by analyzing publicly available data such as statistical surveys, polls etc. (Chertov and Tavrov 2015). Using expert judgments about these data, we can build a fuzzy model of a group in a form of a fuzzy inference system (FIS) that, for each respondent, gives her membership grade in the group under consideration. A distribution constructed this way can violate group anonymity as discussed above.
Expert judgments often are not a reliable source of fuzzy rules that constitute the main part of any FIS. Sometimes, it is hard even to properly identify attributes necessary to include into a model of a group, let alone determine particular fuzzy rules. In this work, we propose an evolutionary based method of building the fuzzy model using third-party data. We also describe a memetic algorithm for solving the task of anonymizing the obtained distribution. This algorithm seeks minimal distortion in the microfile, and at the same time ensures that group anonymity cannot be violated.

Data anonymity
Anonymity of a subject means (Pfitzmann and Hansen 2010) that it is not identifiable (uniquely characterized) within a set of subjects. Two kinds of anonymity can be distinguished, individual anonymity and group anonymity. Group anonymity means that information about a group of respondents cannot be used to violate sensitive features of appropriate distributions.
Methods for providing individual anonymity are discussed in the field of privacy-preserving data publishing (Fung et al. 2010; Wong and Fu 2010). Many methods have been proposed over the years, including randomization (Evfimievski 2002), microaggregation (Domingo-Ferrer and Mateo-Sanz 2002), data swapping (Fienberg and McIntyre 2005), and differential privacy (Dwork 2006). A comprehensive overview of recent developments in the field can be found in Sowmyarani and Srinivasan (2012) and Rashid and Yasin (2015).
The problem of violating data group anonymity, i.e., anonymity not of individual respondents, but of groups thereof, was first introduced in the context of providing group anonymity in Chertov and Tavrov (2010). It was shown that group anonymity can be violated by analyzing outliers of a so-called quantity signal $q = (q_1, q_2, \ldots, q_{l_p})$, where each $q_k$, $k = 1, 2, \ldots, l_p$, stands for the number of respondents belonging to a given group (e.g., group of military personnel, or group of nuclear scientists) in a given submicrofile, and $l_p$ is the total number of submicrofiles. A submicrofile is a subset of microfile records sharing the same property, such as region of work. In Chertov and Tavrov (2010), it was argued that outliers in a quantity signal that corresponds to the regional distribution of military personnel can be used to disclose locations of (potentially classified) military bases.
In Chertov and Tavrov (2012), the concept of a quantity signal was taken further by introducing a concentration signal $c = (c_1, c_2, \ldots, c_{l_p})$, where each $c_k$, $k = 1, 2, \ldots, l_p$, is obtained by dividing the corresponding $q_k$ by the total number of records in the corresponding submicrofile. The concentration signal can be used to violate anonymity of groups when absolute numbers of respondents are not sufficient. For instance, as was argued in Chertov and Tavrov (2012) using scientists as an example, extreme ratios of scientists working in a given region could potentially give away the location of a classified research center.
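The two signals defined above can be illustrated with a minimal Python sketch. The record layout (dictionaries with `region` and `occupation` keys), the function name `quantity_and_concentration`, and the toy data are assumptions made only for this illustration; they are not taken from the cited works.

```python
from collections import Counter

def quantity_and_concentration(records, in_group, parameter):
    """Return parameter values, quantity signal q, and concentration signal c.

    records   -- iterable of dicts (hypothetical microfile records)
    in_group  -- predicate telling whether a record belongs to the group
    parameter -- key of the attribute that splits records into submicrofiles
    """
    totals = Counter()   # number of records per submicrofile
    members = Counter()  # number of group members per submicrofile
    for rec in records:
        totals[rec[parameter]] += 1
        if in_group(rec):
            members[rec[parameter]] += 1
    values = sorted(totals)
    q = [members[p] for p in values]              # absolute counts
    c = [members[p] / totals[p] for p in values]  # counts relative to submicrofile size
    return values, q, c

# Toy data: region B has more military respondents in absolute terms,
# but the same share of them as region A.
records = [
    {"region": "A", "occupation": "military"},
    {"region": "A", "occupation": "teacher"},
    {"region": "B", "occupation": "military"},
    {"region": "B", "occupation": "military"},
    {"region": "B", "occupation": "nurse"},
    {"region": "B", "occupation": "clerk"},
]
values, q, c = quantity_and_concentration(
    records, lambda r: r["occupation"] == "military", "region"
)
```

In this toy case the quantity signal differs between regions while the concentration signal does not, which is why the two signals can expose different sensitive features.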
In general, group anonymity can be violated by analyzing such sensitive properties of quantity and concentration signals as (Chertov 2010, p. 77) outliers (almost always a sensitive feature of any distribution), certain statistical features and trends (especially in the case when the quantity signal represents an ordered sequence of numbers), cycles or periods (especially when the quantity signal represents a time series), or frequency spectrum.
In certain practical applications, when the groups are defined in terms of specific attributes (such as a group of military personnel, which is defined by a special attribute uniquely identifying a respondent as a military enlisted), it is possible to protect group anonymity by removing this attribute from the original dataset before publishing. Being a crude solution by itself, it is still not applicable in a number of cases, when it is possible to build an approximation of a group, i.e., define a set of records in the dataset such that its quantity or concentration signal is sufficiently similar to the original one so that it is possible to violate anonymity of the group in question.
Taking into consideration uncertain and imprecise nature of statistical datasets, it was proposed in Chertov and Tavrov (2015) to violate group anonymity with the help of a fuzzy model of a group.
In Chertov and Tavrov (2014), a method for providing group anonymity based on memetic computing was proposed. This method enables us to modify the quantity (or concentration) signal in order to mask its outliers, and at the same time tries to minimize distortion introduced in the dataset. In Tavrov (2015), this algorithm was adapted to work with the fuzzy models proposed in Chertov and Tavrov (2015).
In the next subsection, we will briefly review the concept of fuzzy inference, which is necessary for discussing fuzzy models of groups of respondents.

Fuzzy inference
The concept of a fuzzy set was first introduced in Zadeh (1965). A fuzzy set $A$ in a universal set $X$ is a class in which a point $x \in X$ may have a grade of membership in the interval $[0, 1]$. Each fuzzy set $A$ is characterized by a membership function $\mu_A : X \to [0, 1]$, which associates with each $x \in X$ a real number in the interval $[0, 1]$ considered as the "grade of membership" of $x$ in $A$.
Fuzzy sets constitute a core of linguistic variables (Zadeh 1975). An ordinary variable is characterized by a triple (X, U , R(X, u)), in which X is the name of the variable, U is the universe of discourse, u is a generic name for the elements of U, and R(X, u) is a subset of U, which represents a restriction on the values of u imposed by X. A fuzzy variable differs from the ordinary one in that R is a fuzzy subset of U, which represents a fuzzy restriction on the values of u imposed by X.
A linguistic variable differs from an ordinary numerical variable in that its values are not numbers but words or sentences in a natural or artificial language. It is formally characterized by a quintuple $(\mathcal{X}, T(\mathcal{X}), U, G, M)$, in which $\mathcal{X}$ is the name of the variable; $T(\mathcal{X})$ denotes the term-set of $\mathcal{X}$, i.e., the set of names of linguistic values of $\mathcal{X}$, with each value being a fuzzy variable denoted generically by $X$ and ranging over a universe of discourse $U$, which is associated with the base variable $u$; $G$ is a syntactic rule for generating the names, $X$, of values of $\mathcal{X}$; and $M$ is a semantic rule for associating with each $X$ its meaning, $M(X)$, which is a fuzzy subset of $U$. The meaning, $M(X)$, of a term $X$ is defined to be the restriction, $R(X)$, on the base variable $u$, which is imposed by the fuzzy variable named $X$. For example, we can consider a linguistic variable named Number, which is associated with the finite term-set $T(\text{Number}) = \text{few} + \text{several} + \text{many}$, where $+$ denotes union, and in which each term represents a restriction on the values of $u$ in the universe of discourse $U = 1 + 2 + \cdots + 10$.
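The Number example can be made concrete with a short Python sketch. The membership grades below are illustrative assumptions chosen only to show the mechanics; the names `NUMBER_TERMS`, `grade`, and `union_grade` are hypothetical.

```python
# Universe of discourse U = {1, ..., 10} for the linguistic variable "Number".
U = range(1, 11)

# Each term is a fuzzy subset of U, stored as {element: membership grade}.
# The grades are assumed for illustration; elements not listed have grade 0.
NUMBER_TERMS = {
    "few":     {1: 1.0, 2: 1.0, 3: 0.8, 4: 0.4},
    "several": {3: 0.5, 4: 0.8, 5: 1.0, 6: 1.0, 7: 0.8, 8: 0.5},
    "many":    {6: 0.4, 7: 0.6, 8: 0.8, 9: 0.9, 10: 1.0},
}

def grade(term, u):
    """Membership grade of base-variable value u in the fuzzy set named term."""
    return NUMBER_TERMS[term].get(u, 0.0)

def union_grade(terms, u):
    """Grade of u in the union of several terms, using max as the fuzzy union."""
    return max(grade(t, u) for t in terms)
```

With these assumed grades, the value 4 is "few" to degree 0.4 and "several" to degree 0.8, so a single base-variable value can satisfy several linguistic restrictions to different degrees.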
Linguistic variables can be used to formalize knowledge in the form of fuzzy propositions. While each classical proposition (i.e., a sentence in some language) is required to be either true or false, the truth of fuzzy propositions is a matter of degree. The canonical form of the fuzzy proposition, $p$, is expressed (Klir and Yuan 1995) by the sentence

$$p : \mathcal{V} \text{ is } F, \tag{1}$$

where $\mathcal{V}$ is a linguistic variable with the base variable $v$ defined on some universal set $V$, and $F$ is a fuzzy set on $V$ that represents a fuzzy predicate. Given a particular value of $v$, this value belongs to $F$ with membership grade $\mu_F(v)$. This membership grade is then interpreted as the degree of truth, $T(p)$, of proposition $p$.
Of particular interest for the task of building fuzzy models of groups are conditional propositions (fuzzy rules), expressed by the canonical form (Klir and Yuan 1995)

$$p : \text{If } \mathcal{X} \text{ is } A, \text{ then } \mathcal{Y} \text{ is } B, \tag{2}$$

where $\mathcal{X}$ and $\mathcal{Y}$ are linguistic variables with the base variables $x$ and $y$ whose values are in sets $X$ and $Y$, respectively; $A$ and $B$ are fuzzy sets on $X$ and $Y$, respectively. Antecedents (left parts) of fuzzy rules can contain more than one linguistic variable:

$$p : \text{If } \mathcal{X}_1 \text{ is } A_1, \text{ and } \mathcal{X}_2 \text{ is } A_2, \ldots, \text{ and } \mathcal{X}_n \text{ is } A_n, \text{ then } \mathcal{Y} \text{ is } B, \tag{3}$$

where the logical connective "and" can be interpreted as a proper fuzzy intersection (Zadeh 1965).
In Chertov and Tavrov (2015), an expert-based procedure was proposed for building a fuzzy model of a given group to be protected in the form of a fuzzy inference system (Klir and Yuan 1995), i.e., a system which employs expert knowledge in the form of fuzzy rules for making inferences. Such a fuzzy model can then be thought of as a fuzzy classifier that assigns to a given respondent a certain grade of membership in the group.
One of the biggest challenges in creating a fuzzy model of a group is coming up with a comprehensive and complete set of rules. When the number of input variables is relatively big, the total number of consistent fuzzy rules can grow beyond a point when it is all but impossible to use subjective expert knowledge to formalize them.
In some cases, the problem is not only that of defining proper fuzzy rules, but of defining, which variables to account for in the antecedents. For instance, in the case of building a fuzzy model of a group of military personnel, the choice needs to be made as to what microfile attributes need to be considered to make an accurate classification of a given respondent as a military person. In many practical tasks, there is no way of knowing this beforehand, so appropriate efficient search algorithms should be applied, such as evolutionary algorithms.

Evolutionary approach to building fuzzy rules
Evolutionary algorithms are heuristic generate-and-test algorithms that mimic biological evolution by natural selection (Eiben and Smith 2015, p. 5). The task of creating a fuzzy rule set that enables us to violate group anonymity is a complex one, therefore utilizing evolutionary algorithms is a suitable approach to solving this problem.
Historically, application of evolutionary and, in particular, genetic algorithms to evolving rule-based systems was first proposed in Holland (1976) in the context of learning classifier systems. Such systems were described (Eiben and Smith 2015, p. 108) as a framework for studying learning in condition:action rule based systems, using genetic algorithms as the method for the discovery of new rules.
Over the years, evolutionary algorithms have been proposed for evolving fuzzy rules as well. For instance, in Ishibuchi et al. (1995, 1999), an evolutionary algorithm was proposed for evolving fuzzy classifiers, i.e., rule based systems with fuzzy rules for solving classification tasks. In such systems, consequents (right parts) of the rules in the form (3) are labels of classes of interest rather than linguistic variables.
The task of evolving fuzzy rules for violating group anonymity can be viewed as a task of subgroup discovery, which is defined (Wrobel 1997) as the task of finding interesting subgroups in a population of individuals, where interestingness is defined as distributional unusualness with respect to a certain property of interest. Subgroup discovery represents (Jesus et al. 2007) a form of supervised inductive learning, in which, given a set of data and a property of interest to the user, an attempt is made to locate subgroups that are statistically most interesting for the user.
Since the subgroups discovered in data need to be of a more explanatory nature (interpretability of the extracted knowledge for the final user is a crucial aspect), a fuzzy approach (Jesus et al. 2007) for a subgroup discovery process, which considers linguistic variables in descriptive fuzzy rules, is a good approach to take.
It is important to make a distinction between subgroup discovery and the task of classification, because (Carmona et al. 2014) subgroup discovery attempts to describe knowledge by data, while a classifier attempts to predict the target value for new data to incorporate in the model. In the context of a fuzzy model of a group of respondents, whose anonymity needs to be violated, we are more interested in the classification side. However, many ideas from the field of subgroup discovery can provide useful insight, as will be shown in the paper. An overview of recent developments in the field of subgroup discovery can be found in Atzmueller (2015). Evolutionary algorithms for subgroup discovery are discussed in Carmona et al. (2014).
In general, there can be distinguished two approaches to evolving rule-based systems: Michigan approach (Valenzuela-Rendón 1991) and Pittsburgh approach (Smith 1980). In the first case, each individual in the evolutionary algorithm population corresponds to a single rule. In the second case, each individual is a complete model, i.e., the whole set of rules.
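The difference between the two encodings can be sketched in Python as follows. The rule representation (one term index per linguistic variable, with 0 standing for an ignored variable) and all function names are assumptions made for this illustration rather than the encoding used by any of the cited algorithms.

```python
import random
from typing import List, Tuple

# A rule antecedent: one term index per linguistic variable.
# Index 0 is assumed here to mean "variable is ignored"; 1..n pick a term.
Rule = Tuple[int, ...]

def random_rule(terms_per_variable: List[int], rng: random.Random) -> Rule:
    """Draw one random rule; variable j gets an index in 0..terms_per_variable[j]."""
    return tuple(rng.randint(0, n) for n in terms_per_variable)

def michigan_population(size: int, terms_per_variable: List[int],
                        rng: random.Random) -> List[Rule]:
    """Michigan approach: every individual is a single rule."""
    return [random_rule(terms_per_variable, rng) for _ in range(size)]

def pittsburgh_population(size: int, rules_per_model: int,
                          terms_per_variable: List[int],
                          rng: random.Random) -> List[List[Rule]]:
    """Pittsburgh approach: every individual is a complete rule set."""
    return [michigan_population(rules_per_model, terms_per_variable, rng)
            for _ in range(size)]

rng = random.Random(0)
michigan = michigan_population(4, [3, 2, 5], rng)
pittsburgh = pittsburgh_population(4, 6, [3, 2, 5], rng)
```

In the Michigan case, fitness is assigned to each rule separately, whereas in the Pittsburgh case one fitness evaluation has to process a whole rule set, which is one way to see why its computation load tends to be higher.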
In the extraction of rules for the subgroup discovery task, the Michigan approach is more suited because (Jesus et al. 2007) the objective is to find a reduced set of rules, in which the quality of each rule is evaluated independently of the rest, and it is not necessary to evaluate jointly the set of rules. Moreover, the computation load of the Pittsburgh approach is typically much higher (Ishibuchi et al. 1999, p. 616).
Rules used for describing a subgroup differ in their ability to describe an interesting subgroup, which is measured by a certain quality measure. In general, quality measures can be grouped (Freitas 1999) into objective and subjective measures. Since subjective measures involve experts for evaluating rules, we will focus only on objective measures, which are data-driven and do not involve expert judgment. A comprehensive overview of quality measures can be found in Lavrač et al. (1999). However, the quality measures described in the literature are not suitable for the task of violating anonymity of a group of respondents with the help of fuzzy rules in terms of disclosing outliers in the quantity signal. We are interested in cumulative classification properties of fuzzy rules. In other words, we allow a certain degree of misclassification, as long as outliers in the quantity signal obtained with the help of the fuzzy rules correspond to the ones in the original quantity signal. In this work, we propose a novel quality measure that takes this into account.
We also propose a version of an evolutionary algorithm for building a fuzzy model of a group as a set of fuzzy rules, which differs from the ones described in the literature in the quality measure used for evaluating fuzzy rules. The fuzzy model evolved using such an algorithm can be used for violating group anonymity in terms of disclosing outliers in the quantity signal.

Group anonymity basics
To set a stage for discussing the fuzzy model of a group, we will first introduce some basic notation.

General group anonymity definitions
Let us define microdata as the data about certain respondents presented in the form of a depersonalized microfile $M$ (i.e., a microfile without identifiers). Each record $r^{(i)}$, $i = 1, 2, \ldots, \rho$, in this microfile contains values of several attributes $w_j$, $j = 1, 2, \ldots, \eta$. Let us denote by $\mathbf{w}_j$ the set of all the values of $w_j$.
There are two types of attributes of the microfile necessary to define a group. Let $w_{v_j}$, $j = 1, 2, \ldots, l$, denote vital microfile attributes. These attributes represent those characteristics of records that enable us to determine whether they belong to a group or not. Let us define a vital value combination $V$ as an element of the Cartesian product $\mathbf{w}_{v_1} \times \mathbf{w}_{v_2} \times \cdots \times \mathbf{w}_{v_l}$. Let us denote a set of vital value combinations by $\mathbf{V} = \{V_1, \ldots, V_{l_v}\}$. We will call records whose attribute values belong to $\mathbf{V}$ vital records. We will denote vital records by r.
Let $w_p$, $p \neq v_j\ \forall j$, denote a parameter microfile attribute. This attribute determines the values over which we should take the distribution of a group defined by the vital attributes. A parameter value $P$ can be defined as a value of the parameter attribute, i.e., $P \in \mathbf{w}_p$. Let us denote a set of parameter values by $\mathbf{P} = \{P_1, \ldots, P_{l_p}\}$. By their nature, parameter values enable us to divide $M$ into several submicrofiles $M_1, \ldots, M_{l_p}$. Each submicrofile $M_k$ contains $\rho_k$ records, $k = 1, 2, \ldots, l_p$, $\sum_k \rho_k = \rho$. All the records in a certain submicrofile $M_k$ share the same parameter value $P_k$.
A word of caution is in order. Throughout this paper, we will assume that if M contains several attributes that can be concatenated to form a single parameter attribute, they will be concatenated.
We will call all the other attributes $w_{b_j}$, $j = 1, 2, \ldots,$
The group of records $G(\mathbf{V}, \mathbf{P})$, whose distribution needs to be masked when providing group anonymity, can be determined by the values of the vital and parameter attributes. We will denote the distribution of $G$, whose sensitive features need to be protected, by �(M, G). Consistent with the existing literature, we will call this distribution the goal representation of a group. Throughout this paper, we will limit ourselves to a particular goal representation most widely used in practice called the quantity signal. This signal is denoted by $q = (q_1, q_2, \ldots, q_{l_p})$, where each $q_k$, $k = 1, 2, \ldots, l_p$, stands for the number of records in $M_k$ that belong to $G$, i.e., whose vital attribute values belong to $\mathbf{V}$.

Quantity signal and its sensitive features
As pointed out before, when providing group anonymity, it is necessary to protect sensitive features of the goal representation under consideration. In this work, we will consider such sensitive features of a quantity signal as its outliers. Outliers of a quantity signal might attract attention to parameter submicrofiles that are supposed to be indistinguishable (sites of military bases, classified research centers etc.).
By outliers of a quantity signal, we will understand its values that are statistically inconsistent with the rest of the signal. There have been proposed several approaches to determining outliers in a given dataset. According to the American National Standard of the American Society of Mechanical Engineers ASME PTC 19.1 (ASME 2013, p. 78), two tests are in common usage, the Thompson τ Technique (Thompson 1935) and the Grubbs Method (Grubbs 1969). In this work, we propose to use the Modified Thompson τ Technique (MTTT) as the method recommended by ASME (2013, p. 79) for identifying suspected outliers. This method is based on the Student's t-distribution (Student 1908), which is most applicable in situations when the sample size is small, which is typically the case with the quantity signals.
Let the values of the quantity signal $q$ be arranged in increasing order. To determine outliers in this signal, one needs to carry out the following steps:
1. Calculate the sample mean and the sample standard deviation:

$$\bar{q} = \frac{1}{m_q} \sum_{i=1}^{m_q} q_i, \qquad \sigma_q = \sqrt{\frac{1}{m_q - 1} \sum_{i=1}^{m_q} \left(q_i - \bar{q}\right)^2}, \tag{4}$$

where $m_q$ is the number of elements in $q$.
2. For each signal value $q_i$, $i = 1, 2, \ldots, m_q$, calculate the absolute value of its deviation from the sample mean:

$$d_i = \left| q_i - \bar{q} \right|. \tag{5}$$

3. Calculate $\tau$ according to

$$\tau = \frac{t_{\alpha/2} \left(m_q - 1\right)}{\sqrt{m_q} \sqrt{m_q - 2 + t_{\alpha/2}^2}}, \tag{6}$$

where $t_{\alpha/2}$ is the critical Student's t value (Student 1908) based on significance level $\alpha$ and $m_q - 2$ degrees of freedom.
4. If there is such $i$ that $d_i > \tau \sigma_q$, then $q_i$ is an outlier. In this case, we need to remove $q_i$ from the signal and return to step 1. If $d_i \leq \tau \sigma_q$ for all $i$, the algorithm stops.
Statistical characteristics (4) are not robust to the presence of outliers in a signal, so other characteristics have been proposed (Lanzante 1996):
• the median, which can be interpreted as the "middle" value of a signal and is estimated by

$$\operatorname{med}(q) = \begin{cases} q_{(m_q + 1)/2}, & m_q \text{ is odd}, \\ \frac{1}{2}\left(q_{m_q/2} + q_{m_q/2 + 1}\right), & m_q \text{ is even}; \end{cases} \tag{7}$$

• the pseudo-standard deviation, which can be defined based on the interquartile range (IQR):

$$\sigma_{ps} = \frac{q_{0.75} - q_{0.25}}{1.349}, \tag{8}$$

where $q_{0.75}$ ($q_{0.25}$) is the upper (lower) quartile. If $m_q$ is even, the upper (lower) quartile is the median of the largest (smallest) $\frac{m_q}{2}$ observations. If $m_q$ is odd, the upper (lower) quartile is the median of the largest (smallest) $\frac{m_q + 1}{2}$ observations.
In this work, we will use the MTTT as described above, where estimates (7) and (8) are used in place of estimates (4).
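The robust variant of the MTTT described above can be sketched in Python as follows. This is a minimal illustration under several assumptions: only the single most deviating value is tested and removed per pass, the significance level is fixed at 0.05, the critical values come from a small hard-coded table of standard two-sided Student's t values (degrees of freedom 1 to 10 only), and the names `T_CRIT_05`, `pseudo_std`, and `mttt_outliers` are hypothetical.

```python
import math
import statistics

# Two-sided critical Student's t values for significance level 0.05,
# indexed by degrees of freedom (standard table values, df = 1..10 only).
T_CRIT_05 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228}

def pseudo_std(sorted_values):
    """IQR-based pseudo-standard deviation; quartiles are medians of the halves."""
    n = len(sorted_values)
    half = n // 2 if n % 2 == 0 else (n + 1) // 2
    lower = statistics.median(sorted_values[:half])
    upper = statistics.median(sorted_values[-half:])
    return (upper - lower) / 1.349

def mttt_outliers(signal, t_crit=T_CRIT_05.__getitem__):
    """Return sorted indexes of signal values flagged as outliers.

    Assumption: each pass tests only the value farthest from the median.
    """
    remaining = list(enumerate(signal))  # keep original indexes
    outliers = []
    while len(remaining) >= 3:           # need at least 1 degree of freedom
        values = sorted(v for _, v in remaining)
        n = len(values)
        center = statistics.median(values)   # used in place of the mean
        spread = pseudo_std(values)          # used in place of the std
        t = t_crit(n - 2)
        tau = t * (n - 1) / (math.sqrt(n) * math.sqrt(n - 2 + t * t))
        idx, val = max(remaining, key=lambda p: abs(p[1] - center))
        if abs(val - center) <= tau * spread:
            break                            # no value exceeds the threshold
        outliers.append(idx)
        remaining.remove((idx, val))
    return sorted(outliers)

flagged = mttt_outliers([10, 12, 11, 13, 12, 11, 50, 12, 10, 11])
```

For this toy signal, the value 50 at index 6 is flagged. Because the robust estimates are tight, some milder values may be flagged as well, which is consistent with the need for an expert to revise the resulting set.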
Typically, the set of outliers yielded by MTTT contains signal elements that would not be considered outliers by an expert. Moreover, in some practical cases, not all outliers need to be masked. For example, when there is a well known military base associated with a particular signal element, masking the corresponding outlier will distort the data and make it obvious that the primary data have been tampered with. Therefore, in the context of providing group anonymity, it is necessary for an expert to revise the set of outliers as determined by MTTT.
Let us denote by $OUT(q)$ the set of indexes of $q$ that correspond to outliers yielded by MTTT. Let us denote by $OUT_e(q) \subseteq OUT(q)$ the subset of indexes of $q$ obtained by excluding from $OUT(q)$ those indexes which an expert considers not important for the task at hand. For brevity, we will also denote by $OUT'_e(q)$ the relative complement of $OUT_e(q)$ with respect to $\{1, 2, \ldots, l_p\}$.

The task of providing group anonymity
To solve the task of providing group anonymity (TPGA), we need to modify the original microfile $M$ in order to obtain a new, protected one $M^*$. Such modification needs to meet three conditions (Chertov and Pilipyuk 2011, p. 339):
• disclosure risk is low or at least adequate to the importance of the information being protected;
• both original and protected microfile data, when analyzed, yield sufficiently similar results;
• the cost of transforming the data is acceptable.
In this paper, by the TPGA, we will understand the task of modifying the microfile in such a way that it is no longer possible to determine outliers in the quantity signal, and at the same time introduce as little distortion as possible in the process.
The easiest "solution" to the TPGA is to recode vital values or remove some of the vital attributes, so that it is impossible to restore the original quantity signal. However, this approach satisfies only one out of three properties stated above, namely, it is easy to carry out. At the same time, this simplistic approach only gives an impression of reducing the disclosure risk. As we will demonstrate later, if an adversary has access to appropriate third-party data, sensitive features of the group distribution can be violated under several conditions. Therefore, even if we choose to remove the vital attributes (or otherwise modify them), we will still need to perform additional microfile modifications in order to properly protect anonymity of a given group.

Auxiliary microfiles
Let us further on assume that all the vital attributes are removed from $M$. Let us denote by $M^H$ the harmonized version of $M$, which can be obtained from $M$ by means of two basic transformations:
• attributes $w_{j_1}, \ldots, w_{j_n}$ are replaced by a single harmonized attribute $w^H_{j_1}$;
• values of an attribute $w_j$ are replaced by values of the $j$th harmonized attribute, which may or may not be equal to any of the values in $\mathbf{w}_j$.
Let us denote by $\tilde{M}$ the auxiliary microfile with $\tilde{\rho}$ records denoted by $\tilde{r}$, which has the following properties:
• records in $M$ and in $\tilde{M}$ are drawn from sufficiently similar distributions;
• $\tilde{M}$ contains auxiliary vital attributes that have the same values and interpretation as the vital attributes in $M$. Auxiliary vital attributes can be used to determine auxiliary vital records, whose total number is $\tilde{\rho}_v$. In addition, vital and auxiliary vital records (as well as the records that are not vital or auxiliary vital, respectively) are drawn from sufficiently similar distributions;
• value combinations of attributes $w^H_{b_j}$, $j = 1, 2, \ldots, t^H$, can be used to determine membership grades $\mu_G\left(r_H^{(i)}\right)$ of each record $r_H^{(i)} \in M^H$, $i = 1, 2, \ldots, \rho$, in a group $G$, whose anonymity needs to be violated;
• an adversary has access to $\tilde{M}$.
It is worth noting that it is not required to harmonize the parameter attribute $w_p$ in the original microfile or its analog $\tilde{w}_p$ in the auxiliary one. Throughout this paper, we will without loss of generality assume that $w_p$ and $\tilde{w}_p$ remain intact during the harmonization process.
If the conditions given above are met, it is possible to build a set of fuzzy rules to determine membership grades $\mu_G\left(r_H^{(i)}\right)$, $i = 1, 2, \ldots, \rho$, of each record in a group. This set of rules can be interpreted as a fuzzy model of the group whose anonymity needs to be violated. This model enables us to construct an auxiliary quantity signal $q_{aux}$ according to (9), where $M^H_j$ is the parameter submicrofile of $M^H$ whose records share the same parameter value $P_j$, and $\alpha$ is the group membership threshold used to cut off records that do not belong to $G$ with a sufficiently high grade. Throughout this paper, we will use $\alpha = 0.5$.
The auxiliary quantity signal $q_{aux}$ does not have to be close in a numerical sense to the original quantity signal $q$; it is only required that outliers in $q_{aux}$ correspond to those in $q$.

Fuzzy rules in a fuzzy model of a group
In order to construct the auxiliary quantity signal as defined by (9), we need to calculate membership grades $\mu_G\left(r_H^{(i)}\right)$ of each microfile record $r_H^{(i)} \in M^H$, $i = 1, 2, \ldots, \rho$. In general, this can be done using appropriate fuzzy rules.
For the case of a fuzzy model of a group, such fuzzy rules can be presented in the following form:

$$R_i : \text{If } L_1 \text{ is } A_{i1} \text{ and } L_2 \text{ is } A_{i2} \text{ and } \ldots \text{ and } L_{t^H} \text{ is } A_{i t^H}, \text{ then } r \in G, \tag{10}$$

where $R_i$, $i = 1, 2, \ldots, m$, denotes the $i$th fuzzy rule, $A_{ij}$ denotes the value of the $j$th linguistic variable $L_j$ used in the $i$th fuzzy rule, and $G$ denotes the class of records that belong to a group.
Each linguistic variable $L_j$ in the fuzzy rules, $j = 1, 2, \ldots, t^H$, corresponds to the attribute $w^H_{b_j}$. In addition, each linguistic variable by default has a value $LL^0_j$ with the membership function $\mu_{LL^0_j} \equiv 1$. If $LL^0_j$ is present in a fuzzy rule $R_i$, it means that the actual value of attribute $w^H_{b_j}$ is discarded. As pointed out in Ishibuchi et al. (1999), in this way we can obtain fuzzy rules of different generalization capacity.
For each linguistic variable, we can define a range $\left[l_{L_j}, u_{L_j}\right]$ of acceptable values of the corresponding base variable. All the records from the harmonized microfile $M^H$ and the harmonized auxiliary microfile $\tilde{M}^H$ whose values of attributes $w^H_{b_j}$ lie outside the specified ranges, $j = 1, 2, \ldots, t^H$, need to be removed. In order not to complicate the notation, we will further on assume that $M^H$ and $\tilde{M}^H$ denote microfiles that contain only those records whose attribute values lie inside the corresponding ranges, unless specified otherwise. Similarly, we will further on assume that $\rho$ and $\tilde{\rho}$ denote the total number of records in $M^H$ and $\tilde{M}^H$, respectively, where $M^H$ and $\tilde{M}^H$ denote either original microfiles or microfiles with records whose attribute values belong to the specified ranges, depending on the context.
In what follows, we will make use of notation accepted in the subgroup discovery field. Let us define the antecedent part compatibility (Jesus et al. 2007) as the degree of compatibility between a record $r$ and the antecedent part of $R_i$:

$$APC(r, R_i) = T\left(\mu_{A_{i1}}(r), \ldots, \mu_{A_{i t^H}}(r)\right), \tag{11}$$

where $\mu_{A_{ij}}$ is the membership function of the fuzzy set $A_{ij}$, evaluated at the value of the corresponding attribute in $r$, and $T$ denotes a proper fuzzy intersection. Throughout this paper, we will use the arithmetic product as the fuzzy intersection.
To account for the group membership threshold $\alpha$ introduced in (9), we will further on use a modification of (11), denoted by $APC_\alpha(r, R_i)$ (12). Then, we can say that

$$\mu_G(r) = \mathop{S}_{i = 1}^{m} APC_\alpha(r, R_i), \tag{13}$$

where $S$ denotes fuzzy union (Zadeh 1965). In this work, we will use the maximum function as the fuzzy union. We say that a record $r$ verifies the antecedent part of $R_i$ if $APC_\alpha(r, R_i) > 0$, and that it is covered by $R_i$ if additionally $r \in G$.
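The computation of antecedent part compatibility and of the resulting membership grade can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the thresholded compatibility is taken to be the plain compatibility when it exceeds the threshold and zero otherwise (one plausible reading, not a form confirmed by the text), `None` in a rule marks a linguistic variable whose value is discarded, and the triangular membership functions, attribute choices, and all names are hypothetical.

```python
from math import prod

def triangular(a, b, c):
    """Triangular membership function with support (a, c) and peak at b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)
    return mu

def apc(record, rule, membership):
    """Antecedent part compatibility with the product as fuzzy intersection.

    rule[j] is a term name for variable j, or None if the variable is ignored
    (an ignored variable contributes grade 1).
    """
    return prod(1.0 if term is None else membership[j][term](record[j])
                for j, term in enumerate(rule))

def apc_alpha(record, rule, membership, alpha=0.5):
    """Assumed thresholding: keep the compatibility only if it exceeds alpha."""
    value = apc(record, rule, membership)
    return value if value > alpha else 0.0

def group_grade(record, rules, membership, alpha=0.5):
    """Membership grade in the group: maximum (fuzzy union) over all rules."""
    return max((apc_alpha(record, r, membership, alpha) for r in rules),
               default=0.0)

# Hypothetical linguistic variables: age (index 0) and income (index 1).
membership = [
    {"young": triangular(15, 25, 35)},
    {"high": triangular(40, 70, 100)},
]
rules = [("young", None), ("young", "high")]
```

For the record `(25, 55)`, the first rule gives compatibility 1.0 and the second gives 0.5, which does not exceed the threshold, so the group grade is 1.0.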
In the context of violating group anonymity in terms of disclosing outliers in the auxiliary quantity signal, we are interested in cumulative classification properties of the fuzzy rules. In other words, we allow a certain degree of misclassification, as long as outliers in the auxiliary quantity signal obtained with the help of the fuzzy rules correspond to the ones in the original quantity signal.
Therefore, we need to introduce quality measures that are different from the ones described in the literature:
• a fuzzy rule should have reasonable discriminative capability, which means that rule $R_i$ classifies as belonging to the group $G$ a disproportionally bigger number of auxiliary vital records than auxiliary records in general. We will introduce a discriminative factor $DF(R_i)$ to measure this property;
• a fuzzy rule should have reasonable relative confidence, which means that $R_i$ incorrectly classifies no more than $\frac{1}{\gamma} \sum_{r \in G} APC_\alpha(r, R_i)$ records as belonging to $G$, where $\gamma$ will be called the relative confidence threshold. We will introduce the relative confidence factor defined by

$$RCF(R_i) = \frac{\sum_{r \in G} APC_\alpha(r, R_i)}{\sum_{r \notin G} APC_\alpha(r, R_i)}. \tag{16}$$

It can be recognized that the minuend from (14) is a fuzzy version of a well-known quality measure called support, and the subtrahend is a fuzzy version of another quality measure called coverage (Lavrač et al. 2004). Support considers the number of examples satisfying both the antecedent and the consequent parts of the rule, whereas coverage measures the percentage of examples covered on average by one rule.
It can also be recognized that (16) resembles the quality measure called confidence introduced in Jesus et al. (2007). However, our version differs in the denominator. Classically, the division is performed over the sum of the degrees of membership of all the records that verify the antecedent part of the rule, whereas in our version we consider only those records that verify the antecedent part of the rule and do not belong to G. In our view, this makes the quality measure easier to interpret, because it shows directly how many respondents the rule classifies incorrectly, in relative terms.
In a fuzzy model of a group, each rule R i needs to have quality measures with the following properties: DF (R i ) > 0, RCF (R i ) ≥ γ. In this case, we will reduce misclassifications, and thereby obtain a more suitable auxiliary quantity signal.
The auxiliary quantity signal contains all the information necessary to violate group anonymity. On the other hand, to protect group anonymity, we need to use a signal that consists of crisp values representing numbers of respondents, not fuzzy degrees. Let us introduce a crisp auxiliary quantity signal (17), whose values are the numbers of records in a corresponding microfile that are assigned a membership grade greater than α. We will make use of the signal defined in this way when we discuss the method for protecting group anonymity in one of the subsequent sections.
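The difference between the two signals can be illustrated with a short Python sketch. The function names and the toy data are hypothetical; the fuzzy signal sums the membership grades per parameter value, whereas the crisp signal counts the records whose grade exceeds α.

```python
from collections import Counter, defaultdict


def fuzzy_auxiliary_signal(records, parameter_values):
    """Sum of membership grades per parameter value.
    `records` is an iterable of (parameter_value, membership_grade) pairs."""
    totals = defaultdict(float)
    for value, grade in records:
        totals[value] += grade
    return [totals[v] for v in parameter_values]


def crisp_auxiliary_signal(records, alpha, parameter_values):
    """Number of records per parameter value whose grade is greater than alpha."""
    counts = Counter(value for value, grade in records if grade > alpha)
    return [counts[v] for v in parameter_values]


# Toy data: three parameter values (e.g., places of work) and five records.
records = [("A", 0.9), ("A", 0.2), ("B", 0.6), ("B", 0.7), ("C", 0.1)]
values = ["A", "B", "C"]
print(fuzzy_auxiliary_signal(records, values))
print(crisp_auxiliary_signal(records, alpha=0.5, parameter_values=values))
```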
As mentioned earlier, due to complicated interrelations between different rules in the rule base, it is virtually impossible to construct the rule base from scratch using only expert knowledge. In the sections that follow, we will present an appropriately tailored evolutionary algorithm for solving this task.

Adequacy of the fuzzy model of a group
In this section, we will briefly discuss possible tests for evaluating adequacy of the fuzzy model of the group described above. By adequacy of the fuzzy model we will mean its ability to correctly determine outliers in the quantity signal, i.e., how similar the outliers in the original and auxiliary quantity signals are. It therefore seems natural to evaluate model adequacy using tests designed to evaluate accuracy of classifiers.
Let X = R^n be the multidimensional pattern space under investigation, each element x ∈ X of which belongs to one of the two classes from the set Y = {C 1 , C 2 }. Let P XY be the unknown joint distribution over X × Y . Suppose we are given a classifier f : X → Y that maps each pattern x ∈ X to a certain class. Let ε = E XY [f (x) ≠ y] be the classifier error, where E is the expectation operator.
Since in practical cases X is typically a set of finite size, ε can only be estimated. Let S = {(x 1 , y 1 ), . . . , (x m , y m )} be the set of pairs drawn from P XY . Let us introduce the confusion matrix Z (18) (Olivetti et al. 2012), where TP (true positive) is the number of patterns from S that belong to class C 1 , and for which f (x) = C 1 ; FP (false positive) is the number of patterns from S that belong to class C 1 , and for which f (x) = C 2 ; FN (false negative) is the number of patterns from S that belong to class C 2 , and for which f (x) = C 1 ; TN (true negative) is the number of patterns from S that belong to class C 2 , and for which f (x) = C 2 .
The sum of the elements of (18) is m. Let us denote by e the number of incorrectly classified patterns. Then, Σ i Z ii = m − e.
The prediction accuracy (19) is defined as the fraction of correctly classified patterns, (m − e)/m. When the numbers of patterns per class are not equal, the setting is called unbalanced. As was shown in Olivetti et al. (2012), test (19) is not suitable for unbalanced data. One of the tests suitable for unbalanced data is Youden's J statistic (20) (Youden 1950), i.e., the sum of the sensitivity and the specificity of the classifier minus one. This test explicitly captures the type I and type II errors.
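To see why accuracy can mislead on unbalanced data while Youden's J does not, consider the following Python sketch. It assumes the conventional reading of the confusion matrix cells (a false positive is a pattern that is actually negative but classified as positive); this convention, the function names, and the toy counts are assumptions and should be matched to the cell definitions used with (18) before reuse.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of correctly classified patterns."""
    return (tp + tn) / (tp + fp + fn + tn)


def youden_j(tp, fp, fn, tn):
    """Youden's J: sensitivity + specificity - 1 (conventional cell naming assumed)."""
    sensitivity = tp / (tp + fn)   # share of actual positives that were detected
    specificity = tn / (tn + fp)   # share of actual negatives that were rejected
    return sensitivity + specificity - 1.0


# Hypothetical unbalanced sample: 10 positives, 990 negatives,
# and a degenerate classifier that labels every pattern as negative.
print(accuracy(tp=0, fp=0, fn=10, tn=990))
print(youden_j(tp=0, fp=0, fn=10, tn=990))
```

The degenerate classifier reaches a high accuracy although it detects no positive pattern at all, whereas its J statistic is zero.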
Olivetti et al. (2015) proposed a Bayesian test of statistical independence between the results given by the classifier, on the one hand, and the true distribution P XY , on the other hand. This test also takes into account the unbalanced nature of the data and the size of the data set. Let us denote by H 0 the hypothesis that the results given by the classifier are statistically independent of the true distribution P XY . Let us also denote by H 1 the hypothesis that such results are statistically dependent. Then, let us denote by B the Bayes factor (21) that measures the evidence of the data in favor of H 1 with respect to H 0 , where $\binom{i}{j} = \frac{i!}{j!\,(i-j)!}$, and t 1 and t 2 are non-negative integer parameters. The test for evaluating the classifier is calculated on the basis of (21). Guidelines for the interpretation of this test are given in Table 1 (Kass and Raftery 1995).
In the context of evaluating the adequacy of the fuzzy model of a given group, the pattern space has to be taken as a set of parameter values: X = P. Class C 1 contains those

$$B(t_1, t_2) = \frac{TP + FP + FN + TN + 1}{(TP + FP + t_1 + 1)(FN + TN + t_2 + 1)} \cdots$$

The auxiliary quantity signal q aux can differ from q in two ways:
• some of the outliers in q have no counterpart in q aux , i.e., we cannot violate anonymity of some of the outliers (type II errors). We will call such outliers undisclosed outliers;
• some of the outliers in q aux have no counterpart in q, i.e., the fuzzy rules introduce additional outliers not supported by real data (type I errors). We will call such outliers false outliers.
Taking into consideration notation introduced earlier, elements of the confusion matrix (18) can be defined as follows:

General approach to applying fuzzy rules to violating group anonymity
In general, to violate anonymity of a certain group G in a microfile M in terms of disclosing outliers in its quantity signal, we need to proceed along the following steps:
1. Harmonization. Choose a microfile M and determine a group G of records whose distribution should be disclosed. Choose an auxiliary microfile that satisfies all the conditions given earlier. Perform harmonization of M and the auxiliary microfile, and obtain two harmonized microfiles that have identical attributes with two exceptions: parameter attributes in both harmonized microfiles may not be identical, and the harmonized auxiliary microfile contains auxiliary vital attributes, whereas the harmonized microfile M H has vital attributes removed.
2. Input Variables Identification. For each linguistic variable L j corresponding to a basic harmonized attribute w H b j , j = 1, 2, . . . , t H , define a range of values of its base variable [l L j , u L j ]. Remove from both harmonized microfiles the records whose values of attributes w H b j lie outside the specified ranges, j = 1, 2, . . . , t H . Use expert judgment to determine the fuzzy values LL k j for each linguistic variable L j , j = 1, 2, . . . , t H , k = 1, 2, . . . , l L j , defined by appropriate membership functions denoted by µ LL k j .
3. Evolution. Use the evolutionary algorithm to evolve fuzzy rules for violating anonymity of G in M H based on the data from the harmonized auxiliary microfile. To reduce the number of undisclosed and false outliers, select only those rules R for which DF (R) > 0 and RCF (R) ≥ γ , and whose support is greater than a predefined value κ. To reduce computational overhead, remove rules that are more specific versions of other rules in the set, i.e., for each pair of rules R i and R j , if ∀k A ik ≠ A jk → A ik = LL 0 k , remove R j . Using the fuzzy rules obtained, assign membership grades to all the records in M H , uniting the results in the fuzzy sense.
4. Disclosing Outliers. Construct the auxiliary quantity signal (9) and determine outliers in it.
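The pruning of overly specific rules in step 3 can be sketched as below. The sketch assumes that each rule is a tuple of integer label indexes and that index 0 stands for the "any value" label LL 0 ; the function names and the sample rules are hypothetical.

```python
def generalizes(general, specific):
    """True if, wherever the two rules differ, `general` holds 0 (the "any value" label)."""
    return all(g == s or g == 0 for g, s in zip(general, specific))


def prune_specific_rules(rules):
    """Drop every rule that is a more specific version of another rule in the set."""
    unique = list(dict.fromkeys(map(tuple, rules)))  # remove exact duplicates, keep order
    return [r for r in unique
            if not any(o != r and generalizes(o, r) for o in unique)]


# (0, 2, 0) covers both (1, 2, 0) and (1, 2, 3); (2, 0, 1) is unrelated to the others.
rules = [(0, 2, 0), (1, 2, 0), (1, 2, 3), (2, 0, 1)]
print(prune_specific_rules(rules))
```

The comparison is quadratic in the number of rules, which is acceptable for rule bases of the size discussed here.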
Tavrov and Chertov SpringerPlus (2016) 5:78

Evolutionary algorithm for building the fuzzy model of a group

Outline of the evolutionary algorithm
In the proposed algorithm, whose outline corresponds to the outline presented in Ishibuchi et al. (1995), we perform evolution only at the level of fuzzy rules. This means that we do not perform any fine-tuning of membership functions of input variables. We choose this approach to keep the fuzzy rules in the system comprehensible for humans. The outline of the algorithm is as follows:
1. Randomly generate initial population R = {R i } of µ individuals, i = 1, 2, . . . , µ.
3. Check termination condition: if it is satisfied, stop; continue otherwise.
4. Select pairs of individuals and put them into set R ′ .
5. Recombine pairs of individuals from R ′ with a recombination operator REC(R i , R j ), i = 1, 2, . . . , , j = + 1, . . . , 2 · . Put the offspring into set R ′′ .
6. Mutate individuals from R ′′ with a mutation operator MUT(R j ), j = 1, 2, . . . , .
7. Replace individuals from R that have the lowest fitness values with the mutated offspring.
8. Go to step 3.

Representation and fitness function
In this work, we treat each individual R i ∈ R, i = 1, 2, . . . , µ, as a single rule in the fuzzy rule set being evolved. That is, the whole population constitutes the whole fuzzy rule set, in full concordance with the Michigan approach.
We propose to represent each rule R i , i = 1, 2, . . . , µ, as a vector of integer values, where R ij is the index of a fuzzy value of the linguistic variable L j . The availability of the values LL 0 j , j = 1, 2, . . . , t H , in R i enables us to evolve rules that do not take into account the values of the attribute w H b j . In other words, the evolutionary process can lead to obtaining more generalized rules.
In this work, we evaluate fitness of each individual R i in terms of its quality measures introduced earlier:

Other algorithm parameters
Operator REC(R i 1 , R i 2 ) should be a proper recombination operator for integer representation applied with a high probability p c to two individuals R i 1 and R i 2 that yields two offspring individuals R j 1 and R j 2 . Operator MUT(R) should be a proper mutation operator for integer representation applied with a low probability p m to a single individual R that yields the mutated one R ′ .
In this paper, we will use uniform crossover (Syswerda 1989) as a recombination operator and random resetting mutation (Eiben and Smith 2015, p. 43) as a mutation operator. We will also choose the following algorithm parameters:
• we will choose tournament selection (Brindle 1981) as an efficient and easy-to-implement selection operator, with the tournament size 10;
• we will create initial populations by randomly generating values of each fuzzy rule element R ij , i = 1, 2, . . . , µ, j = 1, 2, . . . , t H , from a uniform distribution on [0, l L j ];
• we will choose the number of generations N as a termination condition, i.e., we will terminate the algorithm after having obtained N consecutive populations.
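The operators named above can be combined into a compact Python sketch. It is not the authors' implementation: the fitness function is passed in as a hypothetical callable, mutation is applied gene-wise with probability `p_m` (one common reading of random resetting), and the parameter names (`n_labels`, `n_replace`) are illustrative. Label index 0 is assumed to stand for LL 0 .

```python
import random


def uniform_crossover(p1, p2, rng):
    """Each gene is swapped between the two offspring with probability 0.5."""
    c1, c2 = list(p1), list(p2)
    for j in range(len(c1)):
        if rng.random() < 0.5:
            c1[j], c2[j] = c2[j], c1[j]
    return c1, c2


def random_resetting(rule, n_labels, p_m, rng):
    """With probability p_m per gene, draw a new label index from 0..n_labels[j]."""
    return [rng.randint(0, n_labels[j]) if rng.random() < p_m else g
            for j, g in enumerate(rule)]


def tournament(population, fitness, size, rng):
    """Return the fittest of `size` individuals sampled without replacement."""
    return max(rng.sample(population, size), key=fitness)


def evolve(fitness, n_labels, mu=100, n_replace=40, p_c=1.0, p_m=0.05,
           tournament_size=10, generations=100, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, l) for l in n_labels] for _ in range(mu)]
    for _ in range(generations):
        offspring = []
        while len(offspring) < n_replace:
            a = tournament(population, fitness, tournament_size, rng)
            b = tournament(population, fitness, tournament_size, rng)
            if rng.random() < p_c:
                c1, c2 = uniform_crossover(a, b, rng)
            else:
                c1, c2 = list(a), list(b)
            offspring += [random_resetting(c, n_labels, p_m, rng) for c in (c1, c2)]
        population.sort(key=fitness)                 # ascending: worst individuals first
        population[:n_replace] = offspring[:n_replace]
    return population


# Toy run with a placeholder fitness (sum of label indexes), three variables.
final = evolve(lambda rule: sum(rule), n_labels=[3, 3, 3], mu=20,
               n_replace=8, tournament_size=3, generations=5)
print(len(final))
```

Because the whole population forms the rule set, the function returns the final population rather than a single best individual.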

General information
In previous sections, we have shown that the TPGA is a pressing one, and group distributions need to be protected even when vital attributes are removed from the microfile.
In this section, we will discuss the memetic algorithm (MA) for solving the task of providing group anonymity. This algorithm was introduced in Chertov and Tavrov (2014), and we will heavily rely on that publication when presenting the algorithm here. We will assume that the data publisher decides to remove vital attributes from the microfile. As pointed out before, to provide group anonymity, we need to mask outliers in an auxiliary quantity signal obtained using appropriate fuzzy rules.
The general outline of a single-stage approach to solving the TPGA is as follows:
1. Prepare a (depersonalized) microfile M representing data to be anonymized.
2. Define groups of respondents G i (V i , P i ), whose quantity signals need to be masked, i = 1, 2, . . . , k.
3. For each i from 1 to k:
In order to modify the auxiliary quantity signal for a given group in a given microfile, we need to physically alter some of the values in the microfile, more precisely, alter parameter values for certain records. To preserve the number of records with a particular parameter value, the records have to be altered in pairs, which can be interpreted as swapping the records between submicrofiles. One record needs to belong to the fuzzy model of a group, and the other must not.
As mentioned before, to solve the TPGA means not only to modify the auxiliary quantity signal, but also to introduce as little distortion into the microfile as possible. To this end, the records being swapped have to be close to each other in some sense. In this work, we will apply the influential metric (Chertov 2010) to determine the degree of similarity between two microfile records. This metric is defined in terms of so-called influential attributes, i.e., those whose distribution is important for further research using microfile data. In this work, we will assume that influential attributes are the same as the basic harmonized attributes.
The influential metric is defined as (25), where I p is the pth ordinal basic attribute (their overall number is n ord ), J k is the kth categorical basic attribute (their overall number is n cat ), χ (v 1 , v 2 ) denotes the operator that equals χ 1 if values v 1 and v 2 fall into one category, and equals χ 2 otherwise, ω p and γ k are non-negative weighting coefficients (the bigger the coefficient, the more important the attribute is for the research).
Preserving data utility from the minimal data distortion point of view is a task of high complexity and dimensionality; therefore, it is a good idea to use MAs (Moscato 1989) to solve the TPGA. MAs are typically implemented as evolutionary algorithms with local search procedures (Eiben and Smith 2015, p. 173). New applications of MAs to solving complex optimization tasks can be found in Kumar et al. (2014).

Outline of the algorithm
An outline of a memetic algorithm for modifying the microfile M in order to protect outliers in the corresponding quantity signal is as follows:
1. Create population P of µ individuals, apply to them local search operator S.
2. Calculate fitness function f (x) for each individual x ∈ P.
3. Check termination condition. If it holds, stop; otherwise, go to 4.
4. Select pairs of parents.
5. Apply recombination operator R to each parent pair.
6. Apply mutation operator M to each of the offspring. Put the offspring into P ′ .
7. Apply local search operator S to each individual x ∈ P ′ .
8. Calculate fitness function f (x) for each individual x ∈ P ′ .
9. Select µ individuals from P ∪ P ′ , put them into P in place of the current ones.
10. Go to 3.
In the algorithm outline above, we made use of several symbols introduced earlier, but with a different meaning. We hope that the meaning of each symbol will be clear from the context in each particular case.
Each individual is a matrix U with Q rows and four columns with the following elements:
1. The first column contains indexes u i1 ∀i = 1, 2, . . . , Q of submicrofiles to remove vital records from. The user has to define the set of such submicrofiles.
2. The third column contains indexes u i3 ∀i = 1, 2, . . . , Q of submicrofiles to add vital records to. The user has to define the set of such submicrofiles.
3. The second column contains indexes u i2 ∀i = 1, 2, . . . , Q of the records from M u i1 to be removed.
4. The fourth column contains indexes u i4 ∀i = 1, 2, . . . , Q of the records from M u i3 to be swapped with the ones defined by u i2 .

(25) InfM(r, r*) = Σ_{p=1}^{n_ord} ω_p r(I_p) − r* …
By its nature, each individual U uniquely defines the modified quantity signal q * , and also determines the particular way of obtaining it, because each row in U defines a particular pair of respondents to be swapped. Thereby, each U defines a complete solution to the TPGA at hand.
Two restrictions are imposed on each individual U:
• a submicrofile index i can occur in the first column of U not more than q i times;
• each pair ⟨u i1 , u i2 ⟩ or ⟨u i3 , u i4 ⟩ ∀i = 1, 2, . . . , Q cannot occur in U more than once.
These restrictions cannot be violated throughout the algorithm run.
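A feasibility check for the two restrictions might look as follows in Python. The representation of U as a list of four-element tuples, the dictionary `q` of quantity signal values, and the reading that the (submicrofile, record) pairs must be unique jointly across both column pairs are assumptions made for illustration.

```python
from collections import Counter


def is_feasible(U, q):
    """U: list of rows (u1, u2, u3, u4); q: dict mapping a submicrofile index
    to the corresponding value of the quantity signal."""
    # Restriction 1: index i occurs in the first column at most q[i] times.
    first_column = Counter(row[0] for row in U)
    if any(count > q.get(i, 0) for i, count in first_column.items()):
        return False
    # Restriction 2 (assumed reading): no (submicrofile, record) pair is used twice.
    pairs = [(row[0], row[1]) for row in U] + [(row[2], row[3]) for row in U]
    return len(set(pairs)) == len(pairs)


q = {0: 2, 1: 1}
print(is_feasible([(0, 5, 2, 7), (0, 6, 3, 1)], q))   # both restrictions hold
print(is_feasible([(1, 5, 2, 7), (1, 6, 3, 1)], q))   # index 1 used twice, q[1] == 1
```

Recombination and mutation may produce infeasible offspring, so such a check (or a repair step) would have to run after each operator.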
In this work, we propose to use the fitness function in the form of the product (26) of three terms: the first term, ϒ(U ), estimates the solution quality in terms of minimizing microfile distortion; the second term estimates the solution quality in terms of protecting outliers in the quantity signal; and the third term is a penalty against obtaining individuals with too many rows. We propose an expression for the first term of (26) in which C max is the greatest possible value of the cumulative influential metric (25), and M i j is the operator yielding the jth record of the submicrofile M i , i = 1, 2, . . . , l p . Other terms of the fitness function can be chosen depending on the TPGA at hand.
In this work, we use the following recombination operator R(U i 1 , U i 2 ). It generates two random crossover points k 1 ∈ [0, Q i 1 ] and k 2 ∈ [0, Q i 2 ], splits each parent at the appropriate point, exchanges the tails between them, and thus creates the offspring. This operator has to be applied with a high probability p c .
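The tail exchange described above can be sketched in a few lines of Python. Individuals are assumed to be lists of rows, and the function name is hypothetical; since the crossover points are drawn independently, the offspring generally have different numbers of rows than their parents.

```python
import random


def tail_exchange(U1, U2, rng):
    """Split each parent at its own random point and swap the tails."""
    k1 = rng.randint(0, len(U1))   # crossover point in the first parent, 0..Q1
    k2 = rng.randint(0, len(U2))   # crossover point in the second parent, 0..Q2
    return U1[:k1] + U2[k2:], U2[:k2] + U1[k1:]


rng = random.Random(42)
parent_a = [(0, 1, 4, 2), (0, 3, 5, 9), (1, 7, 4, 4)]
parent_b = [(2, 2, 6, 1), (3, 8, 6, 5)]
child_a, child_b = tail_exchange(parent_a, parent_b, rng)
print(len(child_a) + len(child_b))   # total number of rows is preserved
```

The offspring may violate the restrictions imposed on U, so a feasibility check or repair is needed afterwards.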
We also use the mutation operator that is a superposition M = M 4 • M 3 • M 2 • M 1 of the following operators:
1. M 1 is a swap mutation operator (Syswerda 1991) applied with a small probability p m 1 to the first column of U. Each pair ⟨u i1 , u i2 ⟩ needs to be preserved ∀i = 1, 2, . . . , Q.
2. M 2 is also a swap mutation operator applied with a small probability p m 2 to the third column of U. Each pair ⟨u i3 , u i4 ⟩ needs to be preserved ∀i = 1, 2, . . . , Q.
3. M 3 is a random resetting mutation operator (Eiben and Smith 2015, p. 43) applied with a small probability p m 3 to the second column of U.
4. M 4 is a random resetting mutation operator applied with a small probability p m 4 to the fourth column of U.
3. If r ≤ p mem , assign to u i4 the index of a record from M u i3 closest to the record defined by u i2 from M u i1 in terms of (25). Otherwise, assign to u i2 the index of a record from M u i1 closest to the record defined by u i4 from M u i3 in terms of (25). 4. Go to step 2.
Other MA components, such as selection, initialization, termination, population size etc. should be chosen individually for each TPGA to be solved.

Problem definition and microfile harmonization
To illustrate ideas developed in this work, we decided to set a task of violating anonymity of a group of regionally distributed military personnel in the U.S. Outliers in quantity signals representing such a distribution might point to sites of military facilities, some of which might potentially be classified.
We decided to choose the 1 % sample microfile of the American Community Survey (ACS) conducted in 2013 available from the IPUMS-International Project (Ruggles et al. 2010) as the microfile M we would like to violate group anonymity in. This microfile contains ρ = 1,380,924 records.
The microfile contains attributes Place of work: state, 1980 onward and Place of work: PUMA, 2000 onward (where PUMA stands for Public Use Microdata Area), that, if concatenated, give a unique code of a PUMA where a respondent works. We decided to replace these attributes with a single one called Place of work by concatenating the values of the attributes for each microfile record. The newly obtained attribute plays the role of the parameter attribute for our task.
The microfile also contains l = 1 vital attribute Occupation, SOC classification (where SOC stands for the 2010 Standard Occupational Classification system), which enables us to uniquely identify all the military personnel M v in the microfile, ρ v = 5,519.
We decided to choose the 5 % sample microfile of the 2000 U.S. Census also available from the IPUMS-International Project (Ruggles et al. 2010) as the auxiliary microfile M . This microfile contains ρ = 6,309,848 records. Since this microfile also contains attributes Place of work: state, 1980 onward and Place of work: PUMA, 2000 onward, we decided to replace them with the Place of work attribute in the same way as described above.
This auxiliary microfile satisfies all the necessary requirements:
• records in both microfiles are drawn from sufficiently similar distributions, under the assumption that the demographics of respondents have not changed much over 13 years;
• both microfiles contain almost identical attributes, with the exception of several technical ones.
In our example, we performed the following harmonization:
• we replaced the Occupation, SOC classification attribute in both microfiles with a new one, Military Personnel, which has only two values, 0 and 1. The value 1 was assigned only to those records that had one of the values of attribute Occupation, SOC classification presented in Table 2;
• we removed all attributes from both microfiles except for Military Personnel, Place of work, and t H = 13 basic harmonized attributes, which we consider to be useful for building a fuzzy model of a group.
Information about each basic harmonized attribute w H b i , i = 1, 2, . . . , 13, is given in Table 3, where C stands for a categorical attribute, O stands for an ordinal one.

Input variables identification
In this section, we will discuss linguistic variables L j corresponding to basic harmonized attributes w H b j , j = 1, 2, . . . , 13. Each L j bears the name of the corresponding attribute w b j . Ranges [l L j , u L j ] of acceptable values of base variables for each L j , j = 1, 2, . . . , 13, are given in Table 4.
We decided not to define values for variable L 13 . Its range of acceptable values was used to remove unacceptable records from the microfiles, but the attribute itself was not involved in the fuzzy rules evolved using the evolutionary algorithm.

Generating fuzzy rules by the evolutionary algorithm
In order to evolve fuzzy rules to obtain the auxiliary quantity signal for the practical example, we applied the evolutionary algorithm with the following parameters:
• the population size µ was fixed at 100;
• on each iteration, we replaced the 40 worst-fit individuals with new ones obtained by applying the recombination and mutation operators;
• we applied the recombination operator with the probability p c = 1.00, and the mutation operator with the probability p m = 0.05;
• we performed 10 separate runs of the evolutionary algorithm, each of which lasted for N = 100 generations.
Of all the fuzzy rules obtained in all generations, we selected the fuzzy rules whose RCF was greater than γ = 0.750 and whose support was greater than κ = 0.001. After that, we removed those rules that are more specific versions of the more general ones in the set, as described previously. In Table 5, we present all of the resultant rules. For each fuzzy rule from the rule base, we specify its discriminative factor, relative confidence factor, and support. We present all the numerical values with three significant digits, although the calculations were carried out with a much higher precision.
As we can see, all of these rules share one common characteristic, i.e., their value of variable L 8 is Walked, which means that all the respondents considered by the fuzzy rules as military personnel walked to their work rather than used a car or other means of transportation. Judging from the values of other variables, we can make general conclusions that these respondents typically are young males with medium yearly income.

Disclosing outliers in the group distribution using evolved fuzzy rules
To demonstrate how the evolved fuzzy rules can be used to disclose outliers in the quantity signal, we will first apply them to the auxiliary microfile, and then proceed to disclosing outliers in quantity signals obtained for the main microfile.
Let us consider for illustration purposes the state of New York. In Fig. 1, we present both the quantity signal (solid line) and the auxiliary quantity signal (dashed line). Values i = 1, 2, . . . , 61 over the x axis stand for the ith PUMA of the state of New York. The list of PUMAs can be found on the IPUMS-International website (PUMAs and Super-PUMAs 2000). The values over the y axis stand for:
• in case of the quantity signal, the number of military personnel working in a corresponding PUMA;
• in case of the auxiliary quantity signal, the sum of all membership grades assigned to the respondents in a corresponding PUMA by the evolved fuzzy rules.
Applying MTTT with α = 0.01 to the quantity signal, we can obtain the following index set:
Analysis of the Report of the Deputy Under Secretary of Defense (2000) permits us to conclude that most of the indexes obtained by MTTT do not correspond to sites of military bases. Further on, we will assume that OUT e (q NY 2000 ) = {5, 42}, because:
Applying MTTT with α = 0.01 to the auxiliary quantity signal, we can obtain the following index set:
Taking into account the previous discussion, we can assume that OUT e (q aux NY 2000 ) = {5, 42}. The equality OUT e (q NY 2000 ) = OUT e (q aux NY 2000 ) indicates that the sites of military facilities can be easily disclosed even if the vital attributes are removed from the microfile.
…, 4, 5, 7, 8, 16, 24, 28, 29, 30, 42, 44, 49, 51, . . . , 55, 59
In a similar fashion, we can analyze all the other states and determine undisclosed and false outliers. The overall figures are given in Table 6. We included in the table only those states where the number of working military personnel exceeds 0.5 % of all the military personnel in the original harmonized auxiliary microfile M H , i.e., the value 95.245 = 0.005 · 19,049.
The confusion matrix (18) for this example is
The tests (
Applying MTTT with α = 0.01 to the quantity signal, we can obtain the following index set:
As pointed out before, analysis of (Deputy Under Secretary of Defense 2000) permits us to conclude that most of the indexes obtained by MTTT do not correspond to sites of military bases. Further on, we will assume that OUT e (q NY 2013 ) = {5, 29}, because:
Applying MTTT with α = 0.01 to the auxiliary quantity signal, we can obtain the following index set:
Taking into account the previous discussion, we can assume that OUT e (q aux NY 2013 ) = {5, 29}. That is, both outliers are clearly visible in the auxiliary quantity signal as well.
Analogous results for other states are given in Table 7. We once again included in the table only those states where the number of working military personnel exceeds 0.5 % of all the military personnel in the original harmonized microfile M H , i.e., the value 27.595 = 0.005 · 5,519.
This leads to the following fitness function: where C max = 299; W k , k = 1, 2, . . . , 13, is the kth basic attribute; M j (i, W k ) returns the value of the attribute W k of the ith record in M j ; ZMF (x, a, b) is a function defined as Each row in (28) corresponds to a single part of the fitness function (26).
To simplify matters, we considered all the basic attributes to be categorical ones with the following parameters of (25): γ k = 1 ∀k = 1, 2, . . . , 13, χ 1 = 1, χ 2 = 0. The metric (25) defined this way shows the number of attribute values that need to be physically altered during one swap of the records between the submicrofiles.
We decided to apply tournament selection (Brindle 1981) as an efficient and easy-to-implement selection operator, with the tournament size 5. Other algorithm parameters were chosen as follows: µ = 100, = 40, p c = 1, p m 1 = p m 2 = p m 3 = p m 4 = 0.001, p mem = 0.75. We terminated the algorithm after having obtained 1000 consecutive populations.
The population was initialized by randomly generating matrices with different numbers of rows. Elements of the first column were generated with probabilities proportional to the values of the corresponding elements of q. Elements of the third column were generated with probabilities proportional to the total numbers of records in corresponding submicrofiles.
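A possible reading of this initialization is sketched below in Python. The text fixes only how the first and third columns are sampled; the upper bound on the number of rows, the uniform choice of record indexes in the second and fourth columns, and all names are assumptions. The sketch does not enforce the restrictions imposed on U.

```python
import random


def init_individual(q, sizes, max_rows, rng):
    """q[i]: quantity signal value of submicrofile i; sizes[i]: its total number of records."""
    n_rows = rng.randint(1, max_rows)                          # assumed range of row counts
    indexes = range(len(q))
    first = rng.choices(indexes, weights=q, k=n_rows)          # proportional to q
    third = rng.choices(indexes, weights=sizes, k=n_rows)      # proportional to sizes
    # Record indexes are drawn uniformly (assumption); a zero weight is never selected,
    # so q[i] and sizes[j] are positive here.
    return [(i, rng.randrange(q[i]), j, rng.randrange(sizes[j]))
            for i, j in zip(first, third)]


rng = random.Random(7)
q = [12, 0, 3, 40]
sizes = [500, 800, 300, 900]
U = init_individual(q, sizes, max_rows=6, rng=rng)
print(len(U))
```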
During the MA run, we applied linear fitness scaling in the form presented in Goldberg (1989, p. 79) to prevent premature convergence. We also multiplied the mutation probabilities by a factor of 10 whenever the standard deviation of the population fitness values dropped below 0.03.
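Both mechanisms can be sketched in Python as follows. The scaling follows the usual linear form f ′ = a f + b that keeps the average fitness unchanged and maps the maximum to a multiple of the average; the multiplier `c_mult = 2.0`, the use of the population standard deviation, the cap at 1.0, and the function names are assumptions rather than settings reported in this work.

```python
from statistics import pstdev


def linear_scaling(fitness, c_mult=2.0):
    """Linear fitness scaling: average preserved, maximum mapped to c_mult * average
    whenever this keeps all scaled values non-negative."""
    f_avg = sum(fitness) / len(fitness)
    f_max, f_min = max(fitness), min(fitness)
    if f_max == f_avg:                       # all values equal: nothing to scale
        return list(fitness)
    if f_min > (c_mult * f_avg - f_max) / (c_mult - 1.0):
        delta = f_max - f_avg
        a = (c_mult - 1.0) * f_avg / delta
        b = f_avg * (f_max - c_mult * f_avg) / delta
    else:                                    # scale as much as possible, minimum maps to 0
        delta = f_avg - f_min
        a = f_avg / delta
        b = -f_min * f_avg / delta
    return [a * f + b for f in fitness]


def mutation_probabilities(base, fitness, threshold=0.03, factor=10.0):
    """Boost mutation probabilities when the fitness values have nearly converged."""
    boost = factor if pstdev(fitness) < threshold else 1.0
    return [min(1.0, p * boost) for p in base]


scaled = linear_scaling([1.0, 2.0, 3.0, 4.0, 10.0])
print(scaled)
print(mutation_probabilities([0.001] * 4, [0.50, 0.51, 0.50, 0.49]))
```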