Identification of differentially expressed genes in microarray data in a principal component space

Microarray experiments are often conducted in order to compare gene expression between two conditions. Tests to detected mean differential expression of genes between conditions are conducted applying correction for multiple testing. Seldom, relationships between gene expression and microarray conditions are investigated in a multivariate approach. Here we propose determining the relationship between genes and conditions using a Principal Component Analysis (PCA) space and classifying genes to one of two biological conditions based on their position relative to a direction on the PC space representing each condition.


Introduction
Since the introduction of microarrays in the nineties, several methods have been developed for their analysis and especially for the detection of differentially expressed genes (Dudoit et al., 2002). The analysis of diverse microarray data sets has allowed a better understanding of biological phenomena at the molecular level. Nowadays, more and more microarray experiments are available and multivariate data analysisareneeded to process them and extract useful biological knowledge. Multivariate methods are useful for this task, as they allow reducing dimensions and revealing data structure (Lebart et al., 1995), which means that individuals and variables can be separated on a principal component space. Nevertheless, microarray data often appear to be very noisy and descriptive multivariate methods do not allow extracting enough knowledge from the data to explain biological phenomena (Dudoit et al., 2002). Due to this limitation, univariate methods concentrating on each gene at a time are used to detect differentially expressed genes (Tusher et al., 2001). These methods account for the context of high dimensions and high variability by applying multiple test corrections at different levels of hypothesis testing. Nevertheless, the multivariate context and possible emerging properties are lost (López-Kleine et al, 2012).
One multivariate method, that accounts for high dimension and improves data structure detection by combining dimension reduction and prediction is Discriminant Analysis on Principal Components (DAPC) (Jombart et al., 2010). This method applies a Discriminant Analysis (DA) on the principal component space (PCA) in order to separate individuals in populations without having an a priori on the data structure. Grouping needed for DA is obtained previously by k-means. Moreover, this method allows a probabilistic assignment of individuals to the different groups that are obtained. Initially, the authors have applied DAPC to Single Nucleotide Polymorphism (SNP) allele frequency data.
As for SNPs data, in which each SNP on the genome can be seen as a new variable, microarray data can be understood as profiles: each row (gene) represents an individual and the column values generate the expression profiles through microarray experiments (here: replicates of two conditions). Data are intensity measures of messenger RNA present for each gene at the given biological condition. So, multivariate methods will be used to reveal position of individuals (genes in this case) relative to the replicates of biological conditions (variables). A multivariate method that could deal with the mentioned noise and the multivariate structure of microarray data is the Principal Component Analysis. Once microarray condition replicates (variables) are reduced to a smaller number of PCs, each replicate can be represented through a linear combination on the PC space and the closeness of genes to each biological condition can be determined taking into account the factorial map of variables obtained through the PCA.
The aim of this work was to propose a closeness measure C d that allows determining genes that belong to each one of two biological conditions, each of which is characterized by a direction D on the PC space. This methodology can be extended to more than two conditions because on a principal component space several groups of variables could be detected and classification of genes to more than one group can be undertaken applying the same methodology. The obtained belonging to a condition is comparable to a classification into a group obtained by DAPC. We also compared classifications combining these two methods. We applied these methodologies to real and simulated data and conclude that the best results are obtained for C d or the combination of both methodologies and that the performance depends on the data structure.

Data
Microarray data sets were obtained from the Tomato Expression Database website (http://ted.bti.cornell.edu/). In this study we used experiments that were carried out using the TOM1 DNA chip. For the differential expression analysis we focused on the experiments carried out by Christine Smart and collaborators (Accession number E022: http://ted.bti.cornell.edu/cgi-bin/TFGD/miame/experiment.cgi?ID=E022) where gene expression profiling of infection of tomato Phytophthora infestans in the field was studied. The goal of this experiment was to gain insight into the molecular basis of the compatible interaction between P. infestans and its hosts. We used the data from that experiment in order to apply the new C d closeness measure or the previously described DAPC or a combination of both for detecting genes that were differentially expressed in P. infestans inoculated plants vs. non inoculated plants in the field. For this comparison four time points were available at 0, 12, 36 and 60 hours with 8 replicates of each condition. For analysis here we focused on the 16 experiments available for the last time point (60 hours after inoculation).

C d closeness measure
In order to obtain a measure that expresses the distance of a gene with respect to a direction on the principal component space (PC) in R n we propose the measure C d . This measure will give information about how close a gene is to a given direction that represents a given biological condition for which several measures (replicates) are available. This measure is proposed based on the orthogonal projection (closest distance from a point to a line) and norm. We propose to use the projection of the gene vector on the direction in order to take account for the distance of the gene to the origin and therefore highlight genes that have an expression that differs from zero and from the mean expression behavior. Gene expression profiles containing more noise than signal will tend to be placed close to the origin and gene expression profiles that are not related to the direction of interest will be far from the direction and both cases should therefore have a low C d value. Moreover, genes having expression profiles close to one direction will be far from the origin and close to the direction (Figure 1). The ratio between the norm of the projection (of the gene on the direction representing a condition) and the projected vector norm tends to 1 when the angle between the projection and the direction vector tends to 0. Basically, C d will express values of 0 when a gene is far from the vector representing a given condition and 1 or −1 when a gene is close to the condition. The condition being here represented by a vector constructed as a linear combinations of all replicates of the condition of interest. Therefore, genes will be identified as belonging more to one or another condition and will be classified as behaving differentially between the biological conditions measured, based on theC d closeness measure (Figure 1).
The C d closeness measure is given by:  Figure 1 Schematic representation of the directions and norm of the orthogonal projection of genes used to compute Cd. Gene g3 will have the highest value of Cd as its position is close to the direction and far away from the origin. Gene g1 will have the second highest value of Cd. Gene g2 will have a negative value of Cd. Genes g4 and g5 will have values of Cd close to zero.
<>indicates the usual inner product. sign (.) returns the sign (1 or −1) of its input and → v 2 : The C d closeness measure has three important zones: ship with the direction and therefore no differential expression comparing both conditions. C d close to one indicates a strong relationship to one condition and therefore a differential expression (activation of the gene in the condition to which it is close). This occurs when → v∝ → d i and → v→1 are values for which the vectors have a positive cosine to the projection on → di, and their value tends to 1 when the angle between then becomes shorter. On the other hand, when the angle is large, the C d value will tend to 0.

C →d
indicate values for vectors that coincide with the direction → di; as higher is their norm (and they are farer away from the origin), the C d closeness measure will be closer to 1.
It is also important to mention that C d could take negative values when the genes are placed on the other side of the direction. This means that −1 and 1 have the same meaning, a close relationship to the direction. For an example of GENES with high C d values see figure 2.
Most of the genes do not change drastically when biological conditions change. So, in order to limit the search, 25\% of genes placed in the center between conditions in the PC space are not considered for classification. Genes that have norms below the first quartile (first 25\% smallest norms) are filtered fixing their C d to 0.

False positive rate measure
Although the false discovery rate (FDR) was proposed for multiple hypothesis testing (Benjamini and Hochberg, 1995) in order to control false rejecting of null hypotheses it can also be used in the case of classification. FDR is defined by Benjamini and Hochberg (Benjamini and Hochberg 1995) where V is the number of hypotheses declared significant and that are in fact true and R is the total number of hypotheses declared significant (total hits), therefore the FDR is the expectation of V/ R if R is not equal to zero and is zero otherwise.
Here, the comparison between scenarios based on simulated data is done using measures as true positives (Tp), true negatives (Tn), false positives (Fp) and false negatives (Fn) and using the FDR is playing the role of V and FP+Tp is playing the role of R in the proper definition of the FDR. The classification of positive or negative hits was done using different C d thresholds.

General case simulation
Data were simulated for different case scenarios in order to investigate if C d (and DAPC) are useful for detecting differentially expressed genes between conditions (here two). Data under four different scenarios (favorable (F), normal (N), unfavorable (U) and very bad (B)) were simulated. These scenarios were simulated confounding progressively individuals (genes, labeled as Sim1) and variables (replicates of biological conditions, labeled as Sim2), Figures 3 and 4. When analyzing real data both types of confusions are common: one in which genes are not structured but are, on the contrary very similar and another in which microarray conditions do not differentiate between each other. The goal of investigating performance on different scenarios was to characterize limitations of the DAPC and C d closeness measure.

Data generation
For the Sim1 data generation it was assumed that microarray data have a normal distribution (which is generally true after calibration and transformation of gene expression data). Here, microarray data belonging to two groups with differential expression were generated. So, a matrix X with n 1 +n 2 =n rows (genes) and p 1 +p 2 =p columns, where n 1 andn 2 , are the number of genes belonging to each condition and p 1 and p 2 are the number of replicates of each condition. First, a sample from a multivariate normal distribution with mean vector 0 and correlation matrix approximately I is taken. The Figure 2 Example of genes with Cd> 0.9 (red dots) and therefore found near the direction (blue arrow) representing one biological condition for which replicates of microarray measurements are available.
choice of the correlation matrix being exactly I or approximately I allows choosing the level of noise to be introduced in the data. Second, for the block X 11 =X[1: n 1 ,1p _1 ] of the matrix values sampled from a multivariate normal distribution with mean vector 0 and correlation matrix W 1 were added to the previously generated values originating distortion on the expressions for one of the two conditions (Sim1). Values sampled from a multivariate normal distribution with mean vector 0 and correlation matrix W 2 are added in the same way to the block X 22 =X[1:n 1 ,1p _1 ].
For the simulation trying to confound replicates of the two conditions (Sim2) a truncated normal distribution was used to add to the replicates of the two conditions. This second method is an adaptation of the simData function in the R library optBiomarker (Khondoker et al., 2009) but does not differ in its methodology from the data generation used for Sim1.
The values of mean and correlation matrix for the four scenarios were chosen based on the separation of genes observed on the plot of the first two principal components (PC) when a Principal Component Analysis was conducted on the simulated data as shown in Figures 3  and 4. Sim1 is a simulation that confounds gene expression between two conditions progressively as scenarios go from favorable (F) to very bad (B) and Sim2 is a simulation Figure 4 Factorial plot of the first two PCs of simulated data starting with a favorable scenario in which conditions are not confounded on the left and ending with a very bad scenario (of high confusion) on the right (Sim2). that confounds replicates of both conditions progressively as scenarios go from favorable (F) to very bad (B) as shown in Figures 3 and 4.

Identification of differentially expressed genes through classification
Here we propose four possible classifications using C d and DAPC in order to classify genes into one of two conditions and therefore detect genes that are differentially expressed. The classifications we compared are the following: -DAPC: Classification only by DAPC specifying a classification into three groups. These three groups are thought to group genes into a class of genes have similar expression across the different conditions and are clustered around the origin and two groups of genes far from the origin clustering genes into one of the two experimental conditions that are compared. -C d with DAPC: Classification by DAPC of the genes detected by C d to be in a zone of interest in the PC space (C d >threshold). This classification is based on the groups proposed by DAPC and additionally takes into account only genes with a C d higher than certain threshold to be fixed by the researcher. The values of C d and therefore their thresholds can be fixed individually for each condition. -C d : Classification of the genes detected by C d to be close to one of the two conditions. This classification is based on the genes with C d higher than a certain threshold without using the groups proposed by DAPC. Here again, the values of C d and therefore their thresholds can be fixed individually for each condition. -C d inter DAPC: Intersection of genes obtained by DAPC and C d separately. Here the groups are constructed with the genes that are classified independently to belong to the same biological condition by DAPC and C d and then, genes classified by both methods, are used to identify the belonging to one of the two conditions.

FDR estimation
As simulated data was used, it is possible to identify genes correctly belonging to each of the two conditions analyzed. The FDR was calculated only with the genes classified into one of the two biological conditions, which means that the genes which were classified in a third group (DAPC-labeled as ''no differential gene expression'') were not taken in to account at the moment of the FDR estimation. It is worth to point out that the FDR is often used in literature to report reliability of classification into one category only. Here it is used to report a global accuracy measure of classification in two groups. In this specific case the true positives (Tp) are the number of accurate classifications in any of the biological conditions. False positives (Fp) are the number or genes that are wrongly classified into the category ''no differential gene expression''.

Directions representing conditions
Directions in the PC space are obtained as a mean vector of variable vectors in the PC space (Figure 4). The red arrows represent the variables (biological replicates of one condition) and blue arrows represent replicates of another condition. The mean vector of each group will represent the condition and will be used as the direction to calculate the C d closeness measure. This can only be assumed if a previous PCA confirms that replicates of conditions are not confounded (Lebart et al., 1995), but are linearly correlated to each other and less correlated to replicates from the other condition. An example of directions (mean of vectors representing replicates of the same condition) is shown in Figure 5.

C d closeness measure threshold
In order to determine the best value of C d for an accurate classification (based in simulated data) the following steps were undertaken.
-Selecting the scenario (F to B). -Generating data for each combination of parameters as described before.
-Obtaining the coordinates of the genes in the PC space.
-Obtaining C d and setting C d =0all genes with a norm less than the 25th percentile of norms. -For each threshold of C d on starting from 0.970 to 0.999 with steps of 0.004, FDR was calculated for genes with | C d |> threshold.
-Obtaining the FDR estimation as the mean of Fp FpþTp for each set of parameters and C d thresholds (0.970 to 0.999).
-Obtaining the best C d threshold (maximum threshold that minimizes the FDR). -Repeating N times to obtain a sample of FDR estimations for the whole set of parameters and best thresholds.
-Obtaining the mean of the sample for each of the parameter of interest.
As this is a mean of means it can be called a global mean or global average (avg).
-Repeat all steps for each scenario and for both simulated data sets (Sim2 and Sim1) Figure 5 Corcircle of a PCA conducted on the tomato microarray data set showing that replicates of the same biological conditions (Ni and I) are strongly correlated and that a mean direction representing each condition can be obtained. Blue arrow represents Ni and red arrow represents condition.

Identification of differentially expressed genes in simulated data
The main purpose of this work was to obtain a differential gene detection using a multivariate method. This has been conducted as a classification of genes in the PC space and their belonging to one of two conditions. We compared and combined the here proposed closeness measure C d with the previously proposed DAPC (Jombart et al., 2010) in order to evaluate accuracy of these methods based on simulated data.
Simulated data allowed studying the accuracy of both methods and their combination given different degrees of gene and condition confusion. The four scenarios and four methods are explained in previous sections. For data, in which genes do not have a clear structure in separate groups (sim2), the C d closeness measure posterior to a PCA is the best method, and for data in which genes form observable groups (sim1) C d closeness measure with DAPC performs better. The four classifications used are compared here based on their FDR, which increases when simulation scenarios tend to unfavorable and therefore more confusion is present (Figure 6). The obtained FDR values are observable in Tables 1 2, 3 4. These results allow us to state that, on one hand, when data shows an underlying structure that allows grouping genes (individuals) into clearly defined groups, the best method in order to obtain a reliable set of genes classified on the biological conditions is to perform DAPC and select the genes whose C d closeness measure is greater than a particular threshold. In our example we show that for a threshold of 0.998, the obtained FDR is lower than 0.3 in the worst scenario of simulated data. On the other hand, when biological conditions (variables) are confounded, the best method to classify genes is using only C d . For a given threshold 0.987, we obtained FDR not greater than 0.12 in the worst scenario.
Identification of differentially expressed genes in real data (tomato microarray data set) PCA First a PCA is conducted in order to determine the biological conditions that form directions on the PC space. As mentioned before, if this is not possible and replicates of biological conditions are confounded in directions, the PC space will not allow identifying directions and there for neither C d nor DAPC will be suitable. For the real tomato microarray data example analyzed here, biological conditions clearly differentiate between: Not inoculated (Ni) and inoculated plants (I) ( Figure 5) and therefore identifying differentially expressed genes through the here proposed methods is suitable.

C d applied classification to the tomato dataset
Taking into account the data structure of the tomato data set ( Figure 5) and the results obtained through simulation, the method that should be applied is a classification through C d alone. Biological conditions are easy to represent in the factorial plane ( Figure 5) and there is not a particular case of grouping structure within the genes, which set us in the case of a Sim2 normal scenario (Figure 4). With this classification an FDR value no greater than 0.01456 is expected (conditions of a normal scenario). In this classification 594 genes were classified in the I group (red) and 295 genes were classified in the Ni group (Blue) among 13440 genes evaluated (Figure 7). Much more genes are classified in the I group than in the Ni group. This results are in accordance with previous results obtained analyzing this data using SAM (Tusher et al., 2001) methodology (unpublished) where for an FDR of 0.03, 734 genes were found to be upregulated at time-point 60 hours and all 594 genes identified here make part of them. No genes were found to be downregulated in this previous work. This result indicates that C d is more sensitive than classical methods but also that care has to be taken with thresholds in order to avoid false positives. Genes classified by C d are an indication of belonging to one or the other condition but are not a statistical proof of differential expression.

Discussion
The proposed closeness measure C d presents several advantages to classical univariate and classification methods. It allows identifying genes that belong to one or the other biological conditions and therefore have differential expression between these conditions. The fact that conditions are used as a reference for grouping, allows enhancing sensitivity for these detection. Moreover, FDRs are lower when our method is used for identification of differentially expressed genes, but this is just an indicative value, because the methodology presented here is not inferential (no P-value or FDR can be calculated for each classification). Multiple testing is avoided and a global approach allowing the detection of emerging properties due to multivariate data structure is possible with this methodology.
In general, clustering methods tend to perform their analysis over the data structure only (Lebart et al., 1995). This means that they try to create groups under distance  and variance criteria (Jombart et al., 2010) among genes and groups, but do not take into account the variables, conditions or directions. Having a reference in the variable space makes a real difference in classification and proposes a precise and accurate classification. In this paper we illustrated the principal difference of using a method that classifies under data structure criteria like DAPC (Jombart et al., 2010) and a method that classifies under a direction or condition criteria(C d ) and a combination of both. The consequence of using one method or another is shown in the FDR obtained (Tables 1  and 2), which can be very high (around 0.5 or 0.3, Figure 6) instead of 0.3 or 0.1 (for the worst scenario). The improvements are especially noticeable when biological conditions are confounded, which makes it very difficult for classical classification methods to detect differences in gene expression when comparing two conditions because no reference is present. By tuning an appropriate C d threshold, it is still possible to detect differentially expressed genes even though some variables of the same condition are confounded. Moreover, the use of C d alone or a combination with DAPC allows using the most appropriate method depending on data structure. Another feature of the presented work is the possibility of reducing the amount of genes classified to the biological conditions by increasing the C d threshold which would lead to groups of genes for which there is more certainty of differential expression, and moreover reduced groups of genes to perform lab proves.
Finally, we propose using the methods C d with DAPC when there is a data structure, expecting in the worst case an FDR around 0.3 and C d when there is not a particular data structure but there exist a sense of biological conditions or directions, expecting in the worst scenario an FDR around 0.1, for classification of biological differentially expressed genes.  Figure 7 Result of the Cd classification of the genes of the tomato data set using a threshold of 0.99.