An omnibus permutation test on ensembles of two-locus analyses can detect pure epistasis and genetic heterogeneity in genome-wide association studies
SpringerPlus volume 2, Article number: 230 (2013)
Abstract
This article presents the ability of an omnibus permutation test on ensembles of two-locus analyses (2LOmb) to detect pure epistasis in the presence of genetic heterogeneity. The performance of 2LOmb is evaluated in various simulation scenarios covering two independent causes of complex disease where each cause is governed by a purely epistatic interaction. Different scenarios are set up by varying the number of available single nucleotide polymorphisms (SNPs) in data, the number of causative SNPs and the ratio of case samples from two affected groups. The simulation results indicate that 2LOmb outperforms multifactor dimensionality reduction (MDR) and random forest (RF) techniques in terms of a low number of output SNPs and a high number of correctly identified causative SNPs. Moreover, 2LOmb is capable of identifying the number of independent interactions in tractable computational time and can be used in genome-wide association studies. 2LOmb is subsequently applied to a type 1 diabetes mellitus (T1D) data set, which is collected from a UK population by the Wellcome Trust Case Control Consortium (WTCCC). After screening for SNPs that are located within or near genes and exhibit no marginal single-locus effects, the T1D data set is reduced to 95,991 SNPs from 12,146 genes. The 2LOmb search in the reduced T1D data set reveals that 12 SNPs, which can be divided into two independent sets, are associated with the disease. The first SNP set consists of three SNPs from MUC21 (mucin 21, cell surface associated), three SNPs from MUC22 (mucin 22), two SNPs from PSORS1C1 (psoriasis susceptibility 1 candidate 1) and one SNP from TCF19 (transcription factor 19). A four-locus interaction between these four genes is also detected. The second SNP set consists of three SNPs from ATAD1 (ATPase family, AAA domain containing 1).
Overall, the findings indicate the detection of pure epistasis in the presence of genetic heterogeneity and provide an alternative explanation for the aetiology of T1D in the UK population.
Background
Epistasis or gene-gene interactions are among the many causes of complex diseases (Moore 2005). In the simplest form, epistasis can be described by two-locus disease models, in which both loci jointly contribute towards the disease susceptibility (Neuman and Rice 1992; Schork et al. 1993). Many attempts have been made to provide consistent definitions of epistasis (Cordell 2002; Hallgrímsdóttir and Yuster 2008; Li and Reich 2000; Marchini et al. 2005; Musani et al. 2007; Verhoeven et al. 2010). Regardless of preferred definitions, a common ground for describing epistasis covers an effect deviating from the combined individual effects of each genetic factor. In other words, epistasis describes an effect that departs from a linear addition of individual effects (Fisher 1918). The detection of epistasis hence provides necessary information complementary to that gained through single-locus analysis.
With the availability of genome-wide genotyping technologies, a large number of single nucleotide polymorphisms (SNPs) can be considered during epistasis detection (Heidema et al. 2006; Motsinger et al. 2007; Van Steen 2012). At present, the most feasible strategy for genome-wide epistasis detection involves two-locus analysis (Evans et al. 2006; Gayán et al. 2008; Ionita and Man 2006; Liu et al. 2011; Marchini et al. 2005; Sha et al. 2009; Wongseree et al. 2009). The detection may concentrate on all possible SNP pairs (Gayán et al. 2008; Liu et al. 2011; Marchini et al. 2005; Wongseree et al. 2009) or only SNP pairs where at least one SNP in each pair exhibits a marginal single-locus effect (Evans et al. 2006; Ionita and Man 2006; Liu et al. 2011; Marchini et al. 2005; Sha et al. 2009). Exhaustive two-locus analysis is generally required when pure epistasis (Culverhouse et al. 2002) is present. This is because each interacting SNP in a purely epistatic model exhibits no marginal single-locus effect. Although the importance of pure epistasis remains in question (Cordell 2009), many genetic association studies reveal that putatively pure epistasis plays a role in determining disease susceptibility (Cho et al. 2004; Jiang and Neapolitan 2012; Zhang et al. 2008).
In addition to epistasis, two-locus (Hallgrímsdóttir and Yuster 2008; Li and Reich 2000; Neuman and Rice 1992; Schork et al. 1993) and multi-locus disease models (Edwards et al. 2009; Lunetta et al. 2004) also describe other phenomena. One particular phenomenon that makes it difficult to capture the genetic factors responsible for complex diseases is genetic heterogeneity. Basically, genetic heterogeneity models define independent effects that cause the same complex disease. Since it is impossible to know beforehand which independent effect predisposes each affected individual participating in a genetic association study, the presence of genetic heterogeneity always leads to a reduction in statistical power to detect causative SNPs (Edwards et al. 2009; Lunetta et al. 2004; Meng et al. 2009; Ritchie et al. 2007; Ritchie et al. 2003).
From a machine learning viewpoint, the identification of causative SNPs among available SNPs in genetic association studies can be treated as an attribute selection problem. The aim of attribute selection is to identify informative attributes necessary for the correct classification of recruited samples. Saeys et al. (2007) categorise attribute selection techniques into three main approaches: filter, wrapper and embedded approaches. The filter approach aims to identify SNPs associated with the disease according to a statistical or mathematical measure. The wrapper approach attempts to search for the best SNP combination that provides the highest prediction accuracy dictated by a classifier. The embedded approach uses available SNPs to construct a prediction model while simultaneously prioritising informative SNPs.
Among the wrapper techniques, one which has been shown to be capable of detecting pure epistasis in the presence of genetic heterogeneity is the multifactor dimensionality reduction (MDR) technique (Edwards et al. 2009; Ritchie et al. 2007; Ritchie et al. 2003). MDR searches for the best SNP combination that yields the highest prediction accuracy according to the rules governed by multi-dimensional decision tables (Ritchie et al. 2001). Although the detection power of MDR is high, the demonstration has been limited to simulations consisting of two independent purely epistatic two-locus interactions. Moreover, MDR is a time-consuming technique and hence requires large computational efforts for multi-locus analysis in genetic association studies with a large number of SNPs (Edwards et al. 2009; Kwon et al. 2012; Pattin and Moore 2008; Ritchie et al. 2001; Wongseree et al. 2009).
Similar to MDR, a random forest (RF) is a technique, in this case an embedded one, which has also been shown to be capable of detecting epistasis in the presence of genetic heterogeneity (Lunetta et al. 2004; Meng et al. 2009). RF consists of multiple decision trees in which each tree is randomly constructed from available SNPs. Causative SNPs can be identified by permuting the genotype of each SNP and observing how this affects the overall prediction accuracy (Breiman 2001). The detection power of RF has been demonstrated through simulations involving multiple independent epistatic multi-locus interactions. Nonetheless, the previous studies concentrate on epistasis with marginal single-locus effects. As a result, the ability of RF to detect pure epistasis has not yet been determined.
Unlike genetic association studies that use wrapper and embedded techniques, most studies involving filter techniques rarely consider scenarios which cover genetic heterogeneity. However, one filter technique which should be suitable for detecting pure epistasis in the presence of genetic heterogeneity is an omnibus permutation test on ensembles of two-locus analyses or 2LOmb (Wongseree et al. 2009). 2LOmb exhaustively performs two-locus analysis on case-control SNP data by χ² tests. The best ensemble of SNP pairs is then progressively constructed where the statistical significance of the association between the ensemble and the disease is determined by a permutation test. 2LOmb is suitable for detecting purely epistatic two-locus interactions and purely epistatic multi-locus interactions with marginal two-locus effects (Wongseree et al. 2009). In addition, 2LOmb has been successfully benchmarked against an exhaustive two-locus analysis technique, a set association approach (Hoh et al. 2001), a correlation-based feature selection technique (Hall and Holmes 2003) and a tuned ReliefF technique (Moore and White 2007). Although the study has been conducted without considering genetic heterogeneity, the result from an application of 2LOmb to a real case-control data set, derived from a genome-wide data set by focusing on SNPs within or near candidate genes, suggests that 2LOmb can function when genetic heterogeneity is present. Previously, 2LOmb identified 11 intronic SNPs which exhibit no marginal single-locus effects and are associated with type 2 diabetes mellitus (T2D) in a UK population (The Wellcome Trust Case Control Consortium 2007): four SNPs in PGM1 (phosphoglucomutase 1), two SNPs in LMX1A (LIM homeobox transcription factor 1, alpha), two SNPs in PARK2 (parkinson protein 2, E3 ubiquitin protein ligase (parkin)) and three SNPs in GYS2 (glycogen synthase 2 (liver)). The results also suggest that there are no interactions between genes (Wongseree et al. 2009).
This finding points to the power of 2LOmb to detect genetic heterogeneity. Nevertheless, a thorough investigation by simulations is still required. In addition, the possibility of applying 2LOmb to a genome-wide data set also needs to be explored.
In this article, the ability of 2LOmb to detect pure epistasis in the presence of genetic heterogeneity is demonstrated. 2LOmb is benchmarked against MDR and RF in various simulation scenarios generated by varying the number of available SNPs, the number of causative SNPs and the ratio of case samples in which the disease status is governed by different purely epistatic interaction models. The statistical power of 2LOmb to directly identify the number of independent interactions in simulated data from its output is subsequently evaluated. The application of 2LOmb to a genome-wide type 1 diabetes mellitus (T1D) data set is also included. In this study, the genome-wide T1D data set is chosen instead of the T2D data set because 2LOmb does not detect any purely epistatic interactions in the T2D data set.
Results and discussion
Testing with small-scaled simulated data
2LOmb is benchmarked against MDR and RF in a simulation trial involving both pure epistasis and genetic heterogeneity. An output from an efficient algorithm should contain a low number of SNPs and a high number of correctly identified causative SNPs. These two measures on the number of SNPs are the performance indicators. Each simulated data set contains 20 or 1,000 unlinked SNPs in which two independent purely epistatic interactions are present. Each interaction is based on one of the models investigated by Wongseree et al. (2009) and is governed by two, three or four causative SNPs. As a result, the numbers of causative SNPs of interest in each data set are 4 (2&2), 5 (2&3), 6 (3&3 or 2&4), 7 (3&4) and 8 (4&4). The allele frequencies of all causative SNPs are 0.5; these are dictated by the purely epistatic models with penetrance tables derived by Culverhouse et al. (2002) and Wongseree et al. (2009). On the other hand, the minor allele frequencies (MAFs) of the remaining SNPs are between 0.05 and 0.5; these conform to the MAFs of SNPs targeted by the International HapMap Project (The International HapMap Consortium 2005). The allele frequency setting is similar to that in the early study by Wongseree et al. (2009). The data set consists of balanced case-control samples of size 1,600. All SNPs in control samples are in Hardy-Weinberg equilibrium. The case samples are drawn from two independent groups of affected individuals where the disease status of each individual from the same group is the result of the same purely epistatic interaction. This leads to the presence of genetic heterogeneity. The ratios of case samples from two affected groups that are of interest are 1:1 and 1:3. The genotype distribution of causative SNPs that produce an independent interaction follows the purely epistatic model, leading to a heritability of 0.01. Thirty independent data sets for each simulation setting are generated by genomeSIM (Dudek et al. 2006).
Since the same simulated data sets are used during the benchmarking, a paired t-test can be applied to assess the significance of difference in algorithm performance.
The results from the problems with 20 and 1,000 SNPs in data are shown in Figures 1 and 2, respectively. It can be seen from Figure 1 that MDR fails to detect some causative SNPs. This suggests that the detection capability of MDR is lower than that of both RF and 2LOmb. Since it is highly unlikely that the MDR performance can improve when the number of available SNPs increases, the MDR simulation for the problem with 1,000 SNPs is not carried out. As a result, the MDR simulation is limited to the problem with 20 SNPs, which is similar to the study by Edwards et al. (2009). In Figures 1 and 2, the parameter setting of 1:3 for the ratio of case samples from two affected groups leads to two sets of results if the numbers of causative SNPs responsible for the two independent interactions are not equal. The first set of results is obtained when the low-order interaction is responsible for the affected status of individuals from the small-proportion group. On the other hand, the second set of results is obtained when the low-order interaction is responsible for the affected status of individuals from the large-proportion group. 2LOmb significantly outperforms MDR and RF in terms of the low number of output SNPs, the high number of correctly identified causative SNPs or both in the problems with 20 and 1,000 SNPs (a paired t-test on 15 × 30 = 450 benchmark results for each problem yields a p-value < 0.05). The statistical power analysis also reveals that the benchmark trial with 30 independent data sets for each simulation setting is sufficient for an accurate evaluation of the overall algorithm performance (power > 0.95 for a type I error rate of 0.05). The simulation results can be further interpreted as follows.
MDR functions by attempting to identify a SNP combination which leads to the maximum prediction accuracy. In the presence of genetic heterogeneity, multiple SNP combinations are required where each combination is needed for the correct class prediction of a portion of case-control samples. If the proportions of samples whose class labels can be predicted by different SNP combinations are equal, then each causative SNP contributes equally towards achieving high prediction accuracy. Subsequently, MDR is able to detect all or almost all causative SNPs. On the other hand, if the proportions of samples are not equal, then the attained prediction accuracy depends more on the ability to classify samples that occupy the large proportion. In other words, the inclusion of the SNP combination necessary for the identification of class labels of samples that occupy the small proportion does not lead to an improvement of prediction accuracy. As a result, MDR fails to uncover some causative SNPs when the ratio of case samples from two affected groups is 1:3.
RF identifies causative SNPs by permuting the genotype of the SNP of interest and monitoring how it affects the prediction accuracy. The expectation is that the reduction in prediction accuracy as a result of the genotypic change of a causative SNP is more prominent than that of other SNPs. Although this is an efficient strategy, RF also selects erroneous SNPs as causative SNPs. This is observed from the number of output SNPs reported by RF, which is greater than the number of correctly identified causative SNPs. The number of erroneous SNPs increases drastically when the number of available SNPs increases from 20 to 1,000. Moreover, the number of correctly identified causative SNPs also markedly decreases. This can be explained by the manner in which each tree is constructed. Basically, the tree construction begins by assigning a SNP, which provides the best split, from a randomly chosen SNP set as the root node, creating a split according to the genotype and sorting samples to the appropriate descendant node. This process is repeated until each final descendant node is assigned with samples from the same class or the maximum tree size, dictated by the number of available SNPs, is reached. Permuting the genotype of a SNP located at or near the root node produces a large effect on the prediction accuracy while permuting the genotype of a SNP in a descendant node produces a small effect. Since the chance of a causative SNP being located at or near a root node is small when the problem contains a large number of SNPs, the variable importance of causative SNPs obtained by the genotype permutation may not be markedly different from that of other SNPs. This subsequently leads to the degradation of the RF performance. Although the performance of RF can be improved by increasing the number of trees in the forest (Strobl and Zeileis 2008), it also leads to an increase in computational time.
The computational time result for RF, which will be discussed later, provides evidence against the option of increasing the number of trees for this study.
As mentioned earlier, 2LOmb produces the best results among the three techniques in the benchmark trial. 2LOmb is capable of detecting most of the causative SNPs in every simulated data set. This performance is further strengthened by highly significant p-values (2LOmb's global p-value < 0.0001) and the presence of common SNPs among some or all SNP pairs that are parts of three- and four-locus interactions in the 2LOmb results. Nonetheless, some causative SNPs are missing from the 2LOmb output. Since the study is carried out by varying the number of available SNPs, the number of causative SNPs and the ratio of case samples from two affected groups, these parameters may influence the number of correctly identified causative SNPs. The parameter analysis is divided into two parts. The first part concentrates on the results from problems where the numbers of causative SNPs responsible for two independent interactions are equal while the second part concentrates on those where the numbers of causative SNPs are not equal. The analysis is divided in this manner because, as mentioned earlier, the parameter setting of 1:3 for the ratio of case samples from two affected groups leads to two sets of results in only the second part of the analysis. From both parts, analysis of variance (ANOVA) reveals that all three parameters are sources of variation which significantly affect the number of correctly identified causative SNPs (p < 0.05).
It is observed that the number of correctly identified causative SNPs decreases when the number of available SNPs is large. This is to be expected because the Bonferroni correction factor is a quadratic function of the number of available SNPs. An increase in the Bonferroni correction factor leads to an increase in the Bonferroni-corrected χ² p-value. If there are not enough samples for the two-locus analysis to produce a sufficiently low Bonferroni-corrected χ² p-value, some causative SNPs may be excluded from the output ensemble. This is highly evident when the ratio of case samples from two affected groups is 1:3. The variation in the ratio of case samples also leads to a change in the number of correctly identified causative SNPs. The magnitudes of Bonferroni-corrected χ² p-values for causative SNP pairs are similar when the numbers of case samples from two affected groups are equal. All causative SNPs can generally be identified in this scenario. On the other hand, when the numbers are unequal, the Bonferroni-corrected χ² p-values for SNP pairs responsible for the affected status of individuals from the small-proportion group are higher than those for SNP pairs responsible for the affected status of individuals from the large-proportion group. As a result, the exclusion of causative SNP pairs with insufficiently low Bonferroni-corrected χ² p-values from the output ensemble leads to a decrease in the number of correctly identified causative SNPs.
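The quadratic growth of the correction factor can be illustrated with a short Python sketch. It assumes that the factor equals the number of distinct SNP pairs, n(n-1)/2, which is consistent with an exhaustive two-locus scan but is not spelled out in the text; the function names are hypothetical.

```python
from math import comb


def bonferroni_factor(n_snps: int) -> int:
    """Number of distinct SNP pairs in an exhaustive two-locus scan.

    Assumption: one test per unordered SNP pair, i.e. n * (n - 1) / 2 tests.
    """
    return comb(n_snps, 2)


def bonferroni_corrected(p_value: float, n_snps: int) -> float:
    """Multiply the raw p-value by the number of tests, capped at 1."""
    return min(1.0, p_value * bonferroni_factor(n_snps))


if __name__ == "__main__":
    # SNP counts used in the simulations described in this article.
    for n in (20, 1_000, 10_000, 100_000):
        print(f"{n:>7} SNPs -> {bonferroni_factor(n):>13,} pairwise tests")
```

Under this assumption, the same raw p-value is penalised roughly 2,600 times more heavily with 1,000 SNPs than with 20 SNPs, which is in line with the loss of causative SNP pairs described above.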
In contrast to the first two parameters, an increase in the number of causative SNPs leads to an increase in the number of correctly identified causative SNPs. This phenomenon can be explained as follows. To identify a multi-locus interaction, a number of SNP pairs must be included in the output ensemble. For instance, an ensemble of three SNP pairs, namely (SNP1, SNP2), (SNP2, SNP3) and (SNP1, SNP3), leads to the identification of a three-locus interaction between SNP1, SNP2 and SNP3. However, only two out of three SNP pairs are necessary for the correct interaction identification. Similarly, only three out of six possible SNP pairs are necessary for the correct identification of a four-locus interaction. In other words, the number of redundant SNP pairs increases as the order of interaction increases. Hence, the number of correctly identified causative SNPs increases when there are more redundant SNP pairs, which can be omitted from the output ensemble.
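The counting argument in this paragraph can be written compactly. The sketch below assumes that an interaction among k SNPs is recovered whenever the retained pairs link all k SNPs together through shared SNPs; this is an interpretation of the two examples given rather than a rule stated in the text.

```latex
\begin{align*}
\text{possible pairs among } k \text{ SNPs} &= \binom{k}{2} = \frac{k(k-1)}{2},\\
\text{pairs needed to link all } k \text{ SNPs} &= k-1,\\
\text{redundant pairs} &= \binom{k}{2}-(k-1) = \frac{(k-1)(k-2)}{2}.
\end{align*}
```

For k = 3 this gives 3 possible pairs, 2 needed and 1 redundant; for k = 4 it gives 6 possible pairs, 3 needed and 3 redundant, matching the two examples in the text.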
In addition to the superiority in terms of the number of output SNPs and the number of correctly identified causative SNPs, the computational time for 2LOmb analysis is tractable. The order of growth in 2LOmb computational time is $O(mn^{2})$ where $m$ is the sample size and $n$ is the number of available SNPs. The computational time for RF analysis is also tractable. However, the order of growth in RF computational time is $O(m \log(m) \sqrt{n} f)$ where $f$ is the number of trees, signifying that the required computational time also depends on the algorithm setting (Guyon and Elisseef 2006). In contrast, the tractability of MDR depends on the maximum size of explored models. If MDR explores all possible SNP combinations, the order of growth in MDR computational time is $O(m 2^{n})$, which makes the computational time intractable. On the other hand, if MDR explores only models that do not contain more than $n_{s}$ SNPs where $n_{s} < n$, the order of growth in MDR computational time is $O(m n^{n_{s}})$, which means that the computational time is tractable. Since the computational time required by 2LOmb, RF and MDR with the latter setting is tractable in all three cases, a comparison of computational time is carried out. MDR explores only models that do not cover more than 10 SNPs in the 20-SNP data sets. An MDR permutation test is also omitted because it requires large computational efforts and is only performed to assess the probability that the null hypothesis of no association is true. The summary of computational time required by all three techniques is given in Table 1. RF uses less computational time than 2LOmb while MDR uses more computational time than 2LOmb to analyse 20-SNP data sets. However, the computational time required by RF to analyse 1,000-SNP data sets is greater than that required by 2LOmb.
In addition, by limiting the MDR analysis to the exploration of models that do not cover more than 10 SNPs, it is estimated using the present MDR result that the computational time required by MDR to analyse a 1,000-SNP data set is $2.75 \times 10^{21}$ seconds. The present MDR result also suggests that the computational time required by MDR to perform a permutation test using 1,000 permutation replicates on a 20-SNP data set is $6.37 \times 10^{6}$ seconds. The estimation of computational time conforms to the results from early reports (Edwards et al. 2009; Pattin and Moore 2008; Ritchie et al. 2001; Wongseree et al. 2009). This means that a direct application of MDR and RF used in this study (see Methods for details) to larger data sets in which all SNPs exhibit no marginal single-locus effects is certainly impractical. Overall, 2LOmb outperforms MDR and RF in this study. There are many attribute selection techniques that have been successfully applied to genetic association studies (Heidema et al. 2006; Motsinger et al. 2007; Van Steen 2012). It would be interesting to benchmark 2LOmb against other techniques that can also be applied to data containing pure epistasis (Culverhouse 2012; Jiang et al. 2011b; Zhang and Liu 2007) and genetic heterogeneity (Culverhouse 2012).
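To give a feel for why the size of the MDR search space dominates its running time, the following Python sketch counts the SNP combinations that contain at most a given number of SNPs. It is only a count of candidate models under the assumption that every combination of 1 to max_order SNPs is evaluated once; it ignores the per-model cost and cross-validation, and it is not intended to reproduce the time estimates quoted above. The function name is hypothetical.

```python
from math import comb


def mdr_model_count(n_snps: int, max_order: int) -> int:
    """Count SNP combinations containing between 1 and max_order SNPs.

    Assumption: each such combination is one candidate model.
    """
    return sum(comb(n_snps, k) for k in range(1, max_order + 1))


if __name__ == "__main__":
    # Settings taken from the text: models of up to 10 SNPs.
    for n in (20, 1_000):
        print(f"{n:>5} SNPs, up to 10 per model: {mdr_model_count(n, 10):.3e} models")
```

Moving from 20 to 1,000 SNPs at the same maximum model size increases the count by many orders of magnitude, whereas the number of SNP pairs examined by an exhaustive two-locus scan grows only quadratically.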
Another advantage of using 2LOmb for detecting pure epistasis in the presence of genetic heterogeneity is the ability to identify the number of independent interactions. This is possible because 2LOmb reports its output in the form of an ensemble of SNP pairs. If there are common SNPs between pairs, then the detection of a multi-locus interaction is declared. On the other hand, the absence of common SNPs between pairs signifies that the interactions are independent. For example, an ensemble that contains SNP pairs (SNP1, SNP2), (SNP3, SNP4) and (SNP4, SNP5) indicates the presence of genetic heterogeneity in which a two-locus interaction between SNP1 and SNP2 and a three-locus interaction between SNP3, SNP4 and SNP5 are independently responsible for the disease status of each individual. It is impossible to directly identify the number of independent interactions from the MDR and RF results because both techniques report their outputs in the form of a set of SNPs and not a set of SNP pairs. To demonstrate this capability of 2LOmb, the previously described simulation is extended where the number of independent data sets for each simulation setting increases from 30 to 100. The proportions of independent data sets in which 2LOmb can identify at least one interaction and both interactions in the data sets are obtained for the calculation of statistical power. Detection of one interaction is declared if 2LOmb correctly identifies at least two interacting causative SNPs responsible for the affected status of individuals from only one case group. On the other hand, detection of two interactions is declared if 2LOmb correctly identifies at least four interacting causative SNPs in the form of two SNP pairs without a common SNP among the pairs. In addition, each SNP pair must be responsible for the affected status of individuals from a different case group.
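The rule for reading independent interactions from an ensemble of SNP pairs amounts to grouping pairs that share a SNP. A minimal Python sketch of that grouping is given below, using a union-find structure; the function name and the string labels are illustrative assumptions and are not part of 2LOmb itself.

```python
from collections import defaultdict


def independent_interactions(pairs):
    """Group SNP pairs into interactions: pairs sharing a SNP belong together.

    Returns a list of sorted SNP lists, one per independent interaction.
    """
    parent = {}

    def find(snp):
        parent.setdefault(snp, snp)
        while parent[snp] != snp:
            parent[snp] = parent[parent[snp]]  # path halving
            snp = parent[snp]
        return snp

    for snp_a, snp_b in pairs:
        root_a, root_b = find(snp_a), find(snp_b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two groups

    groups = defaultdict(set)
    for snp in list(parent):
        groups[find(snp)].add(snp)
    return sorted((sorted(group) for group in groups.values()), key=lambda g: g[0])


if __name__ == "__main__":
    # The example ensemble from the text.
    ensemble = [("SNP1", "SNP2"), ("SNP3", "SNP4"), ("SNP4", "SNP5")]
    for interaction in independent_interactions(ensemble):
        print(f"{len(interaction)}-locus interaction: {interaction}")
```

For the example ensemble, the sketch yields one two-locus group (SNP1, SNP2) and one three-locus group (SNP3, SNP4, SNP5), so the number of groups corresponds to the number of independent interactions.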
The statistical power to detect genetic heterogeneity summarised in Table 2 indicates that 2LOmb can identify both interactions in nearly all 20-SNP data sets. However, a loss of statistical power to detect both interactions is observed when the number of available SNPs increases from 20 to 1,000. In particular, this occurs when the ratio of case samples from two affected groups is 1:3 and a two-locus interaction is responsible for the affected status of individuals from the small-proportion group. This conforms to the early observation regarding the effects of increasing the number of available SNPs and increasing the number of causative SNPs on the number of correctly identified causative SNPs. In brief, the Bonferroni correction factor increases when the number of available SNPs increases. If the Bonferroni-corrected χ² p-value for a causative SNP pair is not low enough, this pair would be excluded from the output ensemble. Subsequently, a failure to identify the causative SNP pair that is solely responsible for the affected status of individuals from the small-proportion group leads to the reduction in statistical power to detect both interactions.
Testing with large-scaled simulated data
In this part of the study, each simulated data set contains 10,000 or 100,000 unlinked SNPs where two independent purely epistatic two-locus interactions are present. Only the setting of two independent two-locus interactions is considered because the early simulation results given in Table 2 indicate that this scenario is the most difficult one when the number of available SNPs is large. The allele frequencies of all causative SNPs are 0.5 while the MAFs of the remaining SNPs are between 0.05 and 0.5. The data set consists of balanced case-control samples of size 1,600, 3,200 or 6,400. All SNPs in control samples are in Hardy-Weinberg equilibrium. The case samples are drawn from two independent groups of affected individuals where the ratios of samples from two affected groups are 1:1 and 1:3. The genotype distribution of interacting causative SNPs follows the purely epistatic model which gives the heritability of 0.01. One hundred independent data sets for each simulation setting are generated by genomeSIM for the evaluation of statistical power to detect genetic heterogeneity.
The summary of statistical power to detect genetic heterogeneity in Table 3 indicates that the ability to detect both interactions is highest when the ratio of case samples from two affected groups is 1:1 and is lowest when the sample size is 1,600 and the ratio of case samples from two affected groups is 1:3. This conforms to the early observation where a similar phenomenon is detected when the number of available SNPs is 1,000. However, once the sample size is doubled and quadrupled, the ability to detect both interactions increases significantly. Both interactions can be detected in almost all data sets containing 10,000 or 100,000 SNPs and 3,200 samples while both interactions can be detected in all data sets containing 10,000 or 100,000 SNPs and 6,400 samples. An increase in sample size causes an increase in the χ² test statistic during the two-locus analysis of causative SNPs. This suggests that increasing the sample size leads to a lower Bonferroni-corrected χ² p-value for the SNP pair responsible for the affected status of individuals from the small-proportion group. Subsequently, the chance of this SNP pair being included in the output ensemble increases.
Bonferroni correction during the two-locus analysis plays an important role in keeping the number of output SNPs reported by 2LOmb close to the number of causative SNPs (Wongseree et al. 2009). However, the overly conservative nature of Bonferroni correction when the number of statistical tests is large (Jiang et al. 2011a) also leads to the aforementioned limitation in 2LOmb's ability to detect both independent interactions when the ratio of case samples from two affected groups is 1:3. Although increasing the sample size is a possible solution, other multiple testing correction techniques can be used instead of the Bonferroni correction to tackle this problem. For instance, false discovery rate (FDR) analysis is a strong candidate and has been shown to be appropriate for DNA microarray data analysis (Storey and Tibshirani 2003) and genome-wide association studies (The Diabetes Genetics Replication and Meta-analysis Consortium 2012). Further studies are required to determine the effect of replacing the Bonferroni correction in the two-locus analysis within 2LOmb with the FDR analysis.
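As an illustration of how an FDR-based correction differs from the Bonferroni correction, the Python sketch below implements the Benjamini-Hochberg step-up procedure. This is a substitute chosen for simplicity and is not necessarily the FDR method of the cited studies (Storey and Tibshirani 2003 describe a q-value approach); how such a correction would behave inside 2LOmb is untested here, and the function names and p-values are purely illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Indices of hypotheses rejected by the Benjamini-Hochberg procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is below its step-up threshold.
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


def bonferroni(p_values, alpha=0.05):
    """Indices of hypotheses rejected after Bonferroni correction."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


if __name__ == "__main__":
    # Made-up p-values for four tests, for illustration only.
    p = [0.01, 0.02, 0.03, 0.5]
    print("Bonferroni rejects:", bonferroni(p))
    print("Benjamini-Hochberg rejects:", benjamini_hochberg(p))
```

On these made-up values, the Bonferroni correction retains only the first test whereas the Benjamini-Hochberg procedure retains the first three, which shows the less conservative behaviour that motivates the suggestion above.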
The computational time summarised in Table 4 indicates that the computational time for the large-scaled simulation is a linear function of sample size. This is to be expected because the construction of a 2 × 9 contingency table for each two-locus analysis requires the assignment of samples to the appropriate cells in the table, which is a linear-time operation. On the other hand, the computational time is a quadratic function of the number of available SNPs. Since the basic operation of 2LOmb is the two-locus analysis, 2LOmb can tackle large-scaled problems with fixed sample size and varied number of SNPs in quadratic time (Wongseree et al. 2009). Overall, the result agrees with the order of growth in computational time discussed in the small-scaled simulation section. Based on the computational time given in Table 4, it is estimated that 2LOmb requires $3.14 \times 10^{5}$ seconds (87.2 hours) of computational time to complete the analysis of a data set containing 500,000 SNPs and 6,400 samples. This suggests that the computational time of 2LOmb is tractable for genome-wide association studies.
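The linear-time construction of the 2 × 9 contingency table can be sketched in plain Python as follows. The sketch assumes genotypes coded as 0, 1 or 2 copies of one allele and a Pearson χ² statistic that skips empty genotype combinations; conversion to a p-value, the Bonferroni correction and the rest of 2LOmb are left out, and the function name is hypothetical.

```python
def two_locus_chi2(geno_a, geno_b, is_case):
    """Build a 2 x 9 case-control table for one SNP pair and return it
    together with the Pearson chi-square statistic.

    geno_a, geno_b: genotypes coded 0, 1 or 2 (an assumed coding).
    is_case: True for case samples, False for control samples.
    """
    table = [[0] * 9 for _ in range(2)]
    # One pass over the samples: each sample lands in exactly one cell.
    for a, b, case in zip(geno_a, geno_b, is_case):
        table[1 if case else 0][3 * a + b] += 1

    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(9)]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(9):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip genotype combinations that never occur
                chi2 += (table[i][j] - expected) ** 2 / expected
    return table, chi2


if __name__ == "__main__":
    # Toy data: four samples, for illustration only.
    table, chi2 = two_locus_chi2([0, 0, 2, 2], [0, 0, 2, 2],
                                 [False, False, True, True])
    print(table)
    print(chi2)
```

The loop over samples is the only step whose cost depends on the sample size, which is why the time per SNP pair grows linearly with the number of samples while the number of SNP pairs grows quadratically with the number of SNPs.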
Analysis of type 1 diabetes mellitus data
The presence of pure epistasis and genetic heterogeneity in a T1D data set is identified using 2LOmb. The data set, which is collected and screened by the Wellcome Trust Case Control Consortium (WTCCC), consists of 1,963 case samples and 2,938 control samples. The case samples are collected from affected individuals in the UK while the control samples result from merging 1,458 samples from the UK blood services with 1,480 samples from the 1958 British birth cohort. The data set contains 469,557 SNPs, which are genotyped through the Affymetrix GeneChip 500K Mapping Array Set and pass the WTCCC quality control (The Wellcome Trust Case Control Consortium 2007). The SNP set is first reduced by screening for SNPs within or near genes (Herold et al. 2009; Ritchie 2011) according to NCBI build 36.3 (dbSNP b129) coordinates. SNPs that are near a gene are located within 2,000 bases upstream of the start site or 500 bases downstream of the termination site for transcription. The SNP set is further reduced by removing SNPs that exhibit marginal single-locus effects or have MAFs below 0.1. SNPs whose genotype distribution within control samples departs from Hardy-Weinberg equilibrium are also discarded. The final SNP set contains 95,991 SNPs with no marginal single-locus effects (uncorrected χ^{2} p-value > 0.05) from 12,146 genes.
The analysis of the reduced T1D data set by 2LOmb takes 8,862 seconds (2.46 hours) of computational time on the computer system with a graphics processing unit (see Table 4 for detailed computer specification). A possible genetic association is detected for 12 SNPs located within or near five genes (global p-value < 0.0001). Details of these SNPs, the SNP pairs that exhibit marginal two-locus effects and the identified genes are given in Table 5. Linkage disequilibrium (LD) analysis is subsequently performed using JLIN (Carter et al. 2006) and the LD patterns are shown in Figure 3. All SNPs within or near the same gene are in LD, as indicated by high values of D′ and r^{2}. This is most likely the cause of the identification of multiple SNPs from the same gene. On the other hand, SNPs in each pair that contains SNPs from different genes (SNP pairs 2–18) are not in LD, as indicated by low values of D′ and r^{2}. There are several subsets of these SNP pairs in which each subset contains three SNP pairs with common SNPs between pairs. One example is {(SNP1, SNP5), (SNP1, SNP7), (SNP1, SNP9)}. Consequently, the detection of these 17 SNP pairs indicates that there is a four-locus interaction between MUC21 (mucin 21, cell surface associated), MUC22 (mucin 22), PSORS1C1 (psoriasis susceptibility 1 candidate 1) and TCF19 (transcription factor 19). In contrast, there are no interactions between ATAD1 (ATPase family, AAA domain containing 1) and the other genes because no SNP pair contains a SNP from ATAD1 and a SNP from any of the remaining four genes. The detection of three linked SNPs within ATAD1 is believed to be the result of haplotype effects (Epstein and Satten 2003). Altogether, these findings signify the presence of pure epistasis and genetic heterogeneity. In real data analysis, the detection of a SNP pair that associates with the disease is insufficient to claim the presence of pure epistasis. If the SNP pair consists of two unlinked SNPs, then the detection of pure epistasis can be declared. Otherwise, the detection is the result of LD between SNPs. Since 2LOmb analysis alone cannot distinguish genetic association due to pure epistasis from genetic association due to LD, it is crucial to always perform additional LD analysis.
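The two LD measures used in this analysis can be illustrated with a short sketch. It computes D, D′ and r^{2} for a pair of biallelic SNPs from one haplotype frequency and two allele frequencies. The sketch assumes that the haplotype frequency is already known and that both SNPs are polymorphic; with unphased genotype data the haplotype frequencies would first have to be estimated. It is not a description of how JLIN is implemented, and the function name is hypothetical.

```python
def ld_measures(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a, p_b: frequencies of allele A and allele B (both strictly between 0 and 1)
    """
    d = p_ab - p_a * p_b
    # The normalising constant depends on the sign of D.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


if __name__ == "__main__":
    # Hypothetical frequencies: independent SNPs, then perfectly correlated SNPs.
    print(ld_measures(0.2, 0.5, 0.4))
    print(ld_measures(0.3, 0.3, 0.3))
```

Values of D′ and r^{2} near zero correspond to the unlinked SNP pairs discussed above, while values near one correspond to SNPs within or near the same gene.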
The first four genes identified by 2LOmb, namely MUC21, MUC22, PSORS1C1 and TCF19, are located within the major histocompatibility complex (MHC). MHC is a genomic region in which a mouse model of human complex diseases suggests the presence of T1D susceptibility genes (Cordell et al. 2001). The four genes are also located between DDR1 (discoidin domain receptor tyrosine kinase 1) and HLA-DQA1 (major histocompatibility complex, class II, DQ alpha 1), which is the region where the DR3-DQ2 ancestral haplotype 18.2 (AH18.2) has been shown to be highly conserved and likely to carry susceptibility alleles for T1D in a Spanish population (Santiago et al. 2009). This implies that the detection of a four-locus interaction between these four genes conforms to the evidence from earlier genetic association studies of T1D. On the other hand, there are no earlier reports regarding the association between ATAD1 polymorphisms and T1D. ATAD1 is among many candidate genes for the association studies of Parkinson's disease. Nonetheless, there is little information about pathways that include ATAD1 (Moran and Graeber 2008). Hence, the association between ATAD1 polymorphisms and T1D cannot be explained at this point.
This study produces evidence of association between T1D and 12 SNPs within or near MUC21, MUC22, PSORS1C1, TCF19 and ATAD1 in a UK population. Although there are other independent genome-wide T1D data sets, association detection within these data sets using the presented methodology has not yet been attempted. The methodology employed in most genome-wide association studies is based on single-locus analysis (Cooper et al. 2008; The Wellcome Trust Case Control Consortium 2007). Since each SNP explored in the reduced T1D data set exhibits no marginal single-locus effect, the most direct approach for replicating the association results presented in this article is to perform the same association detection on these independent data sets. This would help to gain further insight into the genetics of T1D.
Conclusions
In this article, the detection of pure epistasis (Culverhouse et al. 2002) in the presence of genetic heterogeneity is investigated. The study focuses on the capability to detect two independent interactions that influence the development of the same complex disease. Each interaction can be either a purely epistatic two-locus interaction or a purely epistatic multilocus interaction in which the causative SNPs exhibit no marginal single-locus effects. The candidate techniques for the detection benchmarking are MDR (Ritchie et al. 2001), RF (Breiman 2001) and 2LOmb (Wongseree et al. 2009). The results from various simulation scenarios indicate that 2LOmb outperforms MDR and RF in terms of a low number of output SNPs and a high number of correctly identified causative SNPs. These scenarios are created by varying the number of available SNPs in data, the number of causative SNPs and the ratio of case samples from two affected groups. ANOVA reveals that all three simulation parameters influence the number of correctly identified causative SNPs in the 2LOmb output. In addition to its superior detection performance, 2LOmb is also capable of identifying the number of independent interactions. This is achieved through the identification of common SNPs among SNP pairs in the ensemble. The results indicate that 2LOmb is able to identify the presence of independent interactions even when the number of available SNPs reaches 100,000. Moreover, this is achieved in tractable computational time, which makes 2LOmb suitable for use in genome-wide association studies. 2LOmb is subsequently applied to a T1D data set, which contains 1,963 case samples and 2,938 control samples and is collected from a UK population (The Wellcome Trust Case Control Consortium 2007). The genome-wide data set is first screened for SNPs that are located within or near genes. The data set is further reduced by removing SNPs that exhibit marginal single-locus effects or have MAFs below 0.1. The final data set contains 95,991 SNPs from 12,146 genes. 2LOmb identifies 12 SNPs that are associated with the disease. These SNPs are located within or near MUC21, MUC22, PSORS1C1, TCF19 and ATAD1. 2LOmb and LD analyses indicate that there is a four-locus interaction between MUC21, MUC22, PSORS1C1 and TCF19 while SNPs from ATAD1 are independently associated with the disease. This signifies the presence of both pure epistasis and genetic heterogeneity. The evidence of genetic association for these five genes provides an alternative explanation for the aetiology of T1D in the UK population. It also confirms that SNPs which exhibit no marginal single-locus effects in a genome-wide data set can be useful for genetic association studies (Wongseree et al. 2009).
Methods
Purely epistatic model
A purely epistatic model is first defined by Culverhouse et al. (2002). The model describes an interaction between unlinked SNPs which leads to an epistatic effect while each interacting SNP exhibits no marginal single-locus effect. As a result, it is impossible to screen for SNPs contributing to pure epistasis by single-locus χ^{2} tests for allelic and genotypic association. However, pure epistasis can be detected by multilocus analysis. In this study, each model contains two, three or four causative SNPs. The purely epistatic three- and four-locus interaction models also exhibit marginal two-locus effects. All models yield a heritability of 0.01, which implies that genetic factors partially contribute towards disease susceptibility. The penetrance tables, which define the probability that an individual with a specific genotype has the disease, for the purely epistatic two-, three- and four-locus interaction models used throughout the simulations are given in Tables 6, 7 and 8, respectively. Detailed derivation of these models is given in Culverhouse et al. (2002) and Wongseree et al. (2009).
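The absence of marginal single-locus effects can be checked numerically for any given penetrance table. The sketch below does this for a hypothetical two-locus table, which is a textbook-style example and not Table 6 of this article. It assumes two unlinked SNPs in Hardy-Weinberg equilibrium, each with a minor allele frequency of 0.5; the function names are illustrative only.

```python
def genotype_freqs(maf):
    """Hardy-Weinberg frequencies for 0, 1 and 2 copies of the minor allele."""
    q = maf
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]


def marginal_penetrances(table, maf_a, maf_b):
    """Marginal penetrance of each genotype at SNP A and at SNP B.

    table[i][j] is P(disease | genotype i at SNP A, genotype j at SNP B).
    """
    fa, fb = genotype_freqs(maf_a), genotype_freqs(maf_b)
    marg_a = [sum(table[i][j] * fb[j] for j in range(3)) for i in range(3)]
    marg_b = [sum(table[i][j] * fa[i] for i in range(3)) for j in range(3)]
    return marg_a, marg_b


# Hypothetical penetrance table chosen so that every marginal is identical.
TABLE = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

if __name__ == "__main__":
    marg_a, marg_b = marginal_penetrances(TABLE, 0.5, 0.5)
    # Every marginal equals the disease prevalence (0.05, up to rounding),
    # so neither SNP shows a single-locus effect on its own.
    print(marg_a, marg_b)
```

Because every genotype at either SNP carries the same marginal risk, a single-locus test has nothing to detect, whereas the joint two-locus table differs strongly between cells.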
genomeSIM
genomeSIM is a software package for simulating case-control data in genetic association studies (Dudek et al. 2006). genomeSIM takes penetrance-based models as inputs necessary for dictating the case/control status of each sample. A case-control data set can be generated by a population-based simulation or a probability-based simulation. In the population-based simulation, a population of genotype strings is initialised according to the allele frequency of each SNP. Successive generations are subsequently created through a forward-time simulation by crossing the genotype strings within each generation. This is pursued until the predefined number of generations is reached. In the probability-based simulation, on the other hand, genotype strings are incrementally created until the predefined numbers of case and control samples are obtained. In this study, the probability-based simulation is employed to generate all case-control data sets. genomeSIM is available upon request to the Ritchie Lab, Center for System Genomics, Pennsylvania State University (Ritchie Lab 2013).
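The idea behind a probability-based simulation can be sketched in a few lines. The code below is an illustration written for this description and is not taken from genomeSIM. It assumes independent SNPs in Hardy-Weinberg equilibrium, a user-supplied penetrance function (hypothetical) and a penetrance that is neither always zero nor always one, since otherwise the loop would never terminate.

```python
import random


def simulate_case_control(penetrance, mafs, n_cases, n_controls, seed=0):
    """Draw genotype strings until the requested case and control counts are met.

    penetrance: function mapping a genotype list to P(disease | genotypes)
    mafs: minor allele frequency of each SNP
    """
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        # Genotype = number of minor alleles (0, 1 or 2) at each SNP.
        genotypes = [(rng.random() < q) + (rng.random() < q) for q in mafs]
        affected = rng.random() < penetrance(genotypes)
        if affected and len(cases) < n_cases:
            cases.append(genotypes)
        elif not affected and len(controls) < n_controls:
            controls.append(genotypes)
        # Samples drawn for a group that is already full are discarded.
    return cases, controls


if __name__ == "__main__":
    # Hypothetical model: disease risk depends only on the first two SNPs.
    def model(g):
        return 0.1 if (g[0] + g[1]) % 2 == 1 else 0.0

    cases, controls = simulate_case_control(model, [0.5] * 10, 50, 50)
    print(len(cases), len(controls))
```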
Multifactor dimensionality reduction
MDR is a wrapper technique which is capable of identifying causative SNPs that are associated with a disease from case-control data (Ritchie et al. 2001). MDR functions by attempting to identify the best SNP combination that yields the highest prediction accuracy. The prediction accuracy is calculated by means of a 10-fold cross-validation. During the cross-validation, the data set is randomly divided into 10 folds of combined case-control samples in which 9 folds of samples are used to construct the prediction model while the remaining fold is used to test the model. The process of prediction model construction and testing is then repeated 10 times, each time with a different sample fold chosen as the testing fold. The prediction model embedded in MDR is a multi-dimensional decision table with 3^{n_c} cells when n_c SNPs and all three possible genotypes of each SNP are considered. Each cell in the decision table is filled with the case and control samples whose genotypes coincide with the cell labels. The ratio between the numbers of case and control samples dictates whether the genotype in each cell is a protective or disease-predisposing genotype. The prediction accuracy is then evaluated by counting the number of testing samples whose disease status can be correctly identified using the decision rules provided by the table.
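The decision table described above can be sketched as follows. This is a simplified illustration and not the MDR software used in this study: it assumes that a cell is labelled disease-predisposing when its case/control ratio exceeds the overall case/control ratio of the training data, that cells unseen during training are predicted as controls, and that both classes are present. Cross-validation is omitted for brevity and the function names are hypothetical.

```python
from collections import Counter


def fit_mdr_table(genotypes, labels, snps):
    """Label each multilocus genotype cell as high risk (True) or low risk (False).

    genotypes: list of genotype lists (values 0, 1, 2); labels: 1 = case, 0 = control
    snps: indices of the SNPs in the combination being evaluated
    """
    cases, controls = Counter(), Counter()
    for g, y in zip(genotypes, labels):
        cell = tuple(g[s] for s in snps)
        (cases if y == 1 else controls)[cell] += 1
    # Assumed threshold: overall case/control ratio in the training data.
    threshold = sum(cases.values()) / sum(controls.values())
    table = {}
    for cell in set(cases) | set(controls):
        ratio = cases[cell] / controls[cell] if controls[cell] else float("inf")
        table[cell] = ratio > threshold
    return table


def mdr_accuracy(table, genotypes, labels, snps):
    """Fraction of samples whose status is predicted correctly by the table."""
    correct = 0
    for g, y in zip(genotypes, labels):
        predicted = 1 if table.get(tuple(g[s] for s in snps), False) else 0
        correct += predicted == y
    return correct / len(labels)
```

In a full MDR run, `fit_mdr_table` would be applied to nine folds and `mdr_accuracy` to the remaining fold, for every SNP combination under consideration.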
Similar to other wrapper techniques, the total number of possible prediction models that MDR can explore is 2^{n} − 1 where n is the number of available SNPs in the data set. With the use of an exhaustive search, MDR can generally identify the best SNP combination that gives the highest prediction accuracy. However, the search for the best model can also be limited to models that do not cover more than n_s SNPs where n_s < n. After exploring multiple prediction models with a fixed number of SNPs, MDR also returns an additional measure called cross-validation consistency. Each time a testing fold is used to determine the accuracy of the prediction model of interest, the attained accuracy can be compared with that from other models which have the same number of SNPs. The model with high cross-validation consistency is the one that consistently ranks first in comparison to other models regardless of which testing fold is used. A model with high cross-validation consistency usually has high prediction accuracy. As a result, prediction accuracy remains the principal criterion for model selection while cross-validation consistency is only applied as an auxiliary criterion. If two or more SNP combinations give the highest prediction accuracy and have equally high cross-validation consistency, the most parsimonious combination, that is, the combination with the fewest SNPs, is chosen as the best SNP combination.
A permutation test can subsequently be applied to test the null hypothesis of no association. Each permutation replicate is constructed by randomly assigning the case/control status to each sample with the constraint that the numbers of case and control samples must remain unchanged. MDR is then performed on each permutation replicate to obtain the best SNP combination together with its prediction accuracy and cross-validation consistency. The empirical p-value is given by the fraction of permutation replicates in which the measure of interest is larger than or equal to that obtained from the original data, where the measure can be either prediction accuracy or cross-validation consistency (Hahn et al. 2003). MDR used in this study is publicly available from the Computational Genetics Laboratory, Dartmouth Medical School, Dartmouth College (Computational Genetics Laboratory at Dartmouth Medical School 2013).
Random forest
RF refers to a collection or ensemble of decision trees (Breiman 2001). Each tree in RF is constructed in a top-down manner. The tree construction begins at the root node where an attribute (SNP) is selected as the test. Descendants of the root node are then created according to the values of this attribute (genotypes of this SNP). Next, the (case-control) data samples are sorted to the appropriate descendant node. The entire process is repeated using the samples associated with each descendant node to select another attribute to test at that point in the tree. This forms a forward search for an acceptable decision tree in which the search never backtracks to reconsider earlier node choices. Since there are multiple trees in the forest, RF takes a majority vote from the trees as the class decision. Hence, the trees should be diverse in order for the majority-vote concept to be applicable. It is suggested that an attribute for each node in a tree can be selected according to its suitability for being used as the test from a small group of randomly picked attributes. Empirical studies indicate that an attribute group size of \lceil\sqrt{\text{total number of attributes}}\rceil is sufficient. Consequently, the samples allocated to each descendant node, which is created after selecting the most suitable attribute as the test, have lesser class variety. Moreover, each tree in RF is allowed to grow to its maximum size. This does not lead to data overfitting because the overall class decision relies on outcomes from multiple trees in the forest.
Unlike MDR, which relies on cross-validation, RF determines its prediction accuracy through a bootstrap aggregating or bagging approach. Given a (case-control) sample set, a bootstrap sample set of the same size as the original sample set is generated by sampling from the original sample set uniformly and with replacement. About 63.2% of the original samples are expected to appear in the bootstrap sample set, while the remaining positions are filled by duplicates. Original samples that are absent from the bootstrap sample set are referred to as out-of-bag samples. Bootstrap samples are employed during the tree construction while out-of-bag samples are used to evaluate the prediction accuracy. A new bootstrap sample set is generated for the construction of each tree. As a result, during the prediction accuracy evaluation the votes for each sample are only counted across the trees for which the sample is out-of-bag. The application of a bootstrap aggregating approach also leads to a means to quantify attribute importance, which is commonly referred to as variable importance. The variable importance is measured using a permutation approach. By randomly permuting the value of the attribute of interest, the correlation between the attribute and the (case-control) class can be determined. When the permuted attribute and the remaining non-permuted attributes are used as inputs for RF to identify the class of out-of-bag samples, the prediction accuracy reduces markedly if the attribute of interest is correlated with the class. The average difference between the prediction accuracy obtained using the original attribute inputs and that obtained using the inputs with one permuted attribute over the trees is the variable importance. The standardised variable importance is defined as the quotient between the variable importance and a standard error derived from the between-tree variance of the variable importance. The standardised variable importance is then treated as following a standard normal distribution (Random Forests 2004). An attribute with variable importance in the top five percentiles of a normal distribution is considered to be in a top rank in comparison to other attributes and is hence correlated with the class. This decision criterion is similar to the one based on the extremity of variable importance suggested by Strobl et al. (2009). RF used in this study is publicly available from the Department of Statistics, University of California, Berkeley (Random Forests 2004). A review of RF for genetic association studies can be found in Goldstein et al. (2011). Interested readers should also refer to Schwarz et al. (2010) and Wei et al. (2013) for RF-based techniques that are computationally feasible for genome-wide association studies.
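The figure of 63.2% follows from a short calculation: the probability that a given original sample is never drawn in N draws with replacement is (1 − 1/N)^N, which tends to 1/e, so the expected fraction of original samples present in a bootstrap sample set tends to 1 − 1/e ≈ 0.632. The sketch below, with hypothetical function names, compares this expectation with a single simulated bootstrap draw.

```python
import math
import random

# Limiting fraction of original samples that appear in a bootstrap sample set.
LIMIT = 1.0 - math.exp(-1.0)


def expected_unique_fraction(n_samples):
    """Exact expectation 1 - (1 - 1/N)^N for a sample set of size N."""
    return 1.0 - (1.0 - 1.0 / n_samples) ** n_samples


def unique_fraction(n_samples, seed=0):
    """Fraction of original samples drawn at least once in one bootstrap draw."""
    rng = random.Random(seed)
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return len(drawn) / n_samples


if __name__ == "__main__":
    n = 10000
    print(LIMIT, expected_unique_fraction(n), unique_fraction(n))
```

The complement, roughly 36.8% of the original samples, is the out-of-bag set that each tree uses for accuracy evaluation.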
Omnibus permutation test on ensembles of two-locus analyses
2LOmb is a filter technique which is specifically designed for detecting pure epistasis in case-control data (Wongseree et al. 2009). 2LOmb consists of the following four steps.
Two-locus analysis
2LOmb begins by exhaustively performing two-locus analysis by χ^{2} tests. Each χ^{2} test determines the difference between the distribution of two-locus genotypes in case and control samples. For a case-control data set containing n SNPs, \binom{n}{2} two-locus analyses are performed. Subsequently, the χ^{2} p-value from each two-locus analysis is adjusted by a Bonferroni correction. The Bonferroni-corrected χ^{2} p-value from each two-locus analysis is min(\binom{n}{2} × uncorrected χ^{2} p-value, 1).
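The two-locus step can be sketched with the standard library alone. The code below is an illustration and not the 2LOmb implementation: it builds the 2 × 9 contingency table for each SNP pair, computes the Pearson χ^{2} statistic and applies the Bonferroni correction stated above. It assumes that genotype columns with no samples are dropped before the degrees of freedom are counted and that a pair is given a p-value of 1 when the test is undefined. All function names are hypothetical.

```python
import math
from itertools import combinations


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution (integer df >= 1)."""
    y = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-y), 1.0               # Q(1, y)
    else:
        q, a = math.erfc(math.sqrt(y)), 0.5    # Q(1/2, y)
    # Recurrence: Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def two_locus_pvalue(g1, g2, labels):
    """Chi-square test on the 2 x 9 table of status (0/1) vs two-locus genotype."""
    counts = [[0] * 9 for _ in range(2)]
    for a, b, y in zip(g1, g2, labels):
        counts[y][3 * a + b] += 1
    row_tot = [sum(counts[r]) for r in range(2)]
    cols = [j for j in range(9) if counts[0][j] + counts[1][j] > 0]
    if len(cols) < 2 or min(row_tot) == 0:
        return 1.0  # test undefined; treated as no evidence (assumption)
    n = len(labels)
    stat = 0.0
    for j in cols:
        col_tot = counts[0][j] + counts[1][j]
        for r in range(2):
            expected = row_tot[r] * col_tot / n
            stat += (counts[r][j] - expected) ** 2 / expected
    return chi2_sf(stat, len(cols) - 1)


def bonferroni_two_locus(genotypes, labels):
    """Map each SNP pair (i, j) to min(C(n, 2) * p, 1); genotypes[k] is SNP k."""
    n_snps = len(genotypes)
    n_tests = n_snps * (n_snps - 1) // 2
    return {
        (i, j): min(n_tests * two_locus_pvalue(genotypes[i], genotypes[j], labels), 1.0)
        for i, j in combinations(range(n_snps), 2)
    }
```

Filling the table takes one pass over the samples and the outer loop visits every SNP pair once, which matches the linear growth in sample size and quadratic growth in the number of SNPs reported for 2LOmb.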
Permutation test
A permutation test is performed to test the null hypothesis H_{0}^{e} that the ensemble e of two-locus analyses is not associated with the disease. To achieve this, a scalar statistic is first computed for the original case-control data set by combining Bonferroni-corrected χ^{2} p-values for SNP pairs through Fisher's combining function (-2\sum_{i}\log(p_{i})). The calculation of Fisher's test statistic is then repeated for a set of permutation replicates. Each permutation replicate is constructed by randomly permuting the case/control status of each sample, which leads to different Bonferroni-corrected χ^{2} p-values and a different Fisher's test statistic. The p-value of the null hypothesis H_{0}^{e} is then given by
p_{0}^{e} = \frac{\left|\left\{T_{i}^{e} : T_{i}^{e} \ge T_{0}^{e},\ 1 \le i \le t\right\}\right|}{t}
where T_{i}^{e} is Fisher's test statistic calculated for the permutation replicate i, T_{0}^{e} is Fisher's test statistic calculated for the original case-control data set, t is the number of permutation replicates and |·| denotes the size of a set.
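The permutation test for a single ensemble can be sketched as follows. The p-value is estimated as the fraction of permutation replicates whose Fisher statistic is at least as large as that of the original data, which is the usual empirical estimate and is assumed here. The callback `ensemble_pvalues` is hypothetical: it stands for whatever routine returns the Bonferroni-corrected p-values of the SNP pairs in the ensemble for a given labelling. The small floor on p-values is an addition of this sketch to avoid taking the logarithm of zero.

```python
import math
import random


def fisher_statistic(pvalues, floor=1e-300):
    """Fisher's combining function: -2 * sum(log(p_i))."""
    return -2.0 * sum(math.log(max(p, floor)) for p in pvalues)


def permutation_pvalue(ensemble_pvalues, labels, n_perm=1000, seed=0):
    """Fraction of permutation replicates with T_i >= T_0 for one ensemble.

    ensemble_pvalues: function mapping a label list to the corrected p-values
    of the SNP pairs in the ensemble (hypothetical callback)
    """
    rng = random.Random(seed)
    t0 = fisher_statistic(ensemble_pvalues(labels))
    permuted = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(permuted)  # keeps the numbers of cases and controls fixed
        if fisher_statistic(ensemble_pvalues(permuted)) >= t0:
            hits += 1
    return hits / n_perm
```

Shuffling the label vector, rather than redrawing labels, preserves the case and control counts in every replicate.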
Global p-value determination
Since multiple ensembles of two-locus analyses can be explored, the calculation of a global p-value is required to adjust for multiple hypothesis testing. The result is the p-value of the global null hypothesis H_{0} = \bigcap_{1 \le e \le E} H_{0}^{e} in which none of the E explored ensembles is associated with the disease. Similar to other omnibus permutation tests, the same set of permutation replicates that gives the raw or unadjusted p-value for each ensemble is also used to estimate the global p-value. To obtain the global p-value, the unadjusted p-value for the permutation replicate i of each hypothesis H_{0}^{e} is first calculated from
p_{i}^{e} = \frac{\left|\left\{T_{j}^{e} : T_{j}^{e} \ge T_{i}^{e},\ 1 \le j \le t\right\}\right|}{t}
The p-value of the global null hypothesis H_{0} is then given by
p = \frac{\left|\left\{p_{i}^{min} : p_{i}^{min} \le p_{0}^{min},\ 1 \le i \le t\right\}\right|}{t}
where p_{i}^{min} = \min_{e} p_{i}^{e} is the minimum of unadjusted p-values over the explored ensembles in the permutation replicate i and p_{0}^{min} = \min_{e} p_{0}^{e} is the minimum of raw p-values over the explored ensembles in the original case-control data set.
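A minimal sketch of the global p-value calculation is given below. It assumes that Fisher's test statistics have already been computed for every explored ensemble on the original data and on a shared set of permutation replicates, and that each replicate's unadjusted p-value is the fraction of replicate statistics (including its own) that are at least as large. The function names and the data layout are hypothetical.

```python
def raw_pvalue(t_ref, t_perm):
    """Fraction of permutation statistics that are at least as large as t_ref."""
    return sum(t >= t_ref for t in t_perm) / len(t_perm)


def global_pvalue(t_orig, t_perm):
    """Min-p adjustment over E ensembles using one shared set of permutations.

    t_orig[e]: Fisher statistic of ensemble e on the original data
    t_perm[e][i]: Fisher statistic of ensemble e on permutation replicate i
    """
    n_perm = len(t_perm[0])
    # Minimum raw p-value over the ensembles for the original data.
    p0_min = min(raw_pvalue(t0, perm) for t0, perm in zip(t_orig, t_perm))
    hits = 0
    for i in range(n_perm):
        # Treat replicate i as if it were the observed data set.
        pi_min = min(raw_pvalue(perm[i], perm) for perm in t_perm)
        if pi_min <= p0_min:
            hits += 1
    return hits / n_perm
```

This direct version costs on the order of E × t² comparisons; sorting each ensemble's replicate statistics once would reduce that, at the expense of a slightly longer listing. With a single ensemble the global p-value coincides with the raw p-value whenever the raw p-value is non-zero.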
Search for the best ensemble of two-locus analyses
The search for the best ensemble of two-locus analyses begins by selecting the SNP pair with the lowest Bonferroni-corrected χ^{2} p-value, which is part of the result from the first step of the algorithm. A permutation test is then performed for this two-locus analysis, yielding raw and global p-values that coincide because only one hypothesis has been explored. If the raw and global p-values of this first ensemble are statistically insignificant, the search terminates and the null hypothesis of no association cannot be rejected. Otherwise, the search continues by merging the SNP pair with the next lowest Bonferroni-corrected χ^{2} p-value into the current best ensemble and re-evaluating the raw and global p-values. The search continues progressively in this manner until either an increase in the raw or global p-value is observed or all possible SNP pairs are included in the ensemble. If the search terminates prior to the inclusion of all possible SNP pairs, the best ensemble is the one from the previous iteration.
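The greedy search can be sketched independently of how the p-values are computed. In the code below, `evaluate` is a hypothetical callback that returns the raw and global p-values of a candidate ensemble, and `ranked_pairs` is assumed to be a non-empty list of SNP pairs already sorted by Bonferroni-corrected p-value. Treating a p-value above α as insignificant is an assumption of this sketch.

```python
def search_best_ensemble(ranked_pairs, evaluate, alpha=0.05):
    """Greedy forward search over SNP pairs sorted by corrected p-value.

    ranked_pairs: SNP pairs in ascending order of Bonferroni-corrected p-value
    evaluate: function mapping an ensemble (list of pairs) to (raw_p, global_p)
    Returns (best_ensemble, (raw_p, global_p)); best_ensemble is None when the
    first ensemble is not significant.
    """
    best = [ranked_pairs[0]]
    raw_p, global_p = evaluate(best)
    if raw_p > alpha or global_p > alpha:
        return None, (raw_p, global_p)
    for pair in ranked_pairs[1:]:
        candidate = best + [pair]
        new_raw, new_global = evaluate(candidate)
        # Stop at the first increase and keep the ensemble from the previous step.
        if new_raw > raw_p or new_global > global_p:
            break
        best, raw_p, global_p = candidate, new_raw, new_global
    return best, (raw_p, global_p)


if __name__ == "__main__":
    # Hypothetical p-values indexed by ensemble size, for illustration only.
    toy = {1: (0.01, 0.01), 2: (0.005, 0.008), 3: (0.004, 0.02)}
    pairs = [(0, 1), (2, 3), (0, 2)]
    print(search_best_ensemble(pairs, lambda ens: toy[len(ens)]))
```

In the toy run the third pair raises the global p-value, so the search stops and returns the two-pair ensemble.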
In this study, the significance level (α) to determine whether an ensemble is associated with the disease is 0.05 and the number of permutation replicates is 10,000, which was shown to be sufficient in the earlier study (Wongseree et al. 2009). 2LOmb is publicly available from its homepage (Detecting Purely Epistatic Multilocus Interactions by an Omnibus Permutation Test on Ensembles of Two-locus Analyses 2009).
Java LINkage disequilibrium plotter
A Java LINkage disequilibrium plotter (JLIN) is a software package for the illustration of linkage disequilibrium patterns (Carter et al. 2006). JLIN is used to display D′ and r^{2} calculated for SNPs which are associated with T1D. JLIN is publicly available from the Centre for Genetic Epidemiology and Biostatistics, University of Western Australia (JLIN 2010).
Authors' information
DS is a Ph.D. student at the Department of Electrical and Computer Engineering, Faculty of Engineering, King Mongkut's University of Technology North Bangkok. He also received his B.Eng. degree in computer engineering from King Mongkut's University of Technology North Bangkok. His research interests include machine learning, bioinformatics and genetic epidemiology.
PT received his B.Eng. degree in computer engineering from King Mongkut's University of Technology North Bangkok. His research interests include machine learning and genetic epidemiology.
NJ is a software developer at the Department of Computer Engineering, Faculty of Engineering, King Mongkut's University of Technology Thonburi. He received his B.Eng. degree in computer engineering and M.Eng. degree in electrical engineering from King Mongkut's University of Technology North Bangkok. His research interests include high performance computing and bioinformatics.
SK received his B.Eng. degree in computer engineering from King Mongkut's University of Technology North Bangkok. His research interests include machine learning and genetic epidemiology.
WW is a lecturer at the Division of Technology of Information System Management, Faculty of Engineering, Mahidol University. He received his B.Eng., M.Eng. and Ph.D. degrees in electrical engineering from King Mongkut's University of Technology North Bangkok. His research interests include machine learning, evolutionary computation and bioinformatics.
TP is a postdoctoral researcher at the Department of Electrical and Computer Engineering, Faculty of Engineering, King Mongkut's University of Technology North Bangkok. He also received his B.Eng. and M.Eng. degrees in production engineering as well as his Ph.D. degree in electrical engineering from King Mongkut's University of Technology North Bangkok. His research interests include evolutionary multi-objective optimisation and machine learning.
TU received his B.Eng. and M.Eng. degrees in electrical engineering from King Mongkut's University of Technology North Bangkok. His research interests include machine learning, bioinformatics and genetic epidemiology.
CL is the Head of Division of Molecular Genetics at the Department of Research and Development, Faculty of Medicine Siriraj Hospital, Mahidol University. He also received his M.D. degree from Mahidol University. His research interests include human genetics and genetic diseases.
CA is an assistant professor of computer science at the Department of Mathematics and Computer Science, Chulalongkorn University. He also received his B.Eng., M.Eng. and Ph.D. degrees in computer engineering from Chulalongkorn University. His research interests include machine learning, evolutionary computation and bioinformatics.
MP is an assistant professor of computer engineering at the Department of Computer Engineering, Faculty of Engineering, King Mongkut's University of Technology Thonburi. He received his B.A. degree in electrical engineering from Brown University and received his M.Sc. and Ph.D. degrees in electrical and computer engineering from University of Wisconsin-Madison. His research interests include integrated circuit testing, fault tolerant systems, enterprise software development and high performance computing.
NC is an associate professor of electrical engineering at King Mongkut's University of Technology North Bangkok and an adjunct professor of genetic epidemiology at Mahidol University. He received his B.Eng. and Ph.D. degrees from the Department of Automatic Control and Systems Engineering, University of Sheffield. His research interests include evolutionary computation, machine learning and genetic epidemiology.
Abbreviations
2LOmb: Omnibus permutation test on ensembles of two-locus analyses
AH: Ancestral haplotype
ANOVA: Analysis of variance
ATAD1: ATPase family, AAA domain containing 1
dbSNP: Single Nucleotide Polymorphism Database
DDR1: Discoidin domain receptor tyrosine kinase 1
DNA: Deoxyribonucleic acid
FDR: False discovery rate
genomeSIM: Simulation package for generating case-control samples in genetic association studies
GYS2: Glycogen synthase 2 (liver)
HLA-DQA1: Major histocompatibility complex, class II, DQ alpha 1
JLIN: Java LINkage disequilibrium plotter
LD: Linkage disequilibrium
LMX1A: LIM homeobox transcription factor 1, alpha
MAF: Minor allele frequency
MDR: Multifactor dimensionality reduction
MHC: Major histocompatibility complex
MUC21: Mucin 21, cell surface associated
MUC22: Mucin 22
NCBI: National Center for Biotechnology Information
PARK2: Parkinson protein 2, E3 ubiquitin protein ligase (parkin)
PGM1: Phosphoglucomutase 1
PSORS1C1: Psoriasis susceptibility 1 candidate 1
RF: Random forest
SNP: Single nucleotide polymorphism
T1D: Type 1 diabetes mellitus
T2D: Type 2 diabetes mellitus
TCF19: Transcription factor 19
References
Breiman L: Random forests. Mach Learn 2001, 45: 532. 10.1023/A:1010933404324
Carter KW, McCaskie PA, Palmer LJ: JLIN: a java based linkage disequilibrium plotter. BMC Bioinformatics 2006, 7: 60. 10.1186/14712105760
Cho YM, Ritchie MD, Moore JH, Park JY, Lee KU, Shin HD, Lee HK, Park KS: Multifactordimensionality reduction shows a twolocus interaction associated with type 2 diabetes mellitus. Diabetologia 2004, 47: 549554. 10.1007/s0012500313213
Computational Genetics Laboratory at Dartmouth Medical School 2013.http://www.epistasis.org/
Cordell HJ: Epistasis: what it means, what it doesnâ€™t mean, and statistical methods to detect it in humans. Hum Mol Genet 2002, 11: 24632468. 10.1093/hmg/11.20.2463
Cordell HJ: Detecting genegene interactions that underlie human diseases. Nat Rev Genet 2009, 10: 392404. 10.1038/nrg2579
Cordell HJ, Todd JA, Hill NJ, Lord CJ, Lyons PA, Peterson LB, Wicker LS, Clayton DG: Statistical modeling of interlocus interactions in a complex disease: rejection of the multiplicative model of epistasis in type 1 diabetes. Genetics 2001, 158: 357367.
Cooper JD, Smyth DJ, Smiles AM, Plagnol V, Walker NM, Allen JE, Downes K, Barrett JC, Healy BC, Mychaleckyj JC, Warram JH, Todd JA: Metaanalysis of genomewide association study data identifies additional type 1 diabetes risk loci. Nat Genet 2008, 40: 13991401. 10.1038/ng.249
Culverhouse RC: A comparison of methods sensitive to interactions with small main effects. Genet Epidemiol 2012, 36: 303311. 10.1002/gepi.21622
Culverhouse R, Suarez BK, Lin J, Reich T: A perspective on epistasis: limits of models displaying no main effect. Am J Hum Genet 2002, 70: 461471. 10.1086/338759
Detecting Purely Epistatic Multilocus Interactions by an Omnibus Permutation Test on Ensembles of Twolocus Analyses 2009.http://code.google.com/p/nachol/wiki/DetectingPurelyEpistatic
Dudek SM, Motsinger AA, Velez DR, Williams SM, Ritchie MD: Data simulation software for wholegenome association and other studies in human genetics. In Proceedings of the Pacific Symposium on Biocomputing 2006: 3â€“7 January 2006;. Edited by: Altman RB, Dunker AK, Hunter L, Murray T, Klein TE. Maui, World Scientific, Singapore; 2006. pp 499â€“510
Edwards TL, Lewis K, Velez DR, Dudek SM, Ritchie MD: Exploring the performance of multifactor dimensionality reduction in large scale SNP studies and in the presence of genetic heterogeneity among epistatic disease models. Hum Hered 2009, 67: 183192. 10.1159/000181157
Epstein MP, Satten GA: Inference on haplotype effects in casecontrol studies using unphased genotype data. Am J Hum Genet 2003, 73: 13161329. 10.1086/380204
Evans DM, Marchini J, Morris AP, Cardon LR: Twostage twolocus models in genomewide association. PLoS Genet 2006, 2: e157. 10.1371/journal.pgen.0020157
Fisher RA: The correlation between relatives on the supposition of Mendelian inheritance. Trans R Soc Edinb 1918, 52: 399433.
GayÃ¡n J, GonzÃ¡lezPÃ©rez A, Bermudo F, SÃ¡ez ME, Royo JL, Quintas A, Galan JJ, MorÃ³n FJ, RamirezLorca R, Real LM, Ruiz A: A method for detecting epistasis in genomewide studies using casecontrol multilocus association analysis. BMC Genomics 2008, 9: 360. 10.1186/147121649360
Goldstein BA, Polley EC, Briggs FBS: Random forests for genetic association studies. Stat Appl Genet Mol Biol 2011, 10: 32.
Guyon I, Elisseef A: An introduction to feature extraction. In Feature extraction: foundations and applications. Edited by: Nikravesh M, Guyon I, Gunn S, Nikravesh M , Zadeh LA. Springer, Berlin, Heidelberg; 2006. pp 1â€“25. [Kacprzyk J (Series Editors): Studies in Fuzziness and Soft Computing, vol 207]
Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting gene-gene and gene-environment interactions. Bioinformatics 2003, 19: 376–382. 10.1093/bioinformatics/btf869
Hall MA, Holmes G: Benchmarking attribute selection techniques for discrete class data mining. IEEE Trans Knowl Data Eng 2003, 15: 1437–1447. 10.1109/TKDE.2003.1245283
Hallgrímsdóttir IB, Yuster DS: A complete classification of epistatic two-locus models. BMC Genet 2008, 9: 17.
Heidema AG, Boer JMA, Nagelkerke N, Mariman ECM, van der A DL, Feskens EJM: The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet 2006, 7: 23.
Herold C, Steffens M, Brockschmidt FF, Baur MP, Becker T: INTERSNP: genome-wide interaction analysis guided by a priori information. Bioinformatics 2009, 25: 3275–3281. 10.1093/bioinformatics/btp596
Hoh J, Wille A, Ott J: Trimming, weighting, and grouping SNPs in human case-control association studies. Genome Res 2001, 11: 2115–2119. 10.1101/gr.204001
Ionita I, Man M: Optimal two-stage strategy for detecting interacting genes in complex diseases. BMC Genet 2006, 7: 39.
Jiang X, Neapolitan RE: Mining pure, strict epistatic interactions from high-dimensional datasets: ameliorating the curse of dimensionality. PLoS One 2012, 7: e46771. 10.1371/journal.pone.0046771
Jiang X, Barmada MM, Cooper GF, Becich MJ: A Bayesian method for evaluating and discovering disease loci associations. PLoS One 2011a, 6: e22075. 10.1371/journal.pone.0022075
Jiang X, Neapolitan RE, Barmada MM, Visweswaran S: Learning genetic epistasis using Bayesian network scoring criteria. BMC Bioinformatics 2011b, 12: 89. 10.1186/1471-2105-12-89
JLIN: A java based linkage disequilibrium plotter. 2010. http://www.genepi.meddent.uwa.edu.au/software/jlin/
Kwon MS, Kim K, Lee S, Park T: cuGWAM: genome-wide association multifactor dimensionality reduction using CUDA-enabled high-performance graphics processing unit. Int J Data Min Bioinform 2012, 6: 471–481. 10.1504/IJDMB.2012.049301
Li W, Reich J: A complete enumeration and classification of two-locus disease models. Hum Hered 2000, 50: 334–349. 10.1159/000022939
Liu Y, Xu H, Chen S, Chen X, Zhang Z, Zhu Z, Qin X, Hu L, Zhu J, Zhao GP, Kong X: Genome-wide interaction-based association analysis identified multiple new susceptibility loci for common diseases. PLoS Genet 2011, 7: e1001338. 10.1371/journal.pgen.1001338
Lunetta KL, Hayward LB, Segal J, van Eerdewegh P: Screening large-scale association study data: exploiting interactions using random forests. BMC Genet 2004, 5: 32.
Marchini J, Donnelly P, Cardon LR: Genome-wide strategies for detecting multiple loci that influence complex diseases. Nat Genet 2005, 37: 413–417. 10.1038/ng1537
Meng YA, Yu Y, Cupples LA, Farrer LA, Lunetta KL: Performance of random forest when SNPs are in linkage disequilibrium. BMC Bioinformatics 2009, 10: 78. 10.1186/1471-2105-10-78
Moore JH: A global view of epistasis. Nat Genet 2005, 37: 13–14. 10.1038/ng0105-13
Moore JH, White BC: Tuning ReliefF for genome-wide genetic analysis. In Evolutionary computation, machine learning and data mining in bioinformatics. Edited by: Marchiori E, Moore JH, Rajapakse JC. Springer, Berlin, Heidelberg; 2007. pp 166–175. [Goos G, Hartmanis J, van Leeuwen J (Founding and Former Series Editors): Lecture Notes in Computer Science, vol 4447]
Moran JB, Graeber MB: Towards a pathway definition of Parkinson’s disease: a complex disorder with links to cancer, diabetes and inflammation. Neurogenetics 2008, 9: 1–13. 10.1007/s10048-007-0116-y
Motsinger AA, Ritchie MD, Reif DM: Novel methods for detecting epistasis in pharmacogenomics studies. Pharmacogenomics 2007, 8: 1229–1241. 10.2217/14622416.8.9.1229
Musani SK, Shriner D, Liu N, Feng R, Coffey CS, Yi N, Tiwari HK, Allison DB: Detection of gene × gene interactions in genome-wide association studies of human population data. Hum Hered 2007, 63: 67–84. 10.1159/000099179
Neuman RJ, Rice JP: Two-locus models of disease. Genet Epidemiol 1992, 9: 347–365. 10.1002/gepi.1370090506
Pattin KA, Moore JH: Exploiting the proteome to improve the genome-wide genetic analysis of epistasis in common human diseases. Hum Genet 2008, 124: 19–29. 10.1007/s00439-008-0522-8
Random Forests 2004. http://www.stat.berkeley.edu/~breiman/RandomForests/
Ritchie Lab 2013. http://ritchielab.psu.edu/
Ritchie MD: Using biological knowledge to uncover the mystery in the search for epistasis in genome-wide association studies. Ann Hum Genet 2011, 75: 172–182. 10.1111/j.1469-1809.2010.00630.x
Ritchie MD, Edwards TL, Fanelli TJ, Motsinger AA: Genetic heterogeneity is not as threatening as you might think. Genet Epidemiol 2007, 31: 797–800. 10.1002/gepi.20256
Ritchie MD, Hahn LW, Moore JH: Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet Epidemiol 2003, 24: 150–157. 10.1002/gepi.10218
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J Hum Genet 2001, 69: 138–147. 10.1086/321276
Saeys Y, Inza I, Larrañaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23: 2507–2517. 10.1093/bioinformatics/btm344
Santiago JL, Li W, Lee A, Martinez A, Chandrasekaran A, Fernandez-Arquero M, Khalili H, de la Concha EG, Urcelay E, Gregersen PK: Localization of type 1 diabetes susceptibility in the ancestral haplotype 18.2 by high density SNP mapping. Genomics 2009, 94: 228–232. 10.1016/j.ygeno.2009.06.007
Schork NJ, Boehnke M, Terwilliger JD, Ott J: Two-trait-locus linkage analysis: a powerful strategy for mapping complex genetic traits. Am J Hum Genet 1993, 53: 1127–1136.
Schwarz DF, König IR, Ziegler A: On safari to random jungle: a fast implementation of random forests for high-dimensional data. Bioinformatics 2010, 26: 1752–1758. 10.1093/bioinformatics/btq257
Sha Q, Zhang Z, Schymick JC, Traynor BJ, Zhang S: Genome-wide association reveals three SNPs associated with sporadic amyotrophic lateral sclerosis through a two-locus analysis. BMC Med Genet 2009, 10: 86.
Storey JD, Tibshirani R: Statistical significance for genomewide studies. Proc Natl Acad Sci USA 2003, 100: 9440–9445. 10.1073/pnas.1530509100
Strobl C, Zeileis A: Danger: high power! – exploring the statistical properties of a test for random forest variable importance. In COMPSTAT 2008 – Proceedings in Computational Statistics, Volume 2: 24–29 August 2008. Edited by: Brito P. Porto, Physica-Verlag, Heidelberg; 2008. pp 59–66
Strobl C, Malley J, Tutz G: An introduction to recursive partitioning: rationale, application and characteristics of classification and regression trees, bagging and random forests. Psychol Methods 2009, 14: 323–348.
The Diabetes Genetics Replication and Meta-analysis Consortium: Large-scale association analysis provides insights into the genetic architecture and pathophysiology of type 2 diabetes. Nat Genet 2012, 44: 981–990. 10.1038/ng.2383
The International HapMap Consortium: A haplotype map of the human genome. Nature 2005, 437: 1299–1320. 10.1038/nature04226
The Wellcome Trust Case Control Consortium: Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447: 661–678. 10.1038/nature05911
Van Steen K: Travelling the world of gene-gene interactions. Brief Bioinform 2012, 13: 1–19. 10.1093/bib/bbr012
Verhoeven KJF, Casella G, McIntyre LM: Epistasis: obstacle or advantage for mapping complex traits? PLoS One 2010, 5: e12264. 10.1371/journal.pone.0012264
Wei C, Schaid DJ, Lu Q: Trees assembling Mann-Whitney approach for detecting genome-wide joint association among low-marginal-effect loci. Genet Epidemiol 2013, 37: 84–91. 10.1002/gepi.21693
Wongseree W, Assawamakin A, Piroonratana T, Sinsomros S, Limwongse C, Chaiyaratana N: Detecting purely epistatic multilocus interactions by an omnibus permutation test on ensembles of two-locus analyses. BMC Bioinformatics 2009, 10: 294. 10.1186/1471-2105-10-294
Zhang Y, Liu JS: Bayesian inference of epistatic interactions in case-control studies. Nat Genet 2007, 39: 1167–1173. 10.1038/ng2110
Zhang Z, Zhang S, Wong MY, Wareham NJ, Sha Q: An ensemble learning approach jointly modeling main and interaction effects in genetic association studies. Genet Epidemiol 2008, 32: 285–300. 10.1002/gepi.20304
Acknowledgements
The authors are extremely grateful to two anonymous reviewers and Prof. Justine Shults for their valuable comments and suggestions, which have contributed a lot towards improving the content and presentation of this article. This study makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of investigators who contributed to the generation of data is available from http://www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113. DS was supported by the Thailand Research Fund (TRF) through the Royal Golden Jubilee Ph.D. Programme (Grant No. PHD/1.E.KN.51/A.1). TU was supported by the Faculty of Engineering of the King Mongkut’s University of Technology North Bangkok. CL was supported by the Mahidol Research Grant. NC was supported by the Thailand Research Fund, Office of the Higher Education Commission and Faculty of Engineering of the King Mongkut’s University of Technology North Bangkok.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
DS performed the small-scaled simulations, large-scaled simulations and statistical analysis. PT performed the small-scaled simulations and monitored the execution of computer programs on the computer server. NJ performed the large-scaled simulations, analysed the T1D data and monitored the execution of 2LOmb on the computer system with a graphics processing unit. SK performed the small-scaled simulations and summarised the results. WW performed the statistical analysis and provided comments about genetic association studies. TP performed the statistical analysis and provided comments about experimental design. TU performed the small-scaled simulations and commented on the results. CL provided additional comments about the genetic association study of T1D. CA provided comments about the manuscript. MP assisted in parallelising 2LOmb and handling large-scaled data. NC conducted the literature survey, formulated the research question, designed the experiment, discussed all results, drew the conclusions and wrote the manuscript. All authors read and approved the final manuscript.
Rights and permissions
Open Access. This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Setsirichok, D., Tienboon, P., Jaroonruang, N. et al. An omnibus permutation test on ensembles of two-locus analyses can detect pure epistasis and genetic heterogeneity in genome-wide association studies. SpringerPlus 2, 230 (2013). https://doi.org/10.1186/2193-1801-2-230
DOI: https://doi.org/10.1186/2193-1801-2-230
Keywords
Attribute selection
Complex disease
Epistasis
Genetic heterogeneity
Genome-wide association study
Pattern recognition
Permutation test
Single nucleotide polymorphism
Type 1 diabetes mellitus