Inference of biological networks using Bi-directional Random Forest Granger causality
© Furqan and Siyal. 2016
Received: 8 February 2016
Accepted: 13 April 2016
Published: 26 April 2016
Granger causality based on ordinary least squares is one of the most widely used methods for detecting causal interactions between time series. However, the high-dimensional data produced by recent advances in acquisition technology limit the usefulness of some existing implementations. In this paper, we propose a technique called Bi-directional Random Forest Granger causality. It uses random forest regularization together with the idea of reusing the time series data by reversing the time stamps to extract more causal information. We demonstrate the effectiveness of the proposed method on simulated data and then apply it to two real biological datasets, i.e., fMRI and HeLa cell. The fMRI data were used to map the brain network involved in deductive reasoning, while the HeLa cell dataset was used to map the gene network involved in cancer.
Keywords: Biological network; Brain connectivity; Gene network; Random forest; Granger causality
The concept of causal influence dates back to 1956, when Wiener (1956) proposed that if including the information of one time series improves the prediction of another time series, then the first series has a causal influence on the second. More than a decade later, the same concept was formalized in practical terms by Granger (1969) for studying causal interactions between financial time series. More recently, Granger causality has also been used in bioinformatics for studying brain connectivity maps (Ding et al. 2006; Hu and Liang 2014; Lang et al. 2012; Liao et al. 2011), gene networks (Michailidis and d’Alche-Buc 2013; Tam et al. 2012), and more.
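To make the prediction-improvement idea concrete, the following minimal Python sketch compares a lag-1 autoregressive model of a target series with a model that also includes the lagged candidate driver, and reports the log ratio of the two residual variances. The function names, the fixed lag of 1, the omission of an intercept and the toy data are assumptions made for illustration; they are not taken from the original work.

```python
import math
import random


def _ols_residual_variance(columns, target):
    """Least-squares fit of target on the regressor columns (no intercept).

    Solves the normal equations by Gauss-Jordan elimination and returns the
    mean squared residual.
    """
    k = len(columns)
    n = len(target)
    # Augmented normal-equation matrix [X'X | X'y].
    a = [[sum(columns[i][t] * columns[j][t] for t in range(n)) for j in range(k)]
         + [sum(columns[i][t] * target[t] for t in range(n))]
         for i in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    beta = [a[i][k] / a[i][i] for i in range(k)]
    residuals = [target[t] - sum(beta[i] * columns[i][t] for i in range(k))
                 for t in range(n)]
    return sum(e * e for e in residuals) / n


def granger_index(source, target):
    """Log ratio of residual variances: own-past model vs model with source."""
    now = target[1:]
    own_past = target[:-1]
    source_past = source[:-1]
    restricted = _ols_residual_variance([own_past], now)
    full = _ols_residual_variance([own_past, source_past], now)
    return math.log(restricted / full)


# Toy data: x drives y with a one-step delay, but not the other way round.
rng = random.Random(0)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [0.0]
for t in range(1, 500):
    y.append(0.8 * x[t - 1] + 0.1 * rng.gauss(0, 1))

x_to_y = granger_index(x, y)  # large: the past of x helps predict y
y_to_x = granger_index(y, x)  # close to zero: the past of y does not help
```

In this toy setting the index from x to y is much larger than the index from y to x, which is the asymmetry that Granger causality analysis looks for.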
However, with advances in technology, data acquisition techniques can now measure multiple variables simultaneously and produce high-dimensional data. Because Granger's formulation uses the ordinary least squares (OLS) method for evaluating causality, it is not a viable option for high-dimensional data: OLS requires fewer variables than observed time points. To overcome this limitation, several alternatives have been discussed, including regularization techniques (Shojaie and Michailidis 2010; Tang et al. 2012; Valdés-Sosa et al. 2005), kernel-based methods (Liu et al. 2014; Marinazzo et al. 2008) and neural network based methods (Montalto et al. 2015).
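The limitation can be stated in terms of the standard OLS estimator. In the sketch below, the symbols are notation introduced here for illustration and do not come from the original text: $X$ is the $T \times m$ matrix of lagged regressors, $y$ the vector of target values, $T$ the number of time points and $m$ the number of lagged variables.

```latex
\hat{\beta} = (X^{\top} X)^{-1} X^{\top} y,
\qquad X \in \mathbb{R}^{T \times m},
\qquad \operatorname{rank}(X^{\top} X) \le \min(T, m)
```

When $m > T$, the $m \times m$ matrix $X^{\top} X$ has rank at most $T$ and cannot be inverted, so the OLS solution is no longer unique. Regularized estimators constrain the solution so that a usable estimate can still be obtained.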
Recently, two viable options were discussed by Furqan and Siyal (2015) and Cheng et al. (2014). Furqan and Siyal (2015) proposed using Random Forest as a regularization technique for evaluating Granger causality, whereas Cheng et al. (2014) proposed a LASSO-based method that reuses the time series data by reversing its time stamps. The concept of time reversal has also been discussed and used by other researchers, including Haufe et al. (2012) and Hu et al. (2015).
In this paper, we propose an improved method that combines Random Forest Granger causality with the reuse of time series data. We call it Bi-directional Random Forest Granger causality. The proposed method has higher precision and efficiency than the existing LASSO-based method proposed by Cheng et al. (2014). To demonstrate the improvement, we applied both methods to simulated data before mapping two real biological networks, i.e., a gene network and a brain network.
Random Forest Granger causality
Naïve Forward Backward LASSO Granger causality
Cheng et al. (2014) proposed Naïve Forward Backward LASSO Granger causality, which handles the shortage of data by reusing the time series after reversing its time stamps. In explaining their method, they assume that the original time series satisfies all conditions necessary for Granger causality analysis, as studied in Bahadori and Liu (2013) and Eichler (2011), including linearity and stationarity of the time series. Once these conditions are validated, they propose to use the pseudo code discussed below, which uses the LASSO-based Granger causality analysis algorithm available at Bahadori (2014).
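The data-reuse idea can be illustrated with a minimal Python sketch: the same lag-embedding routine is applied to a series in its original order and again after reversing the time index, which yields a second set of regression samples from the same recording. The helper name `lagged_samples` and the toy series are hypothetical, and the sketch is not the pseudo code of Cheng et al. (2014).

```python
def lagged_samples(series, order):
    """Return (past_window, current_value) pairs for an autoregression.

    past_window lists the `order` preceding values, most recent first.
    """
    samples = []
    for t in range(order, len(series)):
        past = [series[t - k] for k in range(1, order + 1)]
        samples.append((past, series[t]))
    return samples


series = [1, 2, 3, 4, 5]

# Forward direction: each value is predicted from the values before it.
forward = lagged_samples(series, order=2)

# Backward direction: reverse the time stamps and reuse the same routine, so
# each value is now "predicted" from the values that originally followed it.
backward = lagged_samples(series[::-1], order=2)
```

Here `forward` contains the samples `([2, 1], 3)`, `([3, 2], 4)` and `([4, 3], 5)`, while `backward` contains `([4, 5], 3)`, `([3, 4], 2)` and `([2, 3], 1)`.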
Bi-directional Random Forest Granger causality
Based on the findings of Naïve Forward Backward LASSO Granger causality and Random Forest Granger causality, we propose to use Random Forest Granger causality together with the concept of reusing the time series data by reversing its time stamps, in order to maximize the advantages in terms of precision, false discovery rate, recall, and F1-score. The pseudo code for evaluating Bi-directional Random Forest Granger causality is as follows:
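As a rough illustration only, and not the authors' pseudo code, the overall forward and backward flow might be sketched in Python as below. Two loud assumptions apply. First, the random forest based Granger score is not implemented here; a simple placeholder scorer (squared lag-1 correlation) stands in for it. Second, the rule for combining the two passes (transposing the backward score matrix and averaging it with the forward one) is an assumption of this sketch and is not stated in the text. All function names are hypothetical.

```python
import random


def lagged_dependence(source, target):
    """Placeholder scorer: squared correlation of source[t-1] with target[t].

    Stands in for a random-forest-based Granger score, which this sketch
    does not implement.
    """
    past, now = source[:-1], target[1:]
    n = len(now)
    mean_past, mean_now = sum(past) / n, sum(now) / n
    cov = sum((p - mean_past) * (q - mean_now) for p, q in zip(past, now))
    var_past = sum((p - mean_past) ** 2 for p in past)
    var_now = sum((q - mean_now) ** 2 for q in now)
    return cov * cov / (var_past * var_now)


def score_matrix(data, scorer):
    """scores[i][j] = strength of the directed link from variable i to j."""
    m = len(data)
    return [[0.0 if i == j else scorer(data[i], data[j]) for j in range(m)]
            for i in range(m)]


def bidirectional_scores(data, scorer):
    """Average forward-time scores with scores from the time-reversed data.

    Assumption of this sketch: reversing time swaps the roles of source and
    target, so the backward matrix is transposed before averaging.
    """
    forward = score_matrix(data, scorer)
    reversed_data = [series[::-1] for series in data]
    backward = score_matrix(reversed_data, scorer)
    m = len(data)
    return [[0.5 * (forward[i][j] + backward[j][i]) for j in range(m)]
            for i in range(m)]


# Toy data: variable 0 drives variable 1 with a one-step delay.
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(400)]
y = [0.0] + [0.9 * x[t - 1] + 0.2 * rng.gauss(0, 1) for t in range(1, 400)]

scores = bidirectional_scores([x, y], lagged_dependence)
```

In the toy example the combined score for the link from variable 0 to variable 1 is high and the score for the opposite direction is near zero. Any pairwise scorer with the same signature could be passed in place of the placeholder.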
We implemented the basic Random Forest method in MATLAB with the help of an R package (Breiman 2001). We then merged the implemented code with the Granger causal connectivity analysis (GCCA) toolbox (Seth 2010), which uses the BSMART toolbox (Cui et al. 2008), for evaluating Granger causality. We used the Akaike Information Criterion (AIC), as discussed by Akaike (1974), for VAR model order selection.
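For reference, one commonly used multivariate form of the AIC for choosing the VAR order is sketched below. The symbols are notation introduced here rather than taken from the original: $p$ is the candidate model order, $n$ the number of variables, $T$ the number of time points and $\hat{\Sigma}_p$ the estimated residual covariance matrix of the order-$p$ model. Whether the toolboxes named above use exactly this form is not stated in the text.

```latex
\mathrm{AIC}(p) = \ln \det \hat{\Sigma}_p + \frac{2 p n^{2}}{T},
\qquad \hat{p} = \arg\min_{p} \mathrm{AIC}(p)
```

The first term rewards goodness of fit and the second penalizes the $p n^{2}$ autoregressive coefficients, so the selected order balances the two.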
Real fMRI dataset
In this paper, we used the StarPlus dataset, which was collected to study how the brain works during human deductive reasoning. The dataset was collected by Keller et al. (2001) and can be freely accessed from Mitchell and Wang (2001).
In this dataset, 13 normal subjects were studied using 40 trials per subject. Each trial consists of two major segments. In the first segment, the subject was presented with a visual stimulus in the form of an image for 4 s, followed by a 4-s blank screen. In the next segment, another visual stimulus was presented for 4 s in the form of a sentence, which may or may not be related to the image. This stimulus was also followed by a 4-s blank screen. After both stimuli, the subject was asked to decide whether the image and the sentence were related. Each subject was allowed to rest for 15 s before the start of the next trial.
To introduce randomness into the experiment, the 40 trials were divided into two parts of 20 trials each. In 20 trials, subjects were shown the image first and then the sentence, whereas in the remaining 20 trials the order of image and sentence was reversed. Further information on the experimental settings, sentences, and pictures is not discussed here; see Keller et al. (2001).
While these trials were performed, T2-weighted fMRI images were collected using a 3T Signa scanner at an interval of 500 ms, with TE = 18 ms and a flip angle of 50°. These settings yield images with approximately 5000 voxels per subject in 8 oblique axial slices, arranged in two non-contiguous four-slice volumes. The first volume captures prefrontal areas and superior parietal regions, while the second covers posterior temporal, inferior frontal and occipital areas.
After the T2-weighted fMRI images were acquired for each subject, they were pre-processed using the FIASCO program (Eddy et al. 1999). This pre-processing helps to reduce artifacts that arise during image acquisition due to signal drift, head motion, and other sources.
After pre-processing of the images, 25 anatomical regions of interest (ROIs) were selected: left dorsolateral prefrontal cortex (LDLPFC) and right dorsolateral prefrontal cortex (RDLPFC), calcarine sulcus (CALC), left frontal eye fields (LFEF), right frontal eye fields (RFEF), left inferior parietal lobule (LIPL), right inferior parietal lobule (RIPL), left intraparietal sulcus (LIPS), right intraparietal sulcus (RIPS), left inferior frontal gyrus (LIFG), left opercularis (LOPER), right opercularis (ROPER), supplementary motor areas (SMA), left and right inferior temporal lobule (LIT, RIT), left and right posterior precentral sulcus (LPPREC, RPPREC), left and right supramarginal gyrus (LSGA, RSGA), left temporal lobe (LT), right temporal lobe (RT), left and right triangularis (LTRIA, RTRIA), left superior parietal lobule (LSPL) and right superior parietal lobule (RSPL). However, we restricted our study to the 7 ROIs that other researchers have used and recommended as more relevant (Furqan and Siyal 2015; Wang and Mitchell 2002): LIPL, LDLPFC, CALC, LTRIA, LT, LOPER, and LIPS.
Real HeLa dataset
The HeLa human cancer cell line dataset used in our study was compiled by Whitfield et al. (2002) by performing a series of experiments using the DNA microarray technique. These experimental results are freely available (Whitfield et al. 2000).
In our study, we used their Experiment 3 dataset to further demonstrate the effectiveness of our method, as this dataset has also been commonly used by other researchers (Hlavácková-Schindler and Bouzari 2013; Lozano et al. 2009). The Experiment 3 dataset identifies more than 1100 genes that are intermittently expressed during the cancer cell cycle. Based on the recommendations of other researchers (Hlavácková-Schindler and Bouzari 2013; Ogutu et al. 2012), we used 19 preselected genes: PCNA, NPAT, E2F1, CCNE1, CDC25A, CDKN1A, BRCA1, CCNF, CCNA2, CDC20, STK15, BUB1B, CKS2, CDC25C, PLK1, CCNB1, CDC25B, TYMS, and DHFR.
As the observation points are not uniformly sampled, the data were first interpolated using cubic smoothing splines (Green and Silverman 1994), as recommended by Hlavácková-Schindler and Bouzari (2013) and Ogutu et al. (2012), before being used in our study.
Results and discussion
These findings suggest that our proposed method outperformed the existing method on all measures, with a significant improvement in recall. Our proposed method shows a 20 % improvement in recall compared to the existing LASSO-based method.
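For clarity, the evaluation measures named in this paper can be computed from an inferred edge set and a known ground-truth edge set, as in the following minimal Python sketch. The function name and the toy edge sets are hypothetical; the numbers are not results from this study.

```python
def edge_metrics(predicted, actual):
    """Precision, recall, false discovery rate and F1-score for directed edges."""
    predicted, actual = set(predicted), set(actual)
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    fdr = 1.0 - precision if predicted else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return {"precision": precision, "recall": recall, "fdr": fdr, "f1": f1}


# Toy example: edges are (source, target) pairs, so direction matters.
truth = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")}
inferred = {("A", "B"), ("B", "C"), ("D", "A")}

metrics = edge_metrics(inferred, truth)
```

In this toy case two of the three inferred edges are correct and two of the four true edges are recovered, giving a precision of 2/3, a recall of 1/2, a false discovery rate of 1/3 and an F1-score of 4/7.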
During this study, we observed that the proposed method is less sensitive to outliers than the LASSO-based method. This insensitivity to outliers is an inherent advantage of regularized tree methods. We also observed that the proposed method depends strongly on selecting the right number of features and number of trees. In this study, we used 10 features and 500 trees. However, further studies are required to establish an ideal relationship between the number of features and the number of trees.
HeLa cell dataset
As there is no way to verify the resultant network, we used the Biological General Repository for Interaction Datasets (BioGRID) database (Chatr-aryamontri et al. 2014) to look for gene interactions that had already been reported. BioGRID is a public database that archives and disseminates genetic and protein interaction data from model organisms and humans. Given the above network map, we were able to find 6 out of 16 interactions, which yields around 37 % precision and a 63 % false discovery rate. These statistics are in line with the results of the simulated dataset where BRFGC produces 28 % precision and 63 % false discovery rate.
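The HeLa figures follow directly from the two counts, as the short Python check below shows. The variable names are chosen here for illustration, and interactions not found in the database are counted as false discoveries, as in the text.

```python
confirmed = 6    # inferred interactions that were also found in the database
inferred = 16    # total interactions in the inferred network

precision = confirmed / inferred          # 0.375
false_discovery_rate = 1.0 - precision    # 0.625

print(f"precision = {precision:.1%}, FDR = {false_discovery_rate:.1%}")
# prints: precision = 37.5%, FDR = 62.5%
```

These are the values that the text reports as approximately 37 % and 63 %.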
StarPlus fMRI dataset
Other regions of interest are the left opercularis (LOPER) and left triangularis (LTRIA), which are also called Brodmann area 44 and Brodmann area 45 (Nishitani et al. 2005); together they constitute Broca’s region. Broca’s region is associated with the processing of words, pseudo-words, and non-words during different parts of reading and their interaction, as discussed in Heim et al. (2005).
The left dorsolateral prefrontal cortex (LDLPFC) is associated with the manipulation of auditory and spatial information in working memory (Barbey et al. 2013), whereas the left inferior parietal lobule (LIPL) is necessary for comparison (Chochon et al. 1999), memory related to motor processes (e.g., movement of the hand), mechanical and technical reasoning associated with the use of objects (van Elk 2014) and more. The remaining region under consideration is the left temporal lobe (LT), which is mainly associated with the primary organization of sensory inputs (Read 1981).
Based on the functional knowledge of the regions of interest, our resulting network in Fig. 3 suggests that the connection between CALC and LIPS transfers visual information (the picture or sentence displayed on screen), while the bi-directional link between LOPER and LIPS signifies a feedback link for recognizing objects and words. The connection between Brodmann areas 44 and 45 shows the movement of information from area 44 to area 45 for further processing.
The other links, such as those from Brodmann area 45, represent the transfer of information to and from LDLPFC, LIPL and LT for further processing to evaluate the meaning, relation and deduction of the task performed. The remaining bidirectional links, LIPL ↔ LDLPFC and LT ↔ LDLPFC, exchange information related to the movement of the finger for registering the answer to the task.
In this paper, we have proposed an improved method called Bi-directional Random Forest Granger causality. It takes advantage of Random Forest regularization to handle dimensionality issues and, at the same time, uses time-stamp reversal to mitigate the data shortage problem. Using a simulated dataset, we have shown the effectiveness of the proposed method. We then applied it to the real StarPlus fMRI dataset to study the network involved in human deductive reasoning, and to the real HeLa cell dataset to map the gene network involved in cancer. In future, this method can be used in other areas such as econometrics and social networking.
MSF conceived, implemented, and analyzed the study presented in this paper. SMY coordinated and helped to draft the manuscript. All authors read and approved the final manuscript.
The authors thank K. Hlavácková-Schindler for help in understanding the HeLa cell dataset.
The authors declare that they have no competing interests.
Although this study involves human participants, formal consent or ethical committee approval is not required, as the experimental data used in this research were not collected by the current authors and are freely accessible. The HeLa cell genetic data were acquired from the published article of Whitfield et al. and can be accessed from http://genome-www.stanford.edu/Human-CellCycle/Hela/data.shtml. Similarly, the StarPlus fMRI data were acquired from the published work of Mitchell et al. and can be accessed freely from http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www/.
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723. https://doi.org/10.1109/TAC.1974.1100705
- Bahadori MT (2014) Lasso-Granger. http://www-scf.usc.edu/~mohammab/codes/codes.html
- Bahadori MT, Liu Y (2013) An examination of practical granger causality inference. Paper presented at the 2013 SIAM international conference on data mining, Austin, Texas, USA
- Barbey AK, Koenigs M, Grafman J (2013) Dorsolateral prefrontal contributions to human working memory. Cortex 49(5):1195–1205. https://doi.org/10.1016/j.cortex.2012.05.022
- Breiman L (2001) Random forests. Mach Learn 45:5–32
- Chatr-Aryamontri A, Breitkreutz BJ, Oughtred R, Boucher L, Heinicke S, Chen D, Stark C, Breitkreutz A, Kolas N, O'Donnell L, Reguly T, Nixon J, Ramage L, Winter A, Sellam A, Chang C, Hirschman J, Theesfeld C, Rust J, Livstone MS, Dolinski K, Tyers M (2015) The BioGRID interaction database: 2015 update. Nucleic Acids Res 43(Database issue):D470–D478. https://doi.org/10.1093/nar/gku1204
- Cheng D, Bahadori MT, Liu Y (2014) FBLG: a simple and effective approach for temporal dependence discovery from time series data. Paper presented at the Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining, New York, USA
- Chochon F, Cohen L, Van De Moortele P, Dehaene S (1999) Differential contributions of the left and right inferior parietal lobules to number processing. J Cogn Neurosci 11(6):617–630
- Cui J, Xu L, Bressler SL, Ding M, Liang H (2008) BSMART: a Matlab/C toolbox for analysis of multichannel neural time series. Neural Netw 21(8):1094–1104. https://doi.org/10.1016/j.neunet.2008.05.007
- Ding M, Chen Y, Bressler SL (2006) Granger causality: basic theory and application to neuroscience. In: Handbook of time series analysis. Wiley, New York, pp 438–460
- Eddy WF, Fitzgerald M, Genovese C, Lazar N, Mockus A, Welling J (1999) The challenge of functional magnetic resonance imaging. J Comput Graph Stat 8(3):545–558
- Eichler M (2011) Graphical modelling of multivariate time series. Probab Theory Relat Fields 153(1):233–268. https://doi.org/10.1007/s00440-011-0345-8
- Furqan MS, Siyal MY (2015) Random Forest Granger causality for detection of effective brain connectivity using high dimensional data. J Integr Neurosci 14:1–12. https://doi.org/10.1142/S0219635216500035
- Granger CWJ (1969) Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37(3):424–438
- Green PJ, Silverman BW (1994) Nonparametric regression and generalized linear models: a roughness penalty approach. Chapman and Hall, New York
- Haufe S, Nikulin VV, Nolte G (2012) Alleviating the influence of weak data asymmetries on granger-causal analyses. In: Theis F, Cichocki A, Yeredor A, Zibulevsky M (eds) Latent variable analysis and signal separation: 10th international conference, LVA/ICA 2012, Tel Aviv, Israel, March 12–15, 2012. Proceedings. Springer, Berlin, pp 25–33
- Heim S, Alter K, Ischebeck AK, Amunts K, Eickhoff SB, Mohlberg H, Zilles K, von Cramon DY, Friederici AD (2005) The role of the left Brodmann’s areas 44 and 45 in reading words and pseudowords. Cogn Brain Res 25(3):982–993. https://doi.org/10.1016/j.cogbrainres.2005.09.022
- Hlavácková-Schindler K, Bouzari H (2013) Granger Lasso Causal Models in higher dimensions-application to gene expression regulatory networks. ECML/PKDD 2013 workshop scalable decision making: uncertainty, imperfection, deliberation (SCALE)
- Hu M, Liang H (2014) A copula approach to assessing Granger causality. NeuroImage 100:125–134. https://doi.org/10.1016/j.neuroimage.2014.06.013
- Hu M, Li W, Liang H (2015) A copula-based Granger causality measure for the analysis of neural spike train data. IEEE/ACM Trans Comput Biol Bioinf. https://doi.org/10.1109/TCBB.2014.2388311
- Keller TA, Just MA, Stenger VA (2001) Reading span and the time-course of cortical activation in sentence-picture verification. Paper presented at the annual convention of the Psychonomic Society, Orlando, FL
- Lang EW, Tomé AM, Keck IR, Górriz-Sáez JM, Puntonet CG (2012) Brain connectivity analysis: a short survey. Comput Intell Neurosci. https://doi.org/10.1155/2012/412512
- Liao W, Ding J, Marinazzo D, Xu Q, Wang Z, Yuan C, Zhang Z, Lu G, Chen H (2011) Small-world directed networks in the human brain: multivariate Granger causality analysis of resting-state fMRI. NeuroImage 54(4):2683–2694. https://doi.org/10.1016/j.neuroimage.2010.11.007
- Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
- Liu J, Xu Y, Cheng J, Zhang Z, Wong D, Yin F, Wong T (2014) Multiple modality fusion for glaucoma diagnosis. In: Zhang Y-T (ed) The international conference on health informatics, vol 42. Springer, Berlin, pp 5–8
- Lozano AC, Abe N, Liu Y, Rosset S (2009) Grouped graphical Granger modeling for gene expression regulatory networks discovery. Bioinformatics 25(12):i110–i118. https://doi.org/10.1093/bioinformatics/btp199
- Marinazzo D, Pellicoro M, Stramaglia S (2008) Kernel-Granger causality and the analysis of dynamical networks. Phys Rev E 77(5):056215. https://doi.org/10.1103/PhysRevE.77.056215
- Meadows M-E (2011) Calcarine cortex. In: Kreutzer J, DeLuca J, Caplan B (eds) Encyclopedia of clinical neuropsychology. Springer, New York, p 472
- Michailidis G, d’Alche-Buc F (2013) Autoregressive models for gene regulatory network inference: sparsity, stability and causality issues. Math Biosci 246(2):326–334. https://doi.org/10.1016/j.mbs.2013.10.003
- Mitchell T, Wang W (2001) StarPlus fMRI data. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-81/www/
- Montalto A, Stramaglia S, Faes L, Tessitore G, Prevete R, Marinazzo D (2015) Neural networks with non-uniform embedding and explicit validation phase to assess Granger causality. Neural Netw 71:159–171. https://doi.org/10.1016/j.neunet.2015.08.003
- Nishitani N, Schürmann M, Amunts K, Hari R (2005) Broca’s region: from action to language. Physiology 20(1):60–69. https://doi.org/10.1152/physiol.00043.2004
- Ogutu JO, Schulz-Streeck T, Piepho HP (2012) Genomic selection using regularized linear regression models: ridge regression, lasso, elastic net and their extensions. BMC Proc 6(Suppl 2):S10. https://doi.org/10.1186/1753-6561-6-S2-S10
- Read DE (1981) Solving deductive-reasoning problems after unilateral temporal lobectomy. Brain Lang 12(1):116–127. https://doi.org/10.1016/0093-934X(81)90008-0
- Schelter B, Winterhalder M, Eichler M, Peifer M, Hellwig B, Guschlbauer B, Lücking CH, Dahlhaus R, Timmer J (2006) Testing for directed influences among neural signals using partial directed coherence. J Neurosci Methods 152(1–2):210–219. https://doi.org/10.1016/j.jneumeth.2005.09.001
- Seth AK (2010) A MATLAB toolbox for Granger causal connectivity analysis. J Neurosci Methods 186(2):262–273. https://doi.org/10.1016/j.jneumeth.2009.11.020
- Shojaie A, Michailidis G (2010) Discovering graphical Granger causality using the truncating lasso penalty. Bioinformatics 26(18):i517–i523. https://doi.org/10.1093/bioinformatics/btq377
- Smith KW, Vartanian O, Goel V (2014) Dissociable neural systems underwrite logical reasoning in the context of induced emotions with positive and negative valence. Front Hum Neurosci 8:736. https://doi.org/10.3389/fnhum.2014.00736
- Tam GHF, Chunqi C, Yeung Sam H (2012) Application of Granger causality to gene regulatory network discovery. Paper presented at the 2012 IEEE 6th international conference on systems biology (ISB), 18–20 Aug 2012
- Tang W, Bressler SL, Sylvester CM, Shulman GL, Corbetta M (2012) Measuring Granger Causality between cortical regions from voxelwise fMRI BOLD signals with LASSO. PLoS Comput Biol. https://doi.org/10.1371/journal.pcbi.1002513
- Valdés-Sosa PA, Sánchez-Bornot JM, Lage-Castellanos A, Vega-Hernández M, Bosch-Bayard J, Melie-García L, Canales-Rodríguez E (2005) Estimating brain functional connectivity with sparse multivariate autoregression. Philos Trans R Soc B Biol Sci 360(1457):969–981. https://doi.org/10.1098/rstb.2005.1654
- van Elk M (2014) The left inferior parietal lobe represents stored hand-postures for object use and action prediction. Front Psychol 5:333. https://doi.org/10.3389/fpsyg.2014.00333
- Wang X, Mitchell T (2002) Detecting cognitive states using machine learning. Interim working paper
- Whitfield ML, Sherlock G, Saldanha A, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D (2000) Identification of genes periodically expressed in the human cell cycle and their expression in tumors. http://genome-www.stanford.edu/Human-CellCycle/HeLa/
- Whitfield ML, Sherlock G, Saldanha AJ, Murray JI, Ball CA, Alexander KE, Matese JC, Perou CM, Hurt MM, Brown PO, Botstein D (2002) Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol Biol Cell 13(6):1977–2000. https://doi.org/10.1091/mbc.02-02-0030
- Wiener N (1956) The theory of prediction. In: Beckenbach E (ed) Modern mathematics for engineers. McGraw-Hill, New York