Inference of biological networks using Bi-directional Random Forest Granger causality

The standard ordinary least squares based Granger causality is one of the widely used methods for detecting causal interactions between time series data. However, recent developments in technology limit the utilization of some existing implementations due to the availability of high dimensional data. In this paper, we are proposing a technique called Bi-directional Random Forest Granger causality. This technique uses the random forest regularization together with the idea of reusing the time series data by reversing the time stamp to extract more causal information. We have demonstrated the effectiveness of our proposed method by applying it to simulated data and then applied it to two real biological datasets, i.e., fMRI and HeLa cell. fMRI data was used to map brain network involved in deductive reasoning while HeLa cell dataset was used to map gene network involved in cancer.

Recently, two viable options were discussed by Furqan and Siyal (2015) and Cheng et al. (2014). Furqan and Siyal (2015) proposed to use Random Forest as a regularization technique for evaluating Granger causality whereas Cheng et al. (2014) proposed an LASSO-based method to reuse the time series data by reversing the time stamp of the time series. This concept of time reversal is also discussed and used by other researchers including Haufe et al. (2012), Hu et al. (2015) and others.
In this paper, we are proposing an improved method based on a combination of Random Forest Granger causality and re-utilization of time series data. We are calling it Bi-directional Random Forest Granger causality. This proposed method has increased precision and efficiency compared to existing LASSO-based method proposed by Cheng et al. (2014). In order to provide the proof of improvements of our method, we applied these methods to simulated data before mapping two different real biological networks i.e., gene and brain network.

Random Forest Granger causality
Random Forest is a decision tree based learning algorithm that was initially proposed by Breiman (2001) as a classification technique. However, later Liaw and Wiener (2002) suggested that Random Forest can also be used as regularization technique. This proposition of Liaw and Wiener (2002) to use Random Forest as a regularization technique was discussed and applied by Furqan and Siyal (2015) for evaluating coefficients of vector autoregressive model. They have performed Rigorous experimentations to prove its effectiveness. Its implementation follows the ray diagram shown in Fig. 1. Cheng et al. (2014) proposed Naïve Forward Backward LASSO Granger causality which can handle the shortage of data by reusing the time series data after reversing the time stamp of data. They called this method Naïve Forward Backward LASSO Granger causality. In explaining their proposed method, they use the assumption that the original time series validates all necessary conditions to perform Granger casualty analysis as studied in Bahadori and Liu (2013) and Eichler (2011) Furqan and Siyal (2015) time series. Once all the conditions are validated, they have proposed to use the pseudo code discussed below that uses LASSO-Based Granger causality analysis algorithm that is available at .

Bi-direction Random Forest Granger causality
Based on the findings of Naïve Forward Backward LASSO Granger Causality and Random Forest Granger causality, we are proposing to use Random Forest Granger causality together with the concept of re-utilization of time series data by reversing the data time stamps in order to maximize advantages in terms of precision, false discovery rate, recall, and F1-score. The pseudo code for evaluating Bi-directional Random Forest Granger causality is as follow:

Experimental details
We have implemented the basic Random Forest method on MATLAB with the help of R package (Breiman 2001). Later, we merged the implemented code with Granger causality analysis (GCCA) toolbox (Seth 2010) for evaluating Granger causality that uses BSMART toolbox (Cui et al. 2008). Whereas, we have used Akaike Information Criterion (AIC) as discussed by Akaike (1974) for VAR model order selection. After the implementation of proposed method, we have compared our method with Cheng et al. (2014), LASSO-based method. Cheng's method, using four measures: precision, false discovery rate, recall, and F1-score. These measures were evaluated against ground truth network shown in Fig. 5 using the following mathematical equations:

Simulated network
In order to remain unbiased in our comparative study, we utilized a simulated network dataset that has been previously used by researchers like Furqan and Siyal (2015), Schelter et al. (2006), and more. The simulated data set simulates five variable scenarios. Its ground truth network is shown in Fig. 2, and its network can be modeled using following mathematical equations: where ɛ 1 (t), ɛ 2 (t), ɛ 3 (t), ɛ 4 (t), and ɛ 5 (t) are independent and identically distributed white

Real fMRI dataset
In this paper, we have utilized StarPlus data set which was collected to study the working of the brain related to human deductive reasoning. This StarPlus dataset was collected by Keller et al. (2001) and can be freely accessed from Mitchell and Wang (2001).
In this dataset, they had studied 13 normal subjects using 40 trials on each subject. Each trial consists of two major egments. In one segment of the trial, the subject was presented with a visual stimulus in the form of Image for 4 s followed by a 4-s blank screen. Then, in next segment, another visual stimulus was presented for another 4-s in the form of a sentence wich may or may not be related to the image. This visual stimulus was followed by 4-s blank screen. After both stimuli, the subject was asked to decide the presence of a relation between image and sentence. Moreover, each subject was allowed to rest for 15-s before the start of next trial.

Precision =
True positive edges True positive edges + False positive edges .

Recall =
True positive edges True positive edges + False negative edges F 1-Score = 2 × True positive edges 2 × True positive edges + False positive edges + False negative edges In order to introduce randomness in the experiment, 40 trials were divided into two parts of 20 trials each. In 20 trials, subjects were shown image first and then the sentence whereas for remaining 20 trials, they reversed the order of image and sentence. Further information related to experiment settings, sentences, and picture, are explicitly not discussed here and can be referred to Keller et al. (2001).
While performing these trials, T2-weighted fMRI images were collected using 3T Signa scanner at an interval of 500 ms, and with TE = 18 ms and flip angle of 50°. These settings yield images that have approximately 5000 voxels per subjects in 8 oblique axial slices in two different non-contiguous four-slice volumes. The first volume set captures prefrontal areas and superior parietal regions, while, another volume set covers posterior temporal, inferior frontal and occipital areas.
After acquiring T2-weighted fMRI images for each subject, images were pre-processed using FIASCO program (Eddy et al. 1999). This pre-processing helps in reducing the artifacts that arise during image acquisition process due to signal drift, head motion, and others.
After pre-processing of images, 25 anatomical regions of interest were selected that includes left dorsolateral prefrontal cortex (LDLPFC) and right dorsolateral prefrontal cortex (RDLPFC), calcarine sulcus (

Real Hela dataset
The HeLa human cancer cell line dataset used in our study was compiled by Whitfield et al. (2002) by performing series of experiments using DNA microarray technique. These experimental results are freely available (Whitfield et al. 2000).
As the observational points are not homogeneously sampled, the data was first interpolated by using cubic smoothing splines (Green and Silverman 1994) as recommended by Hlavácková-Schindler and Bouzari (2013) and Ogutu et al. (2012) before using in our study.

Simulated dataset
Based on the results of simulated studies shown in Fig. 3, we found that LASSO-based Forward Backward Granger causality on average yields approximately 25 % precision, 75 % false discovery rate, 67 % recall and 37 % F1 score. Whereas using the same set of data, our proposed method yields 28 % precision, 70 % false discovery rate, 87 % recall, and 40 % F1 score.
These findings suggest that our proposed method has outperformed the existing method in all measures, with a significant improvement in recall. Our proposed method shows 20 % improvement in recall compared to existing LASSO-based method.
During this study, we have observed that the proposed method is less prone to outliers compared to the LASSO-based method. This ability of insensitivity of outlier is achieved due to inherent advantage of regularized tree methods. We have also observed that the proposed method is highly dependent on selecting the right number of features and number of trees. In this study, we have used the setting of 10 features and 500 trees. However, further studies are required to devise some ideal relationship between both number features and number of trees.

HeLa cell dataset
Following the findings of simulated data set studies, we have applied the proposed method to real HeLa cell dataset. The resultant gene network that is involved in cancers is shown in Fig. 4 where the green arrow shows a uni-directional link between two nodes.
As there is no way to verify the resultant network, we have used Biological General Repository for Interaction Datasets BIOGRID database (Chatr-aryamontri et al. 2014) to look for genes interactions that were already reported. The BIOGRID is a public database that archives and disseminates genetic and protein interaction data from model organisms and humans. Given the above network map, we were able of find 6 out 16 interactions that yield around 37 % precision and 63 % false discovery rate. These statistics are in line with the results of the simulated dataset where BRFGC produces 28 % precision and 63 % false discovery rate.

StarPlus fMRI dataset
For discussing results of real StarPlus dataset shown in Fig. 5, let's first overview the functions of the pre-selected regions studied in this paper. The first region under Fig. 3 Results of five variable simulated datasets consideration is calcarine sulcus (CALC). CALC consist of calcarine cortex that maps the point-to-point representation from the retina to the cortex as discussed by Meadows (2011). The next region under consideration is left intraparietal sulcus (LIPS). This region of the brain is associated with the processing of light contrast elements seen by eyes without analyzing the relationship between those elements (Smith et al. 2014).
Other regions of interest are left opercularis (LOPER) and left triangularis (LTRIA) which are also called Brodmann Area 44 and Brodmann Area 45 (Nishitani et al. 2005), and together they constitute Broca's region. The Broca's region is associated with the processing of words, pseudo-words, and non-words during different parts of reading and their interaction as discussed in Heim et al. (2005).
Left dorsolateral prefrontal cortex (LDLPFC) is associated with manipulation of auditory and spatial information in working memory (Barbey et al. 2013) whereas left inferior parietal lobule (LIPL) is necessary for comparison (Chochon et al. 1999), memory related to motor processes (e.g., movement of hand), mechanical and technical reasoning associated with the use of objects (van Elk 2014) and more. Whereas, the remaining  region under consideration is left Temporal Lobe (LT) which is mainly associated with the primary organization of sensory inputs (Read 1981).
Based on the functional knowledge of regions of interests, our resulted network in Fig. 3 shows that the connection between CALC with LIPS seems to transfer visual information (picture or sentence displayed on screen), the bi-direction link between LOPER and LIPS signifies the feed-backed link for recognizing the objects and words. The connection between Brodmann area 44 and 45 shows the movement of information from area 44 to area 45 for further processing of information.
The other links such as the links from Brodmann area 45 represents the transfer of information to and from LDLPFC, LIPL and LT for further processing to evaluate the meaning, relation and deduction of the task performed. The remaining bidirectional link between LIPL ↔ LDLPFC and LT ↔ LDLPFC exchange information related to the movement to finger for registering the answer to the task.

Conclusion
In this paper, we have proposed an improved method called Bi-directional Random Forest Granger causality. It takes the advantage of Random Forest regularization to handle dimensionality issues and at the same time using reversing time stamping property it limits the data shortage problem. Using simulated dataset we have shown the effectiveness of our proposed method and later, we have applied the proposed approach to real StarPlus fMRI data set to study the network involved in human deductive reasoning and to real HeLa cell dataset to map gene network that is involved in cancer. In future, this method can be used in other areas such as econometrics, and social networking.