Datasets
The datasets used to train and test our system are built using QuickGO, a web-based tool designed for browsing the GO database. We started from the GO release of 06/06/2011 and built three datasets, one for each kingdom considered. We selected proteins annotated with GO terms related to the subcellular localizations we consider.
Table 3 shows, for each class, the GO term or terms used to filter sequences, and the number of proteins obtained per class for each taxonomic group. In constructing this set, called full_dataset, we left out proteins annotated as “inferred from electronic annotation”, “non-traceable author statement”, “no data” and “inferred by curator”, in order to exclude sequences of little-known origin or of uncertain localization.
From the full_dataset we excluded sequences shorter than 30 amino acids. We then reduced redundancy, separately for each taxonomic group, by performing an all-against-all BLAST (Altschul et al. 1997) search with an e-value of $10^{-3}$ and excluding sequences with a sequence identity of 30% or greater to any other sequence in the group that was retained. Table 4 reports the final numbers of sequences in the sets after redundancy reduction (reduced_dataset). We use reduced_dataset for training/testing purposes.
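The filtering and redundancy-reduction steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline actually used: the `Protein` record is hypothetical, the four excluded annotation types are mapped to the standard GO evidence codes IEA, NAS, ND and IC, the `identity` dictionary stands in for parsed all-against-all BLAST output, and the greedy order in which sequences are retained is one possible choice that the text does not specify.

```python
from dataclasses import dataclass

# GO evidence codes for the four excluded annotation types.
EXCLUDED_EVIDENCE = {"IEA", "NAS", "ND", "IC"}
MIN_LENGTH = 30      # sequences shorter than this are dropped
MAX_IDENTITY = 30.0  # percent identity at or above which a sequence is redundant


@dataclass(frozen=True)
class Protein:  # hypothetical record type, for illustration only
    accession: str
    sequence: str
    evidence: str
    location: str


def build_full_dataset(proteins):
    """Keep proteins whose annotation is not of an excluded evidence type."""
    return [p for p in proteins if p.evidence not in EXCLUDED_EVIDENCE]


def reduce_redundancy(proteins, identity):
    """Greedy filter: keep a protein only if it is long enough and shares
    less than MAX_IDENTITY with every protein already retained.

    `identity` maps frozenset({acc_a, acc_b}) to percent identity for pairs
    with a significant BLAST hit; missing pairs are treated as unrelated.
    """
    retained = []
    for p in proteins:
        if len(p.sequence) < MIN_LENGTH:
            continue
        redundant = any(
            identity.get(frozenset((p.accession, q.accession)), 0.0) >= MAX_IDENTITY
            for q in retained
        )
        if not redundant:
            retained.append(p)
    return retained
```

In this sketch the result depends on the order of the input list, since a sequence is compared only against those already retained.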
Input coding
As in (Mooney et al. 2011), we enrich the description of protein sequences using residue frequency profiles from multiple sequence alignments (MSA) of homologous sequences. This is common practice in many predictive systems of structural and functional properties of proteins, as MSAs provide information about the evolution of a protein (Rost and Sander 1993). We built a “profile” for each protein in the following way: the k-th residue in a protein is encoded as a sequence of 20 real numbers in which each number is the frequency of one of the 20 amino acids in the k-th column of the MSA, gaps excluded; an additional 21st real number is used to represent the frequency of gaps in the k-th column. Sequence alignments are extracted from the February 2010 release of UniRef90 (Suzek et al. 2007), containing 6,464,895 sequences. The alignments are generated by three runs of PSI-BLAST (Altschul et al. 1997) with parameters b = 3000 (maximum number of hits) and e = $10^{-3}$ (expectation of a random hit) (Mooney et al. 2011). We refer to this first encoding as MSA_dataset.
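The profile encoding can be illustrated with a short sketch. It assumes the alignment is already available as a list of equal-length strings whose columns correspond to positions of the query protein, uses `-` as the gap symbol, and ignores any symbol outside the 20 standard amino acids; none of these details is stated in the text.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
GAP = "-"  # assumed gap symbol


def msa_profile(alignment):
    """Return an (L, 21) array for an alignment given as equal-length strings.

    Columns 0-19: amino-acid frequencies among the non-gap residues of the
    alignment column. Column 20: fraction of sequences with a gap there.
    """
    n_seqs = len(alignment)
    length = len(alignment[0])
    profile = np.zeros((length, 21))
    for k in range(length):
        column = [seq[k] for seq in alignment]
        residues = [c for c in column if c in AA_INDEX]
        for c in residues:
            profile[k, AA_INDEX[c]] += 1.0
        if residues:
            profile[k, :20] /= len(residues)  # frequencies with gaps excluded
        profile[k, 20] = column.count(GAP) / n_seqs
    return profile
```

For the toy alignment `["AC", "A-", "GC", "A-"]`, the first position gets frequency 0.75 for A and 0.25 for G with no gaps, and the second gets frequency 1.0 for C with a gap frequency of 0.5.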
In a second step, we encode proteins in our dataset adding three inputs per residue describing the secondary structure that the residue is predicted to belong to, according to the Porter server (Pollastri and McLysaght 2005; Pollastri et al. 2007). We call this encoding MSA+SS_dataset.
We train and test two versions of the sequence-based architecture using, respectively, the MSA_dataset and the MSA+SS_dataset, which contain the same proteins, but have different input encoding.
In another set of experiments we add homology information from proteins of known subcellular localization. Similarly to (Pollastri et al. 2007; Mooney 2009; Walsh et al. 2009a; Walsh et al. 2009b), homology is used as a further input to the predictor, alongside a measure of its estimated quality. The predictor itself determines how to weigh the information coming directly from the input sequence and MSA, and how to weigh the annotations coming from homologous proteins into the final prediction. Homology information itself is extracted by performing a BLAST (Altschul et al. 1997) search for each sequence in reduced_dataset against the full_dataset with an e-value of $10^{-3}$. For each sequence i in reduced_dataset we select the $K_i$ sequences in full_dataset having an Identity Score higher than 30% (but smaller than 95%, to exclude the protein itself) and we calculate a vector $T_i$ that is N+1 terms long, where N is the number of classes predicted (five in the Fungi and Animal cases, six in the Plant case). Its first N entries are:
$$T_i^{(1 \ldots N)} = \frac{\sum_{j=1}^{K_i} v_j \, I_{ij}^{3}}{\sum_{j=1}^{K_i} I_{ij}^{3}} \qquad (1)$$
where $v_j$ is a vector of N units in which the k-th entry is set to one if the j-th protein belongs to the k-th class, to zero otherwise; $I_{ij}$ is the identity between sequence i in the reduced_dataset and sequence j among the $K_i$ in full_dataset that are homologous to sequence i. Taking the cube of the identity scores reduces the contribution of low-similarity proteins when high-similarity sequences are available. The N+1-th element in the vector $T_i$ measures the significance of the information stored in the vector and is computed as the average identity, weighed by the cubed identity. That is:
$$T_i^{(N+1)} = \frac{\sum_{j=1}^{K_i} I_{ij}^{3} \, I_{ij}}{\sum_{j=1}^{K_i} I_{ij}^{3}} \qquad (2)$$
In other words: each protein in reduced_dataset is aligned against full_dataset; the significant hits (templates) are retrieved, with their (known) subcellular locations; a profile of subcellular locations is compiled from these templates, where templates that are more closely related to the protein are weighed more than more remote ones, according to the score in Equation 2; this subcellular location profile is provided as an extra input to the network. Notice that in the case in which all templates from full_dataset are in the same subcellular location class, the vector $T_i$ has only two non-zero components: the entry corresponding to the class (which in this case is 1), and the last entry, which measures the average sequence identity of the templates.
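The construction of the homology vector can be sketched from the verbal description: class indicators of the templates are averaged with cubed-identity weights, and the last entry is the cubed-identity-weighted average identity. The function below is an illustration only; it assumes identities are expressed as fractions in [0, 1] and that a query with no templates receives an all-zero vector, neither of which is stated in the text.

```python
import numpy as np


def homology_vector(identities, template_classes, n_classes):
    """Return the (N+1)-entry homology vector for one query protein.

    identities: identity of each template to the query, as fractions in [0, 1].
    template_classes: class index (0..N-1) of each template.
    """
    ident = np.asarray(identities, dtype=float)
    t = np.zeros(n_classes + 1)
    if ident.size == 0:
        return t  # assumption: no templates gives an all-zero vector
    weights = ident ** 3
    one_hot = np.eye(n_classes)[np.asarray(template_classes)]
    # First N entries: cubed-identity-weighted class profile of the templates.
    t[:n_classes] = weights @ one_hot / weights.sum()
    # Last entry: average identity, weighted by the cubed identity.
    t[n_classes] = (weights * ident).sum() / weights.sum()
    return t
```

With two templates of identity 0.9 and 0.5 that both belong to class 2, the class profile is exactly one-hot on class 2 and only the last entry carries the identity information, matching the two-non-zero-components case noted above.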
Thus, in this third set of experiments (MSA+HOM_dataset), a vector containing homology information is associated with each sequence and its MSA. Again, while the proteins are the same as in the MSA_dataset and the MSA+SS_dataset, the information provided to the predictor is different.
Predictive architecture
In this work we test two different predictive systems based on the model proposed in (Mooney et al. 2011). This model is an N-to-1 neural network, or N1-NN, composed of a sequence of two two-layered feed-forward neural networks.
The first architecture we test is essentially the same as in (Mooney et al. 2011), in which different numbers of inputs per residue are fed to the system. In N1-NN a lower level network $N^{(h)}$ takes as input a window or motif of a fixed number of residues. Each residue is encoded by 21 (MSA_dataset case) or 24 (MSA+SS_dataset) real numbers. The lower level network is replicated for each of the (overlapping) motifs in the sequence and produces a vector of real numbers as output. A feature vector $f$ for the whole sequence is calculated as the sum of the output vectors of all the lower network replicas (Mooney et al. 2011). $f$ contains a sequence of descriptors automatically learned in order to minimize the overall error, that is, to obtain an optimal final prediction. Thus $f$ can be thought of as a property-driven adaptive compression of the sequence into a fixed number of descriptors. The vector $f$ is obtained as:
$$f = k \sum_{i=1}^{L} N^{(h)}(r_{i-c}, \ldots, r_{i+c}) \qquad (3)$$
where $r_i$ is the sequence of real numbers (21 or 24) associated with residue i in a sequence of length L, k is a normalization constant (set to 0.01 in all our tests) and c is a constant that determines the length of the window of residues (2c+1) that is fed to the network. We use c=20 in all the experiments in this article, corresponding to motifs of 41 residues. We obtained a value of c=20 from preliminary tests, in which it proved (marginally) better than 10 and 15, but we also considered that the average size of motifs that sort a protein to a subcellular location is generally smaller than (but close to) 40 residues. For instance, the average length of signal peptides in eukaryotes is approximately 20 residues (Bendtsen et al. 2004), and 35-40 is an upper size bound for most known signals and NLS (Bendtsen et al. 2004; Cokol et al. 2000). We set k=0.01 because the number of replicas of $N^{(h)}$ is typically between several tens and a few hundreds. Different choices for k are possible in principle, including making it a learnable parameter, although we have not explored this option.
The feature vector $f$ is fed to a second level network $N^{(o)}$ that performs the final prediction as:
$$o = N^{(o)}(f) \qquad (4)$$
A standard N-to-1 NN is depicted in Figure 4.
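The forward pass of the sequence-based model can be sketched as below. This is an illustration under several assumptions rather than the implementation used here: windows that extend past the termini are zero-padded, the lower network output is squashed with a hyperbolic tangent before summation, weights are initialised at random by the hypothetical helper `init_net`, and the hidden-layer size of the upper network in the usage example is an arbitrary placeholder.

```python
import numpy as np


def init_net(rng, n_in, n_hidden, n_out, scale=0.1):
    """Random weights for a two-layer perceptron (hypothetical initialisation)."""
    return (
        rng.normal(0.0, scale, (n_hidden, n_in)), np.zeros(n_hidden),
        rng.normal(0.0, scale, (n_out, n_hidden)), np.zeros(n_out),
    )


def two_layer(x, w1, b1, w2, b2):
    """Two-layer perceptron with tanh hidden units; returns output
    activations before any output squashing."""
    return w2 @ np.tanh(w1 @ x + b1) + b2


def n_to_1_forward(residues, hidden_net, output_net, c=20, k=0.01):
    """Map an (L, d) array of residue encodings to (f, class probabilities)."""
    n_res, n_inputs = residues.shape
    pad = np.zeros((c, n_inputs))  # assumption: zero padding at the termini
    padded = np.vstack([pad, residues, pad])
    f = None
    for i in range(n_res):
        # Window of 2c+1 residues centred on residue i, flattened.
        window = padded[i:i + 2 * c + 1].ravel()
        out = np.tanh(two_layer(window, *hidden_net))
        f = out if f is None else f + out
    f = k * f  # fixed-size feature vector, whatever the sequence length
    logits = two_layer(f, *output_net)
    e = np.exp(logits - logits.max())
    return f, e / e.sum()  # softmax over the classes


rng = np.random.default_rng(0)
hidden_net = init_net(rng, 41 * 21, 14, 12)  # 41-residue window, 21 inputs each
output_net = init_net(rng, 12, 10, 5)        # 10 hidden units: placeholder value
f, probs = n_to_1_forward(rng.random((120, 21)), hidden_net, output_net)
```

The same two weight sets handle sequences of any length, because the sum over windows always yields a feature vector of the same size.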
In the second (template-based) architecture we add a second lower level neural network, that takes as input the additional vector T included in the MSA+HOM_dataset. So the feature vector f is now calculated as
$$f = \left( k \sum_{i=1}^{L} N^{(h_1)}(r_{i-c}, \ldots, r_{i+c}), \; N^{(h_2)}(T) \right) \qquad (5)$$
in which $N^{(h_1)}$ and $N^{(h_2)}$ are two-layer perceptrons as in the standard N1-NN. Hence $f$ is now composed of two parts: one that contains information relating to the sequence, MSA, and secondary structure when present; a second part that contains information about annotations extracted from homologous proteins. Both parts are automatically learned, and the compound vector is mapped into the property of interest through a two-layer perceptron $N^{(o)}$ as in the standard N1-NN.
The overall number of free parameters in the second architecture can be calculated as:
$$N_p = (I+1)\,N_H^{(h_1)} + (N_H^{(h_1)}+1)\,F_1 + (T+1)\,N_H^{(h_2)} + (N_H^{(h_2)}+1)\,F_2 + (F_1+F_2+1)\,N_H^{(o)} + (N_H^{(o)}+1)\,O \qquad (6)$$
in which I is the number of inputs for the network $N^{(h_1)}$, depending on the input coding and on the context window chosen, $N_H^{(h_1)}$ is the number of hidden units in the network $N^{(h_1)}$, $F_1$ is the number of descriptors in the first part of the feature vector, T is the number of inputs in the homology vector, $N_H^{(h_2)}$ is the number of hidden units in the network $N^{(h_2)}$, $F_2$ is the number of descriptors in the second part of the feature vector, $N_H^{(o)}$ is the number of hidden units in the network $N^{(o)}$ and O is the number of classes being predicted.
Hence the parameters that control the size of the model are $N_H^{(h_1)}$, $F_1$, $N_H^{(h_2)}$, $F_2$ and $N_H^{(o)}$.
A modified N-to-1 NN is depicted in Figure 5.
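The parameter count of the template-based architecture can be computed with a small helper. The sketch assumes that every two-layer perceptron is fully connected and has one bias per hidden and output unit; the function and argument names are illustrative.

```python
def input_size(c, inputs_per_residue):
    """Number of inputs to the lower network for a window of 2c+1 residues."""
    return (2 * c + 1) * inputs_per_residue


def mlp_params(n_in, n_hidden, n_out):
    """Weights plus biases of a fully connected two-layer perceptron."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out


def template_n1nn_params(n_inputs, nh1, f1, n_template_inputs, nh2, f2, nho, n_classes):
    """Total free parameters: sequence network, template network, output network."""
    return (
        mlp_params(n_inputs, nh1, f1)
        + mlp_params(n_template_inputs, nh2, f2)
        + mlp_params(f1 + f2, nho, n_classes)
    )


# With c = 20 the lower network sees 41 residues:
# 41 * 21 = 861 inputs for the MSA encoding, 41 * 24 = 984 with secondary structure.
msa_inputs = input_size(20, 21)
msa_ss_inputs = input_size(20, 24)
```

Because the first layer of the sequence network scales with the window size, it dominates the total, which is why the hidden-layer sizes are the main lever on model size.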
Training
We perform tests on three kingdoms (Fungi, Animal and Plant) and with three different architectures (MSA_dataset, MSA+SS_dataset and MSA+HOM_dataset), or nine tests in total. Each test is run in 10-fold cross validation. For each fold a different tenth of the overall dataset is reserved for testing, while the remaining nine tenths are used for learning the parameters of the N1-NN. In particular these nine tenths are further split into a proper training part (eight tenths of the total), and a validation set (one tenth of the total) which is used to monitor the training process but not for learning the N1-NN parameters by gradient descent. For each fold we repeat the training 3 times, with 3 different training/validation splits. Thus for each of the 9 kingdom/architecture combinations we have 3 repetitions x 10 folds, or 30 separate N1-NN training runs in total. In each training set, sequences are replicated as necessary in order to obtain classes of roughly the same size.
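The fold structure and the class balancing can be sketched as follows. The text does not say how the validation tenth is chosen for each repetition or how replication is carried out, so both are assumptions here: the validation fold is rotated with the repetition index, and smaller classes are replicated until they match the largest one.

```python
import random
from collections import defaultdict


def cv_splits(n_items, n_folds=10, repetition=0, seed=0):
    """Yield (train, validation, test) index lists, one triple per fold.

    Fold f is the test set. The validation fold is chosen by rotation and
    changes with `repetition` (an assumed scheme); the rest is training data.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for f in range(n_folds):
        v = (f + 1 + repetition) % n_folds
        train = [i for g, fold in enumerate(folds) if g not in (f, v) for i in fold]
        yield train, folds[v], folds[f]


def balance_by_replication(train, labels, seed=0):
    """Replicate training examples so every class reaches the largest class size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in train:
        by_class[labels[i]].append(i)
    target = max(len(members) for members in by_class.values())
    balanced = []
    for members in by_class.values():
        copies, extra = divmod(target, len(members))
        balanced.extend(members * copies + rng.sample(members, extra))
    return balanced
```

Running `cv_splits` with `repetition` set to 0, 1 and 2 gives the three training/validation splits per fold, for 30 training runs per kingdom/architecture combination. Only the training part is balanced; validation and test sets keep their natural class proportions.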
Training is performed by gradient descent on the error, which is modelled as the relative entropy between the target class and the output of the network. The overall output of the network (the output layer of $N^{(o)}(\cdot)$) is implemented as a softmax function, while all internal squashing functions in the networks in both models are implemented as hyperbolic tangents. The examples are shuffled between epochs. We use a momentum term of 0.9, which speeds up overall training times by a factor of 3-5 compared to no momentum. The learning rate is kept fixed at 0.2 throughout training.
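The error function and the update rule can be sketched as below. The classical momentum form used here is an assumption, since the text gives only the momentum term (0.9) and the learning rate (0.2); the toy loop updates the inputs of the softmax directly rather than the weights of a full network.

```python
import numpy as np


def softmax(z):
    """Softmax over a vector of output activations."""
    e = np.exp(z - np.max(z))
    return e / e.sum()


def relative_entropy(target, output, eps=1e-12):
    """Relative entropy between the target distribution and the network output.

    For a one-hot target this equals the cross-entropy, -log(output[class]).
    """
    target = np.asarray(target, dtype=float)
    mask = target > 0
    return float(np.sum(target[mask] * np.log(target[mask] / (output[mask] + eps))))


def momentum_step(weights, gradient, velocity, learning_rate=0.2, momentum=0.9):
    """One gradient-descent update with classical momentum (assumed form)."""
    velocity = momentum * velocity - learning_rate * gradient
    return weights + velocity, velocity


# Toy example: five classes, target class 2. For softmax outputs with this
# error, the gradient with respect to the softmax inputs is (output - target).
z = np.zeros(5)
v = np.zeros(5)
target = np.eye(5)[2]
initial_error = relative_entropy(target, softmax(z))
for _ in range(20):
    z, v = momentum_step(z, softmax(z) - target, v)
final_error = relative_entropy(target, softmax(z))
```

In a full network the same `momentum_step` would be applied to every weight matrix, with gradients obtained by backpropagation through the upper network and all replicas of the lower one.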
Parameters for both the first and the second architecture were experimentally determined in preliminary tests. For the sequence-based N-to-1 NN architecture the settings include $N_H = 14$ and $F = 12$. For the template-based architecture that includes homology they include $F_1 = 10$ and $F_2 = 6$. These values result in approximately 12,500 free parameters for the sequence-based N-to-1 NN, and just over 10,000 for the template-based one. Each training is carried out for up to 10 days on a single state-of-the-art core. Performance on the validation set is measured every ten training epochs, and the ten best performing models on validation are stored. For each fold we ensemble average the three best models saved (one for each repetition) and evaluate them on the corresponding test set. The final result for the 10-fold cross-validation is the average of the results over the ten test sets.
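Checkpoint selection and ensemble averaging can be sketched as follows. The interface is hypothetical: a model is any callable that maps an input to a vector of class probabilities, and a higher validation score is taken to be better.

```python
import numpy as np


def update_best(best, score, epoch, model, keep=10):
    """Return the `keep` highest-scoring (score, epoch, model) checkpoints."""
    best = best + [(score, epoch, model)]
    best.sort(key=lambda item: item[0], reverse=True)
    return best[:keep]


def ensemble_predict(models, x):
    """Average the class probabilities of several models and pick the top class."""
    mean = np.mean([m(x) for m in models], axis=0)
    return int(np.argmax(mean)), mean


# Validation scores measured every ten epochs in one (made-up) training run.
scores = [0.50, 0.60, 0.55, 0.70, 0.65, 0.80, 0.75, 0.72,
          0.78, 0.81, 0.79, 0.60, 0.83, 0.82, 0.40]
best = []
for epoch, score in zip(range(10, 160, 10), scores):
    best = update_best(best, score, epoch, model=None)

# One best model per repetition, stood in for by fixed outputs.
models = [
    lambda x: np.array([0.6, 0.4]),
    lambda x: np.array([0.2, 0.8]),
    lambda x: np.array([0.3, 0.7]),
]
predicted_class, mean_probs = ensemble_predict(models, x=None)
```

Averaging probabilities rather than taking a majority vote lets a confident model outweigh two uncertain ones; whether the original ensemble averages probabilities or some other output is not specified in the text.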