Determination of optimal parameters of MAFFT program based on BAliBASE3.0 database

Background Multiple sequence alignment (MSA) is one of the most important research contents in bioinformatics. A number of MSA programs have emerged. The accuracy of MSA programs highly depends on the parameters setting, mainly including gap open penalties (GOP), gap extension penalties (GEP) and substitution matrix (SM). This research tries to obtain the optimal GOP, GEP and SM rather than MAFFT default parameters. Results The paper discusses the MAFFT program benchmarked on BAliBASE3.0 database, and the optimal parameters of MAFFT program are obtained, which are better than the default parameters of CLUSTALW and MAFFT program. Conclusions The optimal parameters can improve the results of multiple sequence alignment, which is feasible and efficient.

provides an empirical basis for selection of gap penalties and demonstrates how optimal gap penalties behave as a function of the target evolutionary distance of the substitution matrix. Madhusudhan et al. (2006) suggests the variable penalty formula according to the structure of sequence based on dynamic programming. But these formulae are not widely used. Gondro and Kinghorn (2007) think that the gap penalty parameters were determined by the experience.
How to determine that the optimum parameters have no theoretical framework at present. Different parameter combinations could result in different MSA. The majority of users use default parameters when applying these alignment tools, but the results could not be the best. In addition, there is no effective method to determine the optimal parameter directly, so it is difficult to get the local optimal solution through online tools. Pais et al. (2014) summarize the efficiency of MSA methods and tools, such as CLUSTALW, CLUSTAL OMEGA, DIALIGN-TX, MAFFT, MUSCLE, POA, PROBALIGN, PROB-CONS and T-Coffee. They obtain the following conclusion: T-Coffee and MAFFT are more efficiency to MSA (Pais et al. 2014). Nuin et al. (2006) compared nine commonly used MSA programs: CLUSTAL W, Dialign2.2, T-Coffee, POA, muscle, MAFFT, PROB-CONS, DIALIGN-T and KALIGN, and obtained the following conclusions: among the nine programs tested, the iterative approach available in MAFFT (L-INS-i) and PROB-CONS were consistently the most accurate, with MAFFT being the faster of the two. The above analyses reveal that MAFFT is the best choice for protein sequence alignment based on its overall alignment quality and processing speed. Ahola et al. (2006) introduce a statistical score that assesses the quality of a given multiple sequence alignment, and compare the AQ (alignment quality) scores of the seven alignment methods using the BAliBASE as a benchmarking database. According to these results, the MAFFT strategy L-INS-i outperforms the other methods. These conclusions are described in Web page (MAFFT Version 6). The speed and accuracy of MSA are most important evaluation criteria. With development of CPU and GPU technology, computer hardware can improve the MSA speed, so improving accuracy of MSA is the main factor influencing the MSA. The paper tries to obtain the optimal parameters combining GOP, GEP and SM based on MAFFT program.
The accuracy of MSA is usually assessed by scores. A number of score functions exist for alignment optimization, e.g. weighted sum-of-pairs, maximum likelihood, minimum entropy, star, and consensus (Gotoh 1999). The most popular score function is the weighted sum-of-pairs score (WSP). The best known standard measures for the evaluation of multiple sequence alignments are sum-of-pair score (SPS) and column score (CS) defined in (Thompson et al. 1999). Ahola et al. (2006) propose that statistical score assesses the quality of a given multiple sequence alignment. In the Ref. (Francisco et al. 2015), a set of novel regression approaches are proposed for the MSA evaluation by comparing several supervised learning and mathematical methodologies.

MAFFT program
MAFFT is a high speed multiple sequence alignment program for unix-like operating systems. The software is named after the acronym multiple alignment using fast Fourier transform after the major computational technique used by the method (Katoh et al. 2002). Due to the increasing necessity for MSA of distant homologs, Katoh et al. (2005)  MAFFT (MAFFT-7.220-WIN64 version) offers various multiple alignment strategies. They are classified into three types, (a) the progressive method, (b) the iterative refinement method with the WSP score, and (c) the iterative refinement method using both the WSP and consistency scores. In general, there is a tradeoff between speed and accuracy. The order of speed is a > b > c, whereas the order of accuracy is a < b < c. The following are the detailed procedures for the major options of MAFFT illustrated in Table 1.
References prove that MAFFT-L-INS-i and E-INS-i show the highest accuracy scores in currently available sequence alignment programs. However, the difference among MAFFT-L-INS-i, E-INS-i, TCoffee and ProbCons is quite small and not statistically significant in most cases (Ahola et al. 2006;MAFFT Version 6;Gotoh 1999). From Table 1, we can find that GOP is 1.53, GEP is 0.123 and substitution matrix is Blosum62 in MAFFT-L-INS-i and E-INS-I algorithm. So, our study tries to obtain the optimal GOP, GEP and substitution matrix rather than MAFFT default parameters.

Sum-of-pairs score (SPS)
To assess the performance of the parameters in this study, we use the SPS scores to estimate the quality of an alignment. The sum-of-pairs score (SPS) function used in (Thompson et al. 1999). Suppose there is a test alignment of N sequences consisting of M columns, and designate the ith column in the alignment by A i1 , A i2 , . . . , A iN . For each pair of residues A ij and A ik , we define p ijk such that p ijk = 1 if residues A ij and A ik are aligned with each other in the reference alignment, otherwise p ijk = 0. The score S i for the ith column is defined as The SPS for the alignment is given by where M r is the number of columns in the reference alignment and S ri is the score S i for the ith column in the reference alignment.

BAliBASE3 database
With the evolution of the sequence and structure databases resulting from high throughput technologies, the multiple alignments of large numbers of complex, multi-domain sequences have become a standard requirement. Sequence alignment benchmarks must not only evolve to accurately represent the requirements, but also to avoid over-fitting of the methods to a particular set of test cases. BAliBASE release 3.0 is designed to respond to these challenges (Thompson et al. 2005). The size of the alignments in the BAliBASE benchmark has been increased in release 3.0 to reflect the ever-growing sequence and 3D structure databases. Furthermore, because the reference sequences in the database are manual comparison, the results are more biological characteristics and they are common databases of test algorithm.
The BAliBASE 3.0 contains 218 reference alignments shown in Table 2, which are distributed into five reference sets. Reference set 1 is a set of equal-distant sequences, which are organized into two reference subsets, RV11 and RV12. RV11 contains sequences sharing >20 % identity and RV12 contains sequences sharing 20-40 % identity. Reference set 2 (RV20) contains families with >40 % identity and a significantly divergent orphan sequence that shares <20 % identity with the rest of the family members. Reference set 3 (RV30) contains families with >40 % identity that share <20 % identity between each two different sub-families. Reference set 4 (RV40) is a set of sequences with large N/C -terminal extensions. Reference set 5 (RV50) is a set of sequences with large internal insertions. (1)

Experiment setting
In the experiment, we use the database of BaliBASE 3.0 shown in Table 2.
To assess the performance of the formulas in this study, SPS (sum-of-pair score) is as objective function. The SPS is calculated such that the score increases with the number of sequences correctly aligned (Thompson et al. 1999). It is used to determine the extent to which the programs succeed in aligning some, if not all, of the sequences in an alignment. If the SPS is higher, the results of alignment are closed to reference alignment and even better than the reference alignment.
To obtain the optimal parameters combination of MAFFT program, we used batch processing through Perl programming (ActivePerl 5.16.2 version) language on Win-dows7 OS: the step of GOP is 0.1, the step of GEP is 0.03, the SM is BLOSUM30/BLO-SUM45/BLOSUM62/BLOSUM80/PAM100/PAM200 respectively. The batch processing script is following: For each alignment of BaliBASE 3.0, the number of alignment results is 692, because there are 692 kinds of parameters combination pattern. For each combination of the three parameters (SM, GOP, GEP), each alignment of reference can obtain SPS score. Figure 1 illustrates all the SPS results of six References. In each of these graphs, the SM is BLOSUM45/BLOSUM45/BLOSUM62/BLOSUM45/BLOSUM80/BLOSUM45 respectively. The SPS reaches the maximal value when the GOP, GEP and SM is certain value respectively.

The determination of optimal substitution matrix parameters
The determination of the optimal matrix is as follows: 1. Compute SP scores of each substitution matrix according to the parameters setting.
For each substitution matrix, GOP and GEP have 692 different combination modes, so the number of SP score is (692× the number of reference alignment). For example, RV11 has 38 reference alignments, so the number of SP scores is 38 × 692. 2. Compute the mean value of SP scores in each GOP/GEP combination mode, which is denoted by MEAN_SPS. For example, the number of SP scores of RV11 is 38 × 692, so the number of MEAN_SPS is 1 × 692. 3. Compute the maximum value of MEAN_SPS, which is denoted by MAX_MEAN_ SPS. For example, the number of MEAN_SPS of RV11 is 1 × 692, so the number of MAX_MEAN_SPS is 1. The greater the MAX_MEAN_SPS value, the higher the alignment accuracy in the GOP/GEP combination mode. 4. The MAX_MEAN_SPS values with each substitution matrix and each data set are listed in Table 3. 5. The maximum value of MAX_MEAN_SPS is corresponding to the optimal matrix.
Bold figures represent the best results.

The determination of optimal GOP/GEP parameters
The determination of the optimal GOP and GEP is as follows: the best MAX_MEAN_ SPS values of each data set can be obtained, and they are corresponding to the certain GOP and GEP values ( Table 3). As shown in Table 4, parameters obtained from our experiments are different from MAFFT default parameters. Fig. 1 The value of SPS Table 5 shows the mean of SPS values from different algorithm. The SPS value obtained from MAFFT default parameters is higher than the CLUSTALW (CLUSTALW-2.1-WIN) default parameters. The SPS value of MAFFT measure parameters is higher than MAFFT default parameters. Figure 2 illustrates the SPS value of MAFFT measure parameters, MAFFT default parameters and CLUSTALW default parameters. For set of sequences, the SPS value is the best in MAFFT measure parameters.

Discussion
In this paper, we use MAFFT tool to improve MSA. In order to get better SPS results, we abandon the default parameters in the process of MSA, and seek to find the optimal parameter combination. Experimental results show that the MSA results highly depend on the substitution matrix and gap penalties. Applying MAFFT tool with optimal parameter combination, we find that the accuracy of MSA result is higher than MAFFT and ClustalW with default parameters. This study allows to optimize the multiple sequence alignment results and provides a new idea for multiple sequence alignment. In the future work, firstly, we can use these proposed formulas and similar method to find optimal parameter combination of other MSA tools, such as CLUSTALW, MUSCLE   and so on. Secondly, the article mainly discusses the optimal parameters combination of MAFFT program based on BAliBASE3.0 database. Because the reference sequences in the BAliBASE are manual comparison, the alignment is more biological characteristic, and it is one of the common databases of test algorithm. In the future, we will discuss the other benchmarks to find the optimal parameters of MAFFT. Maybe, the default parameters are the best results for MAFFT program on other benchmarks. However parameter of MAFFT program is improved, the research is not ended.