The FASTML Server - Server for computing Maximum Likelihood ancestral sequence reconstruction - overview

FastML Overview

Introduction
Methodology

Character reconstruction
Reconstruction of indels
Running time

Input
Output

Graphical representation of the entire reconstructed ancestral sequences
Output files

Marginal reconstruction
Joint reconstruction
Phylogeny outputs

Introduction

The FastML server is a bioinformatics tool for the reconstruction of ancestral sequences based on the phylogenetic relations between homologous sequences. The server runs several algorithms that reconstruct the ancestral sequences, with emphasis on an accurate reconstruction of both indels and characters. For character reconstruction, the previously described FastML algorithms [1, 2] are used to efficiently infer the most likely ancestral sequences for each internal node of the tree. Both the joint and the marginal reconstructions are provided. For indel reconstruction, the sequences are first coded according to the indel events detected within the multiple sequence alignment (MSA) [3], and then a state-of-the-art likelihood model is used to reconstruct the ancestral indel states [4, 5]. The server outputs the most probable sequences, together with posterior probabilities for each character and indel at each sequence position for each internal node of the tree. FastML is generic and is applicable to any type of molecular sequence (nucleotide, protein, or codon sequences).

Methodology

Given a multiple sequence alignment (MSA) and optionally a phylogenetic tree, the ancestral reconstruction process can be divided into two parts:

1) Character reconstruction - two methods are implemented: the joint and the marginal. In the joint reconstruction, one finds the most likely set of sequences for all the internal nodes simultaneously. In the marginal reconstruction, one infers the most likely sequence at a specific internal node. The results of these two estimation methods are not necessarily the same [1, 2]. Both methods are based on maximum likelihood (ML) algorithms and on an empirical Bayesian approach that takes into account the rate variation among sites of the MSA.

2) Reconstruction of indels - a two-step approach is used in order to take into account the dependency among sites:

a. Indels coding.
In this step, the input MSA is coded into a binary indel matrix. The server uses an efficient implementation of the simple indel coding [3], according to which each indel with different start and/or end positions is considered to be a separate character. All indels in the data are coded as binary (presence/absence) characters, each of which may represent a gap of multiple sites.

b. Indels reconstruction
In this step, the evolutionary analysis of indels is performed. Given the binary presence/absence matrix of indels in the extant sequences, the algorithm reconstructs the ancestral state of each indel in each internal node of the tree. It is assumed that the observed pattern of indels is the result of insertion and deletion dynamics along a phylogenetic tree. Our state-of-the-art inference methodology is based on a likelihood-based mixture model that allows variable rates of insertions and deletions among indel sites to reliably capture the underlying evolutionary processes [4, 5]. In this approach, the posterior probability of indel presence (gap) is computed for each indel site and each node. Alternatively, users can choose to reconstruct the ancestral states of the indels using the maximum parsimony approach. The parsimonious ancestral reconstruction is based on the Sankoff algorithm [6]. In this approach, the parsimonious assignment of indel presence (gap) is computed for each indel and each internal node.
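The two steps above can be illustrated with a minimal Python sketch. It is not the server's implementation: the function names, the toy alignment and the toy tree are hypothetical, and coding a gap that is covered by a longer gap as missing data ('?') follows our reading of the simple indel coding scheme [3] rather than anything stated about FastML itself. Only the parsimony option (a Sankoff-style dynamic program with unit costs) is shown; the likelihood-based mixture model is not sketched here.

```python
from itertools import groupby

INF = float("inf")

def find_gaps(seq):
    """Return (start, end) pairs, end exclusive, for every run of '-' in seq."""
    gaps, pos = [], 0
    for is_gap, run in groupby(seq, key=lambda c: c == "-"):
        n = len(list(run))
        if is_gap:
            gaps.append((pos, pos + n))
        pos += n
    return gaps

def code_indels(msa):
    """Code each distinct (start, end) gap as one character per sequence.

    '1' = the sequence has exactly this gap, '0' = it does not,
    '?' = the region lies inside a longer gap (assumed to be missing data).
    """
    gaps_per_seq = {name: find_gaps(seq) for name, seq in msa.items()}
    indels = sorted({g for gaps in gaps_per_seq.values() for g in gaps})
    matrix = {}
    for name, gaps in gaps_per_seq.items():
        row = []
        for start, end in indels:
            if (start, end) in gaps:
                row.append("1")
            elif any(s <= start and end <= e for s, e in gaps):
                row.append("?")
            else:
                row.append("0")
        matrix[name] = "".join(row)
    return indels, matrix

def sankoff(children, root, leaf_states, cost=((0, 1), (1, 0))):
    """Return ({internal node: state}, minimal cost) for one binary character."""
    states = range(len(cost))
    score = {}  # score[node][s] = minimal cost of the subtree if node is in state s

    def up(node):
        if node not in children:  # leaf: only the observed state is allowed
            obs = leaf_states[node]
            score[node] = [0 if obs in ("?", str(s)) else INF for s in states]
            return
        for child in children[node]:
            up(child)
        score[node] = [
            sum(min(cost[s][t] + score[c][t] for t in states) for c in children[node])
            for s in states
        ]

    assignment = {}

    def down(node, parent_state):
        if node not in children:
            return
        if parent_state is None:
            best = min(states, key=lambda s: score[node][s])
        else:
            best = min(states, key=lambda t: cost[parent_state][t] + score[node][t])
        assignment[node] = best
        for child in children[node]:
            down(child, best)

    up(root)
    down(root, None)
    return assignment, min(score[root])

# Hypothetical toy data: four aligned sequences and the tree ((A,B)N2,(C,D)N3)N1.
msa = {"A": "ACG--TAC", "B": "ACG--TAC", "C": "ACGTTTAC", "D": "AC----AC"}
children = {"N1": ["N2", "N3"], "N2": ["A", "B"], "N3": ["C", "D"]}

indels, matrix = code_indels(msa)
results = {}
for col, indel in enumerate(indels):
    states = {name: row[col] for name, row in matrix.items()}
    results[indel] = sankoff(children, "N1", states)
    print(indel, results[indel])
```

In this sketch, state 1 means that the gap is present at a node. Ties between equally parsimonious assignments are broken in favour of state 0, which is an arbitrary choice made for the example.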

Running Time

Running time depends on the number of sequences and their length, the evolutionary model, the proportion of gaps, and the reconstruction algorithm. Codon models are the most time consuming and nucleotide models the least. Additionally, accounting for among site rate variation is significantly more complex for the joint reconstruction than for the marginal reconstruction [2]. Additional information on the effect of each parameter on the running time is given below (in the 'Advanced Options and Details' section). To help users estimate the running time for their datasets, we simulated sequences using INDELible and computed their ancestral sequences using FastML.

The parameters used for the simulations were: codon model (M0 with kappa=2.5, omega=0.5); power law insertion length distribution (a=3, M=60); power law deletion length distribution (a=3.1, M=500); insertion rate = deletion rate = 0.01. The sequence length (without gaps) was set to approximately 350. A random tree was generated for each simulation with the following parameters: birth=1.1, death=0.2566 and sample mut=0.34. Varying numbers of sequences were simulated (see Figs. 1 and 2) based on the above parameters.

FastML was given the simulated multiple sequence alignment together with either (i) the simulated tree (Fig. 1) or (ii) a tree built with the NJ algorithm (Fig. 2). The running time for each of these scenarios was measured for (a) amino acid sequences (JTT model); (b) nucleotide sequences (JC model); and (c) codon sequences (Yang model). The average running times of FastML over twenty replicates of such simulated sequences are shown below. In addition, an estimate of the running time is given for each run.

FIGURE 1

FIGURE 2

Input
FastML requires only an MSA as input. The MSA may consist of any type of molecular sequence (nucleotide, protein, or codon sequences). The sequences must be in Fasta format. If you are working with other sequence file formats, such as Clustal or Phylip, we suggest using software such as BioEdit to convert your file to Fasta format.
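Before submitting a file, it can be useful to check locally that it parses as Fasta and that all sequences have the same length, as expected for an MSA. The following Python sketch is only an illustration; the function names are hypothetical and the checks the server itself performs may differ.

```python
def read_fasta(text):
    """Parse Fasta-formatted text into a {name: sequence} dictionary."""
    records, name = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].split()
            if not header:
                raise ValueError("empty sequence name in header line")
            name = header[0]  # keep only the first word of the header
            if name in records:
                raise ValueError(f"duplicate sequence name: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}

def check_alignment(records):
    """Return the alignment length, or raise if the sequences differ in length."""
    lengths = {len(seq) for seq in records.values()}
    if len(lengths) != 1:
        raise ValueError(f"sequences are not aligned, lengths found: {sorted(lengths)}")
    return lengths.pop()

example = ">seq1 human\nACG-T\nAC\n>seq2\nAC--TAC\n"
records = read_fasta(example)
print(records, check_alignment(records))
```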

Advanced Options and Details

Generating the phylogenetic tree
The ancestral sequence reconstruction is based on a phylogeny that should be consistent with the input MSA. The user may provide a phylogenetic tree file (in Newick format) as input. Alternatively, the user is allowed to choose between two methods for phylogeny reconstruction: (1) The neighbor joining (NJ) algorithm [7] as implemented in the FastML program; or (2) a maximum likelihood (ML) algorithm as implemented in the RAxML program [8]. In either case, the branch lengths are optimized based on the ML approach and using an expectation maximization (EM) algorithm [9, 10].
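When a user-supplied tree is used, its leaf names need to match the sequence names in the MSA. A quick local check could look like the Python sketch below. It assumes a simple Newick string with unquoted names; the function names are hypothetical and the regular expression is not a full Newick parser.

```python
import re

def newick_leaf_names(newick):
    """Extract leaf names: a name that directly follows '(' or ',' is a leaf."""
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)

def compare_tree_to_msa(newick, msa_names):
    """Return (names missing from the tree, names missing from the MSA)."""
    leaves = set(newick_leaf_names(newick))
    names = set(msa_names)
    return sorted(names - leaves), sorted(leaves - names)

tree = "((A:0.1,B:0.2)N1:0.3,C:0.4);"
print(newick_leaf_names(tree))
print(compare_tree_to_msa(tree, ["A", "B", "D"]))
```

Labels of internal nodes (such as N1 above) follow a closing parenthesis and are therefore not reported as leaves.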
Evolutionary models
FastML implements a collection of evolutionary models, chosen according to the type of the molecular sequences.

For nucleotides, four evolutionary models are implemented in the FastML server: (i) the Jukes and Cantor 69 model (JC69), which assumes equal base frequencies and equal substitution rates [11]; (ii) the Tamura 92 model, which uses a single parameter to capture variation in G-C content [12]; (iii) the HKY85 model, which distinguishes between transitions and transversions and allows unequal base frequencies [13]; and (iv) the General Time Reversible (GTR) model, which is the most general time-reversible model. The GTR parameters consist of an equilibrium base frequency vector, giving the frequency at which each base occurs at each site, and the rate matrix [14].
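As an illustration of the simplest of these models, the JC69 transition probabilities have a closed form. The Python sketch below assumes that the branch length t is measured in expected substitutions per site; the function name is hypothetical and is not part of FastML.

```python
import math

BASES = "ACGT"

def jc69_prob(i, j, t):
    """P(base j at the end of a branch | base i at its start) under JC69."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

for t in (0.0, 0.1, 1.0, 10.0):
    row = [round(jc69_prob("A", b, t), 4) for b in BASES]
    print(t, row)
```

Each row sums to one, a branch of length zero leaves the base unchanged, and for long branches every base approaches a probability of 0.25, reflecting the equal base frequencies assumed by the model.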

For amino acid sequences, different matrices are available, including JTT [15], Dayhoff [16], WAG [17], and LG [18]. The WAG matrix has been inferred from a large database of sequences comprising a broad range of protein families and is thus suited for distantly related amino acid sequences [17]. The LG matrix [18] incorporates variability of evolutionary rates across sites and was shown to outperform other substitution matrices for proteins. Moreover, two context-dependent matrices are available: the mtREV [19] and cpREV [20] matrices, which are suitable for mitochondrial and chloroplast DNA-encoded proteins, respectively.
For codon sequences, the theoretical M5 model [21], the empirical codon matrix [22], and the MEC model [23] are implemented.
Branch lengths optimization
In order to reduce the running time of the reconstruction, the user may turn off the optimization of branch lengths. Note that the accuracy of the branch lengths has a critical effect on the accuracy of the reconstruction. Therefore, this option is recommended for use only if the branch lengths are already optimized according to the same evolutionary model that is used for the reconstruction.
Marginal reconstruction vs. joint reconstruction
The most likely reconstructions of the marginal and the joint maximum-likelihood methods may differ in some cases (see [1]). In marginal reconstruction, the most likely sequence at a specific internal node is inferred, averaging over all possible ancestral states at all other nodes. In joint reconstruction, by contrast, the most likely set of ancestral states at all the internal nodes is inferred. When accounting for among site rate variation (using a gamma distribution, see below), the time complexity of the joint reconstruction is exponential. Although in most practical scenarios both types of reconstruction are applicable to large datasets, turning off the joint reconstruction may speed up the analysis.
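The difference between the two criteria can be made concrete on a toy example. The Python sketch below enumerates every assignment of states to the internal nodes of a three-leaf tree at a single site under JC69. This brute-force enumeration is only for illustration and is not the algorithm used by FastML [1, 2]; the tree, the branch lengths and the function names are hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"
# Hypothetical tree ((a,b)X,c)root, stored as child -> (parent, branch length).
PARENT = {"X": ("root", 0.1), "c": ("root", 0.3), "a": ("X", 0.2), "b": ("X", 0.2)}
INTERNAL = ["root", "X"]

def jc69(i, j, t):
    """JC69 transition probability for a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def joint_table(leaves):
    """P(internal states, observed leaves) for every internal assignment."""
    table = {}
    for combo in product(BASES, repeat=len(INTERNAL)):
        state = {**leaves, **dict(zip(INTERNAL, combo))}
        p = 0.25  # uniform base frequency at the root under JC69
        for child, (parent, t) in PARENT.items():
            p *= jc69(state[parent], state[child], t)
        table[combo] = p
    return table

def reconstruct(leaves):
    """Return (joint reconstruction, marginal posteriors per internal node)."""
    table = joint_table(leaves)
    total = sum(table.values())  # likelihood of the observed leaves
    joint = dict(zip(INTERNAL, max(table, key=table.get)))
    marginal = {}
    for k, node in enumerate(INTERNAL):
        # Marginal: sum over the states of all other internal nodes.
        marginal[node] = {
            b: sum(p for combo, p in table.items() if combo[k] == b) / total
            for b in BASES
        }
    return joint, marginal

joint, marginal = reconstruct({"a": "A", "b": "G", "c": "G"})
print("joint:", joint)
for node, post in marginal.items():
    print("marginal", node, {b: round(p, 3) for b, p in post.items()})
```

The joint reconstruction picks the single best combination of states, whereas the marginal reconstruction reports, for each node separately, a posterior distribution from which the most probable state is taken.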
Accounting for among site rate variation
By default, FastML uses a gamma distribution to model among site rate variation. The user may instead choose to assume a homogeneous model, resulting in a much faster, yet less accurate, reconstruction.
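The idea behind modelling among site rate variation can be sketched as follows. A common approximation, assumed here and not confirmed for FastML, replaces the continuous gamma distribution with a few equally probable rate categories; the likelihood of a site is then averaged over the categories, and an empirical Bayesian posterior over categories can be computed for each site. In this Python sketch the category rates are obtained by crude Monte Carlo binning rather than exact quantiles, the "tree" is a single pair of sequences, and all names and parameter values are hypothetical.

```python
import math
import random

def discrete_gamma_rates(alpha, k, draws=20000, seed=0):
    """Approximate mean rates of k equally probable gamma categories (mean rate 1)."""
    rng = random.Random(seed)
    # gammavariate(shape, scale): scale 1/alpha gives an expected rate of 1.
    sample = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(draws))
    size = draws // k  # any leftover draws are ignored
    means = [sum(sample[i * size:(i + 1) * size]) / size for i in range(k)]
    overall = sum(means) / k
    return [m / overall for m in means]  # rescale so the average rate is exactly 1

def jc69(i, j, t):
    """JC69 transition probability for a branch of length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if i == j else 0.25 - 0.25 * decay

def category_likelihoods(x, y, t, rates):
    """Likelihood of observing bases x and y at distance t, per rate category."""
    return [0.25 * jc69(x, y, r * t) for r in rates]

def site_likelihood(x, y, t, rates):
    """Average over equally probable rate categories."""
    like = category_likelihoods(x, y, t, rates)
    return sum(like) / len(like)

def rate_posteriors(x, y, t, rates):
    """Posterior probability of each rate category given the site data."""
    like = category_likelihoods(x, y, t, rates)
    total = sum(like)
    return [value / total for value in like]

rates = discrete_gamma_rates(alpha=0.5, k=4)
print("rates:", [round(r, 3) for r in rates])
print("conserved site:", [round(p, 3) for p in rate_posteriors("A", "A", 0.5, rates)])
print("variable site: ", [round(p, 3) for p in rate_posteriors("A", "G", 0.5, rates)])
print("homogeneous model:", site_likelihood("A", "G", 0.5, [1.0]))
```

A conserved site shifts posterior weight towards the slow categories and a variable site towards the fast ones. The homogeneous model corresponds to a single category with rate 1, which is why it requires less computation.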

Output

FastML directs you to a web page called "FastML Job Status Page". This web page is automatically updated every 30 seconds, showing messages regarding the different stages of the server activity. When the calculation finishes, several links appear.

Graphical representation of the entire reconstructed ancestral sequences: Using the Jalview multiple alignment editor [24] the server generates two windows: the first contains the MSA that was given as input to which the most probable ancestral sequences are added; the second contains the graphical representation of the phylogeny. Using these two windows, the server provides:
- A projection of the ancestral sequences onto the phylogeny: Internal nodes of the phylogenetic trees are numbered so that their associated reconstructed ancestral sequences can be easily located in the MSA.
- A color-scaled projection of the reconstruction's confidence: In the MSA window, the reconstructed ancestral sequences at each node are color-coded according to the posterior probabilities. Green represents high confidence, while pink represents relatively low confidence. In addition, by moving the cursor over the ancestral sequences, one can view the exact posterior probability of each node and each site.
- An option to download specific sequences of interest: Clicking on a specific node of interest on the phylogeny selects all its descendant leaves, both on the tree and in the MSA. The user can then download the sequences of these leaves by right-clicking the name of any selected sequence in the MSA window and then clicking on "Selection" and "Output to Textbox". Similarly, the user can download the ancestral sequence reconstruction of a specific internal node by selecting it in the MSA window.
These graphical outputs are available for: (i) the characters joint reconstruction; (ii) the characters marginal reconstruction; and (iii) the integration of the marginal reconstruction together with the indel reconstruction.
Output files
1. Marginal reconstruction:
  - Ancestral sequences according to the marginal reconstruction (with or without reconstruction of indels): An MSA file (Fasta format) that contains the most probable ancestral sequences according to the marginal reconstruction.
  - A graphical logo of the inferred ancestral sequences: Here we provide an option to generate for each ancestral node a logo figure using WebLogo [25] that represents the posterior probabilities of each reconstructed character, thus showing all possible alternative reconstructions in one figure.
  - A set of the most likely ancestral sequences: Here we provide an option to generate for each ancestral node a set of the k most probable ancestral sequences (where k is defined by the user).
  - A sample of ancestral sequences: Here we provide an option to generate for each ancestral node a set of l sequences sampled according to the posterior probabilities for each site (where l is defined by the user).
  - Posterior probabilities of the most probable ancestral sequence reconstructions: The integrated posterior probability of each character and indel at each internal node.
  - Posterior probability of each character at each site and each internal node. The possible reconstructed characters are ordered according to their probability, from the most likely to the least probable one.
  - Posterior probability of each indel event at each ancestral node.
2. Joint reconstruction:
  - Ancestral sequences according to the joint reconstruction: An MSA file (Fasta format) that contains the most probable ancestral sequences according to the joint reconstruction.
  - Log likelihood values of the ancestral sequence reconstruction at each position.
3. Phylogeny outputs:
  - Tree in Newick format: The reconstructed phylogeny in Newick format, including the names used for the leaves and internal nodes.
  - Tree in Ancestral format: The reconstructed phylogeny in ancestral format, in which the parent node and the child nodes of each node of the tree are specified.
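To illustrate how the per-site posterior probabilities of the marginal reconstruction relate to the "k most probable sequences" and "sample of sequences" outputs, here is a minimal Python sketch. It assumes that the probability of a whole sequence is the product of its per-site posteriors (sites treated as independent); the posterior values, the function names and this procedure itself are illustrative and are not taken from the server's code.

```python
import heapq
import random

def k_most_probable(site_posteriors, k):
    """Return the k most probable sequences as (sequence, probability) pairs."""
    best = [(1.0, "")]
    for post in site_posteriors:
        # Extend every retained prefix by every character, then keep the top k.
        candidates = [(p * q, seq + ch) for p, seq in best for ch, q in post.items()]
        best = heapq.nlargest(k, candidates)
    return [(seq, p) for p, seq in best]

def sample_sequences(site_posteriors, n, seed=None):
    """Draw n sequences, sampling each site from its posterior distribution."""
    rng = random.Random(seed)
    return [
        "".join(
            rng.choices(list(post), weights=list(post.values()))[0]
            for post in site_posteriors
        )
        for _ in range(n)
    ]

# Hypothetical posteriors for one ancestral node with three sites.
posteriors = [{"A": 0.9, "G": 0.1}, {"C": 0.6, "T": 0.4}, {"G": 1.0}]

for seq, prob in k_most_probable(posteriors, 3):
    print(seq, round(prob, 3))
print(sample_sequences(posteriors, 5, seed=1))
```

Keeping only the k best prefixes at every site is sufficient here, because any sequence among the k best overall must extend one of the k best prefixes when the probability is a product over sites.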

References

1. Pupko, T., et al., A fast algorithm for joint reconstruction of ancestral amino acid sequences. Mol Biol Evol, 2000. 17(6): p. 890-6.
2. Pupko, T., et al., A branch-and-bound algorithm for the inference of ancestral amino-acid sequences when the replacement rate varies among sites: Application to the evolution of five gene families. Bioinformatics, 2002. 18(8): p. 1116-23.
3. Simmons, M.P. and H. Ochoterena, Gaps as characters in sequence-based phylogenetic analyses. Syst Biol, 2000. 49(2): p. 369-81.
4. Cohen, O., et al., A likelihood framework to analyse phyletic patterns. Philos Trans R Soc Lond B Biol Sci, 2008. 363(1512): p. 3903-11.
5. Cohen, O. and T. Pupko, Inference of gain and loss events from phyletic patterns using stochastic mapping and maximum parsimony--a simulation study. Genome Biol Evol, 2011. 3: p. 1265-75.
6. Sankoff, D., Minimal Mutation Trees of Sequences. Siam Journal on Applied Mathematics, 1975. 28(1): p. 35-42.
7. Saitou, N. and M. Nei, The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol, 1987. 4(4): p. 406-25.
8. Stamatakis, A., T. Ludwig, and H. Meier, RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics, 2005. 21(4): p. 456-63.
9. Felsenstein, J., Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol, 1981. 17(6): p. 368-76.
10. Friedman, N., et al., A structural EM algorithm for phylogenetic inference. J Comput Biol, 2002. 9(2): p. 331-53.
11. Jukes, T.H. and C.R. Cantor, Evolution of protein molecules, in Mammalian protein metabolism, H.N. Munro, Editor. 1969, New York: Academic Press. p. 21-123.
12. Tamura, K., Estimation of the number of nucleotide substitutions when there are strong transition-transversion and G+C-content biases. Mol Biol Evol, 1992. 9(4): p. 678-87.
13. Hasegawa, M., H. Kishino, and T. Yano, Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. J Mol Evol, 1985. 22(2): p. 160-74.
14. Tavare, S., Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci., 1986. 17: p. 57-86.
15. Jones, D.T., W.R. Taylor, and J.M. Thornton, The rapid generation of mutation data matrices from protein sequences. Comput Appl Biosci, 1992. 8(3): p. 275-82.
16. Dayhoff, M., R. Schwartz, and B. Orcutt, A model of evolutionary change in proteins, in Atlas of protein sequence and structure. Volume 5, suppl. 3., M. Dayhoff, Editor. 1978, Washington (DC): National Biomedical Research Foundation. p. 345-352.
17. Whelan, S. and N. Goldman, A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol, 2001. 18(5): p. 691-9.
18. Le, S.Q. and O. Gascuel, An improved general amino acid replacement matrix. Mol Biol Evol, 2008. 25(7): p. 1307-20.
19. Adachi, J. and M. Hasegawa, Model of amino acid substitution in proteins encoded by mitochondrial DNA. J Mol Evol, 1996. 42(4): p. 459-68.
20. Adachi, J., et al., Plastid genome phylogeny and a model of amino acid substitution for proteins encoded by chloroplast DNA. J Mol Evol, 2000. 50(4): p. 348-58.
21. Nielsen, R. and Z. Yang, Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics, 1998. 148(3): p. 929-36.
22. Schneider, A., G.M. Cannarozzi, and G.H. Gonnet, Empirical codon substitution matrix. BMC Bioinformatics, 2005. 6: p. 134.
23. Doron-Faigenboim, A. and T. Pupko, A combined empirical and mechanistic codon model. Mol Biol Evol, 2007. 24(2): p. 388-97.
24. Waterhouse, A.M., et al., Jalview Version 2--a multiple sequence alignment editor and analysis workbench. Bioinformatics, 2009. 25(9): p. 1189-1191.
25. Crooks, G.E., G. Hon, J.M. Chandonia, and S.E. Brenner, WebLogo: a sequence logo generator. Genome Res, 2004. 14(6): p. 1188-1190.
