Maximum likelihood estimation of amino acid replacement rate matrix
Cuong Cao Dang, Vincent Lefort, Vinh Sy Le, Quang Si Le, and Olivier Gascuel.
Bioinformatics. 2011, 27(19):275860.
Summary
Amino acid replacement rate matrices are an essential basis of protein studies (
e.g. in phylogenetics or alignment).
A number of generalpurpose matrices have been proposed (
e.g. JTT, WAG, LG) since the seminal work of Margaret Dayhoff and coworkers.
However, it has been shown that matrices specific to certain protein groups (
e.g. mitochondrial) or life domains (
e.g. viruses) differ significantly from general average matrices, and thus perform better when applied to the data they are dedicated.
This web server implements the maximumlikelihood estimation procedure of
Le and Gascuel (2008) and provides a number of tools and facilities.
Users upload a set of multiple protein alignments from their domain of interest, and receive the resulting matrix by email, along with statistics and comparisons with other matrices.
A nonparametric bootstrap is performed as an option, to assess the variability of replacement rate estimates.
Maximumlikelihood trees inferred using the estimated rate matrix are also computed as an option, for each input alignment.
Finelytuned procedures and uptodate ML software (
PhyML 3.0,
XRate) are combined to perform all these heavy computations on our clusters.
Data are protein alignments in
PHYLIP or FASTA format.
As 208 parameters are estimated, large amount data are required, typically hundreds or even thousands of alignments.
For computing time reasons, each alignment has to contain a limited number of taxa (<100).
Too large alignments can be divided in several pieces and given separately.
Users are allowed to provide a starting matrix to initialize the estimation procedure, otherwise
LG is the default.
Once the matrix has been estimated, you will receive it by email in
Paml triangular format, along with a number of files and statistics.
When the bootstrap and/or PhyML options have been checked (and confirmed after email reception of the first batch of results), you will receive separate emails containing the corresponding results.
All these estimations require a lot of computations, typically several days or even weeks.
You should thus ensure that your data are well selected, carefully aligned, etc.
Moreover, only one run is allowed per user at the same time.
Example
A complete example, with all input and output files taken from the estimation of the FLU matrix (
Dang et al. 2010) is available
here.
User guide
Download the
manual in pdf.
Input
Users upload an input dataset
D, that is, a compressed file containing a set of multiple protein alignments in FASTA or PHYLIP interleaved or sequential format (
Felsenstein, 1993).
These multiple alignments may be provided as a flat list or be included in a directory. The compressed file has to be in .zip or .gz format (as obtained from WinZip or any compressing software).
Most multiple alignment programs provide both PHYLIP format (e.g. CLUSTALW) and FASTA format (e.g. CLUSTALW, MAFFT, MUSCLE, PROBCONS).
To convert other formats into FASTA or PHYLIP formats you can use
READSEQ.
D typically contains hundreds or even thousands of multiple alignments.
However, each alignment has to contain less than 100 sequences to reduce the computational burden.
Thus, large alignments should be divided in several subalignments given separately.
Output
Matrix estimation
At the end of the matrix estimation, the server sends an email to the address provided by the user.
This email may contain optional links to execute the bootstrap procedure and/or the trees computation using
PhyML.
It contains several result files within a compressed zip archive ('Estimated_matrix.zip'):

ReplacementMatrix.paml
This file contains the aminoacid replacement matrix file in Paml triangular format, estimated from the input dataset.
An example is given below:
0.55
0.51 0.64
0.74 0.15 5.43
1.03 0.53 0.27 0.03
0.91 3.04 1.54 0.62 0.10
1.58 0.44 0.95 6.17 0.02 5.47
1.42 0.58 1.13 0.87 0.31 0.33 0.57
0.32 2.14 3.96 0.93 0.25 4.29 0.57 0.25
0.19 0.19 0.55 0.04 0.17 0.11 0.13 0.03 0.14
0.40 0.50 0.13 0.08 0.38 0.87 0.15 0.06 0.50 3.17
0.91 5.35 3.01 0.48 0.07 3.89 2.58 0.37 0.89 0.32 0.26
0.89 0.68 0.20 0.10 0.39 1.55 0.32 0.17 0.40 4.26 4.85 0.93
0.21 0.10 0.10 0.05 0.40 0.10 0.08 0.05 0.68 1.06 2.12 0.09 1.19
1.44 0.68 0.20 0.42 0.11 0.93 0.68 0.24 0.70 0.10 0.42 0.56 0.17 0.16
3.37 1.22 3.97 1.07 1.41 1.03 0.70 1.34 0.74 0.32 0.34 0.97 0.49 0.55 1.61
2.12 0.55 2.03 0.37 0.51 0.86 0.82 0.23 0.47 1.46 0.33 1.39 1.52 0.17 0.80 4.38
0.11 1.16 0.07 0.13 0.72 0.22 0.16 0.34 0.26 0.21 0.67 0.14 0.52 1.53 0.14 0.52 0.11
0.24 0.38 1.09 0.33 0.54 0.23 0.20 0.10 3.87 0.42 0.40 0.13 0.43 6.45 0.22 0.79 0.29 2.49
2.01 0.25 0.20 0.15 1.00 0.30 0.59 0.19 0.12 7.82 1.80 0.31 2.06 0.65 0.31 0.23 1.39 0.37 0.31
0.0866 0.0440 0.0391 0.0570 0.0193 0.0367 0.0581 0.0833 O.O244 0.0485 0.0862 O.O620 O.O195 0.0384 0.0458
0.0695 0.0610 0.0144 0.0353 0.0709
The triangle contains the (symmetric) exchangeabilities and the last line displays the equilibrium frequencies (see paper for explanations).
Amino acids are ranked in alphabetic order (Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val).
For example, the frequency of Asn is 3.91 and the exchangeability between Asn and Tyr is 1.09.

ScoringMatrices.txt
This file contains a series of score matrices for various evolutionary distances (d).
These matrices are computed using logodds:
That is, the log ratio of the probability of observing x and a change from x to y in time d, divided by the probability of observing x and y aligned by chance.
We use logarithm to base 10, multiply the resulting logodds by 10 and output a rounded integer value, as in original PAM matrices and required by several programs.
Score matrices are provided in Blast format.
They are analogous to PAM matrices, but are inferred from your datasets and reflect their properties.
They are suitable to search for (e.g. using WUBLAST2) and align (e.g. using CLUSTALW or MAFFT) proteins with similar properties.
Just as with PAM matrices, the score matrix to be used depends on the evolutionary distance (d) between the proteins to be analyzed.
SCORE10 (analogous to PAM10) has to be used with closely related proteins, where the expected number of substitutions per site is 0.10, while SCORE250 (analogous to PAM250) has to be used with distant homologs, where the expected number of substitutions per site is 2.50.
See EBI web site for more explanations.

Stats.txt
This file contains general statistics comparing your matrix with the starting matrix and LG (when both differ).
A Ztest is used to assess the significance of the AIC gains (see below) of your matrix, compared to the starting matrix and LG (when both differ).
When the pvalue is small enough (typically <5%) the gain brought by your matrix on the whole training set is statistically significant.

LKComparedToStartingMatrix.txt
This file contains the loglikelihood gain information for all sites, comparing your output matrix to the starting matrix:

The first line is the number of alignments (n) in the input dataset.

The second line is the total number of sites of all alignments in the input dataset.

The next 4n lines contain the loglikelihood gain information for n alignments, with 4 lines for each alignment as below:

The first line is the name of alignment;

The second line is the number of taxa of this alignment;

The third line is the loglikelihood gain (i.e. the difference of loglikelihoods between your matrix and the starting matrix) with this alignment;

The fourth line contains the loglikelihood gain for every site of this alignment.

AICComparedToStartingMatrix.txt
This file contains the AIC gain information for all sites, comparing your output matrix to the starting matrix.
This information is needed in most cases to compare both matrices, when the starting matrix has not been estimated from the input dataset.
Then, the estimation of the output matrix involves 208 additional free parameters, and its likelihood has to be penalized to obtain a fair comparison.
The AIC gain is equal to the twice the loglikelihood gain, minus 416 (= 2 * 208).
The penalty (416) is equally divided between all sites in the input alignments.
The format of this file is the same as the format of LKComparedToStartingMatrix.txt described above.
When the AIC gain is positive (negative) for a given alignment, your output matrix has a better (worse) fit to this alignment than the starting matrix; when the pvalue of the Ztest is less than (say) 5%, this gain (loss) is statistically significant.

LKComparedToLG.txt
This file contains the loglikelihood gain information for all sites, comparing your output matrix to the starting matrix.
This file is not provided when LG is the starting matrix.

AICComparedToLG.txt
This file contains the AIC gain information for all sites, comparing your output matrix to the starting matrix.
This file is not provided when LG is the starting matrix.

FrequenciesHistogram.pdf
This histogramlike graph displays the equilibrium aminoacid frequencies of the output, starting and LG matrices.

ExchCoefComparedToStartingMatrix.pdf
This bubble graph displays the exchangeability coefficients of the output and starting matrices.

RelDiffComparedToStartingMatrix.pdf
This bubble graph illustrates the relative differences between the exchangeabilities of the output matrix (Q) and the initial matrix (S).
Each bubble represents (Q_{ij}  S_{ij}) / (Q_{ij} + S_{ij}), for a given aminoacid pair i, j.
Values of 1/3 and 2/3 mean that Q_{ij} is 2 and 5 times larger than S_{ij}, respectively.
Values of 1/3 and 2/3 mean that S_{ij} is 2 and 5 times larger than Q_{ij}, respectively.

ExchCoefComparedToLG.pdf
This bubble graph displays the exchangeability coefficients of the output and LG matrices.
This file is not provided when LG is the starting matrix.

RelDiffComparedToLG.pdf
This bubble graph illustrates the relative differences between the exchangeabilities of the output and LG matrices.
This file is not provided when LG is the starting matrix.
Bootstrap procedure
If the user checks the "Estimate variability of rate estimates" option, the pipeline performs a nonparametric bootstrap with 10 replicates.
For each replicate and each alignment
Da from
D, we draw with replacement
Da sites to obtain a pseudotraining set, from which we obtain pseudoestimates of the matrix parameters (exchangeabilities and frequencies).
Frequencies tend to be stable, but exchangeabilities are usually much more variable, depending on the training set size.
At the end of the bootstrap procedure, the server sends an email with enclosed files:

StandardDeviations.txt
This file contains (Paml format) the standard deviations of each of the exchangeability coefficients and aminoacid frequencies, as obtained from their 10 pseudoestimates.

RelativeDeviations.txt
This file (Paml format) contains the relative deviations (standard deviation among 10 replicates divided by original estimate) of the exchangeabilities and frequencies.
These are useful to compare the variability of estimates having very different values.
For example, some exchangeabilities have values close to 10.0, while some others are very close to 0.0.
The lower the relative deviation, the more reliable is the estimate.
However, relative deviations of estimates close to 0.0 are often meaningless as we basically divide 0 by 0.
When the original estimate is 0, the relative deviation is set to 0.

MinValues.txt
This file (Paml format) contains the minimum values (among 10 pseudoestimates) of the exchangeabilities and frequencies.

MaxValues.txt
This file (Paml format) contains the maximum values (among 10 pseudoestimates) of the exchangeabilities and frequencies.
Combined with minimum values, this provides a simple interval estimate of the parameter.
Trees computation
If the user checks the "Compute PhyML trees using estimated matrix" option, the pipeline runs
PhyML with the new matrix and standard options (Γ4 distribution of rates across sites, SPR tree search, SHlike branch supports) for all input alignments.
At the end of the computations, the server sends an email with enclosed file:

FinalTrees.zip
This file (zip archive format) contains the trees in Newick format.