Maximum likelihood estimation of amino acid replacement rate matrix

Cuong Cao Dang, Vincent Lefort, Vinh Sy Le, Quang Si Le, and Olivier Gascuel.
Bioinformatics. 2011, 27(19):2758-60.

Summary

Amino acid replacement rate matrices are an essential basis of protein studies (e.g. in phylogenetics or alignment). A number of general-purpose matrices have been proposed (e.g. JTT, WAG, LG) since the seminal work of Margaret Dayhoff and co-workers. However, it has been shown that matrices specific to certain protein groups (e.g. mitochondrial) or life domains (e.g. viruses) differ significantly from general average matrices, and thus perform better when applied to the data they are dedicated.

This web server implements the maximum-likelihood estimation procedure of Le and Gascuel (2008) and provides a number of tools and facilities. Users upload a set of multiple protein alignments from their domain of interest, and receive the resulting matrix by email, along with statistics and comparisons with other matrices. A non-parametric bootstrap is performed as an option, to assess the variability of replacement rate estimates. Maximum-likelihood trees inferred using the estimated rate matrix are also computed as an option, for each input alignment. Finely-tuned procedures and up-to-date ML software (PhyML 3.0, XRate) are combined to perform all these heavy computations on our clusters.

Data are protein alignments in PHYLIP or FASTA format. As 208 parameters are estimated, large amount data are required, typically hundreds or even thousands of alignments. For computing time reasons, each alignment has to contain a limited number of taxa (<100). Too large alignments can be divided in several pieces and given separately. Users are allowed to provide a starting matrix to initialize the estimation procedure, otherwise LG is the default.

Once the matrix has been estimated, you will receive it by email in Paml triangular format, along with a number of files and statistics. When the bootstrap and/or PhyML options have been checked (and confirmed after email reception of the first batch of results), you will receive separate emails containing the corresponding results. All these estimations require a lot of computations, typically several days or even weeks. You should thus ensure that your data are well selected, carefully aligned, etc. Moreover, only one run is allowed per user at the same time.

Example

A complete example, with all input and output files taken from the estimation of the FLU matrix (Dang et al. 2010) is available here.

User guide

Download the manual in pdf.

Input

Users upload an input dataset D, that is, a compressed file containing a set of multiple protein alignments in FASTA or PHYLIP interleaved or sequential format (Felsenstein, 1993). These multiple alignments may be provided as a flat list or be included in a directory. The compressed file has to be in .zip or .gz format (as obtained from WinZip or any compressing software). Most multiple alignment programs provide both PHYLIP format (e.g. CLUSTALW) and FASTA format (e.g. CLUSTALW, MAFFT, MUSCLE, PROBCONS). To convert other formats into FASTA or PHYLIP formats you can use READSEQ. D typically contains hundreds or even thousands of multiple alignments. However, each alignment has to contain less than 100 sequences to reduce the computational burden. Thus, large alignments should be divided in several sub-alignments given separately.

Output

Matrix estimation
At the end of the matrix estimation, the server sends an email to the address provided by the user. This email may contain optional links to execute the bootstrap procedure and/or the trees computation using PhyML. It contains several result files within a compressed zip archive ('Estimated_matrix.zip'):
  1. ReplacementMatrix.paml
    This file contains the amino-acid replacement matrix file in Paml triangular format, estimated from the input dataset.
    An example is given below:
    0.55
    0.51 0.64
    0.74 0.15 5.43
    1.03 0.53 0.27 0.03
    0.91 3.04 1.54 0.62 0.10
    1.58 0.44 0.95 6.17 0.02 5.47
    1.42 0.58 1.13 0.87 0.31 0.33 0.57
    0.32 2.14 3.96 0.93 0.25 4.29 0.57 0.25
    0.19 0.19 0.55 0.04 0.17 0.11 0.13 0.03 0.14
    0.40 0.50 0.13 0.08 0.38 0.87 0.15 0.06 0.50 3.17
    0.91 5.35 3.01 0.48 0.07 3.89 2.58 0.37 0.89 0.32 0.26
    0.89 0.68 0.20 0.10 0.39 1.55 0.32 0.17 0.40 4.26 4.85 0.93
    0.21 0.10 0.10 0.05 0.40 0.10 0.08 0.05 0.68 1.06 2.12 0.09 1.19
    1.44 0.68 0.20 0.42 0.11 0.93 0.68 0.24 0.70 0.10 0.42 0.56 0.17 0.16
    3.37 1.22 3.97 1.07 1.41 1.03 0.70 1.34 0.74 0.32 0.34 0.97 0.49 0.55 1.61
    2.12 0.55 2.03 0.37 0.51 0.86 0.82 0.23 0.47 1.46 0.33 1.39 1.52 0.17 0.80 4.38
    0.11 1.16 0.07 0.13 0.72 0.22 0.16 0.34 0.26 0.21 0.67 0.14 0.52 1.53 0.14 0.52 0.11
    0.24 0.38 1.09 0.33 0.54 0.23 0.20 0.10 3.87 0.42 0.40 0.13 0.43 6.45 0.22 0.79 0.29 2.49
    2.01 0.25 0.20 0.15 1.00 0.30 0.59 0.19 0.12 7.82 1.80 0.31 2.06 0.65 0.31 0.23 1.39 0.37 0.31
    
    0.0866 0.0440 0.0391 0.0570 0.0193 0.0367 0.0581 0.0833 O.O244 0.0485 0.0862 O.O620 O.O195 0.0384 0.0458
    0.0695 0.0610 0.0144 0.0353 0.0709
    
    The triangle contains the (symmetric) exchangeabilities and the last line displays the equilibrium frequencies (see paper for explanations). Amino acids are ranked in alphabetic order (Ala Arg Asn Asp Cys Gln Glu Gly His Ile Leu Lys Met Phe Pro Ser Thr Trp Tyr Val). For example, the frequency of Asn is 3.91 and the exchangeability between Asn and Tyr is 1.09.
  2. ScoringMatrices.txt
    This file contains a series of score matrices for various evolutionary distances (d).
    These matrices are computed using log-odds:
    Log (P(x).P(x→y|d) / P(x).P(y))
    That is, the log ratio of the probability of observing x and a change from x to y in time d, divided by the probability of observing x and y aligned by chance.
    We use logarithm to base 10, multiply the resulting log-odds by 10 and output a rounded integer value, as in original PAM matrices and required by several programs. Score matrices are provided in Blast format. They are analogous to PAM matrices, but are inferred from your datasets and reflect their properties. They are suitable to search for (e.g. using WU-BLAST2) and align (e.g. using CLUSTALW or MAFFT) proteins with similar properties. Just as with PAM matrices, the score matrix to be used depends on the evolutionary distance (d) between the proteins to be analyzed. SCORE10 (analogous to PAM10) has to be used with closely related proteins, where the expected number of substitutions per site is 0.10, while SCORE250 (analogous to PAM250) has to be used with distant homologs, where the expected number of substitutions per site is 2.50.
    See EBI web site for more explanations.
  3. Stats.txt
    This file contains general statistics comparing your matrix with the starting matrix and LG (when both differ). A Z-test is used to assess the significance of the AIC gains (see below) of your matrix, compared to the starting matrix and LG (when both differ). When the p-value is small enough (typically <5%) the gain brought by your matrix on the whole training set is statistically significant.
  4. LKComparedToStartingMatrix.txt
    This file contains the log-likelihood gain information for all sites, comparing your output matrix to the starting matrix:
    • The first line is the number of alignments (n) in the input dataset.
    • The second line is the total number of sites of all alignments in the input dataset.
    • The next 4n lines contain the log-likelihood gain information for n alignments, with 4 lines for each alignment as below:
      • The first line is the name of alignment;
      • The second line is the number of taxa of this alignment;
      • The third line is the log-likelihood gain (i.e. the difference of log-likelihoods between your matrix and the starting matrix) with this alignment;
      • The fourth line contains the log-likelihood gain for every site of this alignment.
  5. AICComparedToStartingMatrix.txt
    This file contains the AIC gain information for all sites, comparing your output matrix to the starting matrix. This information is needed in most cases to compare both matrices, when the starting matrix has not been estimated from the input dataset. Then, the estimation of the output matrix involves 208 additional free parameters, and its likelihood has to be penalized to obtain a fair comparison. The AIC gain is equal to the twice the log-likelihood gain, minus 416 (= 2 * 208). The penalty (416) is equally divided between all sites in the input alignments. The format of this file is the same as the format of LKComparedToStartingMatrix.txt described above. When the AIC gain is positive (negative) for a given alignment, your output matrix has a better (worse) fit to this alignment than the starting matrix; when the p-value of the Z-test is less than (say) 5%, this gain (loss) is statistically significant.
  6. LKComparedToLG.txt
    This file contains the log-likelihood gain information for all sites, comparing your output matrix to the starting matrix.
    This file is not provided when LG is the starting matrix.
  7. AICComparedToLG.txt
    This file contains the AIC gain information for all sites, comparing your output matrix to the starting matrix.
    This file is not provided when LG is the starting matrix.
  8. FrequenciesHistogram.pdf
    This histogram-like graph displays the equilibrium amino-acid frequencies of the output, starting and LG matrices.
  9. ExchCoefComparedToStartingMatrix.pdf
    This bubble graph displays the exchangeability coefficients of the output and starting matrices.
  10. RelDiffComparedToStartingMatrix.pdf
    This bubble graph illustrates the relative differences between the exchangeabilities of the output matrix (Q) and the initial matrix (S).
    Each bubble represents (Qij - Sij) / (Qij + Sij), for a given amino-acid pair i, j.
    Values of 1/3 and 2/3 mean that Qij is 2 and 5 times larger than Sij, respectively.
    Values of -1/3 and -2/3 mean that Sij is 2 and 5 times larger than Qij, respectively.
  11. ExchCoefComparedToLG.pdf
    This bubble graph displays the exchangeability coefficients of the output and LG matrices.
    This file is not provided when LG is the starting matrix.
  12. RelDiffComparedToLG.pdf
    This bubble graph illustrates the relative differences between the exchangeabilities of the output and LG matrices.
    This file is not provided when LG is the starting matrix.
Bootstrap procedure
If the user checks the "Estimate variability of rate estimates" option, the pipeline performs a non-parametric bootstrap with 10 replicates. For each replicate and each alignment Da from D, we draw with replacement |Da| sites to obtain a pseudo-training set, from which we obtain pseudo-estimates of the matrix parameters (exchangeabilities and frequencies). Frequencies tend to be stable, but exchangeabilities are usually much more variable, depending on the training set size. At the end of the bootstrap procedure, the server sends an email with enclosed files:
  1. StandardDeviations.txt
    This file contains (Paml format) the standard deviations of each of the exchangeability coefficients and amino-acid frequencies, as obtained from their 10 pseudo-estimates.
  2. RelativeDeviations.txt
    This file (Paml format) contains the relative deviations (standard deviation among 10 replicates divided by original estimate) of the exchangeabilities and frequencies. These are useful to compare the variability of estimates having very different values. For example, some exchangeabilities have values close to 10.0, while some others are very close to 0.0. The lower the relative deviation, the more reliable is the estimate. However, relative deviations of estimates close to 0.0 are often meaningless as we basically divide 0 by 0. When the original estimate is 0, the relative deviation is set to 0.
  3. MinValues.txt
    This file (Paml format) contains the minimum values (among 10 pseudo-estimates) of the exchangeabilities and frequencies.
  4. MaxValues.txt
    This file (Paml format) contains the maximum values (among 10 pseudo-estimates) of the exchangeabilities and frequencies. Combined with minimum values, this provides a simple interval estimate of the parameter.
Trees computation
If the user checks the "Compute PhyML trees using estimated matrix" option, the pipeline runs PhyML with the new matrix and standard options (Γ4 distribution of rates across sites, SPR tree search, SH-like branch supports) for all input alignments. At the end of the computations, the server sends an email with enclosed file:
  1. FinalTrees.zip
    This file (zip archive format) contains the trees in Newick format.