Quang Si Le, Cuong Cao Dang and Olivier Gascuel.

Mol. Biol. Evol. 2012

Most protein substitution models use a single amino acid replacement matrix summarizing the biochemical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors that influence the substitution patterns. In this paper we investigate the use of different substitution matrices for different site evolutionary rates. Indeed, the variability of evolutionary rates corresponds to one of the most apparent heterogeneity factors among sites, and there is no reason to assume that the substitution patterns remain identical regardless of the evolutionary rate. We first introduce LG4M, which is composed of four matrices, each corresponding to one discrete gamma rate category (out of four). These matrices differ in their amino acid equilibrium distributions and in their exchangeabilities, contrary to the standard gamma model where only the global rate differs from one category to another. Next, we present LG4X, which also uses four different matrices, but leaves aside the gamma distribution and follows a distribution-free scheme for the site rates. All these matrices are estimated from a very large alignment database, and our two models are tested using a large sample of independent alignments. Detailed analysis of resulting matrices and models shows the complexity of amino acid substitutions and the advantage of flexible models such as LG4M and LG4X. Both significantly outperform single-matrix models, providing gains of dozens to hundreds of log-likelihood units for most data sets. LG4X obtains substantial gains compared to LG4M, thanks to its distribution-free scheme for site rates. Since LG4M and LG4X display such advantages but require the same memory space and have comparable running times to standard models, we believe that LG4M and LG4X are relevant alternatives to single replacement matrices. All our models, data and software are downloadable from this web site.
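The relationship between the standard gamma model, LG4M and LG4X can be summarized with a single mixture formula. The notation below is ours and is only a sketch of what the abstract describes: D_s is one alignment site, T the tree, Q_k the replacement matrix of category k, r_k its rate and w_k its weight. The equal weights of 1/4 assume the usual equiprobable discrete gamma categories.

```latex
% Site likelihood as a mixture over four categories (illustrative notation)
L(D_s \mid T) \;=\; \sum_{k=1}^{4} w_k \, L\!\left(D_s \mid T,\; r_k,\; Q_k\right),
\qquad \sum_{k=1}^{4} w_k = 1 .

% Standard gamma model: Q_1 = Q_2 = Q_3 = Q_4 = Q, \; w_k = 1/4, \; r_k given by the discrete gamma with shape \alpha
% LG4M:                 one Q_k per category,       w_k = 1/4, \; r_k given by the discrete gamma with shape \alpha
% LG4X:                 one Q_k per category,       w_k and r_k estimated freely (distribution-free)
```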

This page only describes what differs when the LG4X or LG4M model is selected.

For all other models, we strongly recommend using the standard version available on the PhyML 3.0 web site.

To run LG4X and LG4M you have to use:

- ./Phyml-4X -i test.txt -m LG4X [other options]
- ./Phyml-4X -i test.txt -m LG4M [other options]

- LG4X and LG4M cannot be used with invariant sites (-v) or with the option to estimate amino-acid equilibrium frequencies (-f).
- LG4M automatically comes with the estimation of the shape parameter of the gamma distribution of rates across sites (-a e).
- LG4X automatically comes with the estimation of the weights and rates of its 4 replacement matrices (X1, X2, X3 and X4).
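
The restrictions above can be checked before launching a run. The following Python helper is a minimal sketch, not part of PhyML-4X: the function name `build_phyml4x_command` and its parameters are hypothetical, and it relies only on the options quoted on this page (`-i`, `-m`, `-v`, `-f`).

```python
# Hypothetical helper: builds a Phyml-4X command line and rejects the
# options that this page says cannot be combined with LG4X or LG4M.
FORBIDDEN_OPTIONS = {
    "-v": "invariant sites",
    "-f": "estimation of amino-acid equilibrium frequencies",
}
SUPPORTED_MODELS = ("LG4X", "LG4M")


def build_phyml4x_command(alignment, model, extra_options=(), binary="./Phyml-4X"):
    """Return the command as a list of arguments, or raise ValueError."""
    if model not in SUPPORTED_MODELS:
        raise ValueError(f"model must be one of {SUPPORTED_MODELS}, got {model!r}")
    for option in extra_options:
        if option in FORBIDDEN_OPTIONS:
            raise ValueError(
                f"{option} ({FORBIDDEN_OPTIONS[option]}) cannot be used with {model}"
            )
    return [binary, "-i", alignment, "-m", model, *extra_options]


if __name__ == "__main__":
    command = build_phyml4x_command("test.txt", "LG4X")
    print(" ".join(command))
```

The returned list can be passed directly to `subprocess.run`, which avoids shell quoting issues with alignment file names.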

- Author manuscript (pdf).
- Supplementary tables and figures (pdf).
- Models.
- PhyML-4X: binaries for Linux, macOS and Windows.
- TreeBase alignments and detailed results with 84 TreeBase test alignments.
- HSSP testing alignments, Learning alignments, and ReadMe (more details at PhyML-Structure).
- Detailed results with HSSP test alignments.

- LG web page.
- CAT web page.
- Mixture web page.
- Structure-based models web page.