ATGC: Phylogenetic models

LG: An Improved, General Amino-Acid Replacement Matrix

Le S.Q., Gascuel O.

Molecular Biology and Evolution. 2008 25(7):1307-20.

Amino-acid replacement matrices are an essential basis of protein phylogenetics. They are used to compute substitution probabilities along phylogeny branches, and thus the likelihood of the data. They are also essential in protein alignment. A number of replacement matrices and methods to estimate these matrices from protein alignments have been proposed since the seminal work of Dayhoff et al. (1972). An important advance was achieved by Whelan and Goldman (2001), who designed an efficient maximum-likelihood estimation approach that accounts for the phylogenies of sequences within each training alignment. We further refine this method by incorporating the variability of evolutionary rates across sites in the matrix estimation, and by using a much larger and more diverse database than BRKALN, which was used to estimate the WAG matrix. To estimate our new matrix (called LG), we use an adaptation of the XRATE software and 3,912 alignments from Pfam, comprising ~50,000 sequences and ~6.5 million residues overall. To evaluate the performance of LG, we use an independent sample consisting of 59 alignments from TreeBase, and randomly divide the Pfam alignments into 3,412 training and 500 test alignments. The comparison with WAG and JTT shows a clear likelihood improvement. With TreeBase, we find that: (1) the average AIC gain per site is 0.25 and 0.42, when compared to WAG and JTT, respectively; (2) LG is significantly better than WAG for 38 alignments (among 59), and significantly worse for only 2 alignments; (3) tree topologies inferred with LG, WAG and JTT frequently differ, indicating that using LG impacts not only the likelihood value but also the output tree. Results with the test alignments from Pfam are analogous.
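
To illustrate how a replacement matrix yields substitution probabilities along a branch, here is a minimal sketch under the usual time-reversible formulation: off-diagonal rates are Q[i][j] = r[i][j] * pi[j], Q is rescaled to one expected substitution per site per unit of time, and P(t) = exp(Qt). The function names and the 3-state toy values are illustrative assumptions, not part of LG or PhyML. Production software typically relies on an eigendecomposition rather than the simple series used here.

```python
def build_rate_matrix(exch, freqs):
    """Build Q with Q[i][j] = exch[i][j] * freqs[j] (i != j) and rows summing to zero."""
    n = len(freqs)
    q = [[exch[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so this sums off-diagonals
    # Rescale so that one unit of time equals one expected substitution per site.
    mean_rate = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / mean_rate for x in row] for row in q]


def mat_mul(a, b):
    """Plain square-matrix product."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_probabilities(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) with a truncated Taylor series and repeated squaring."""
    n = len(q)
    step = t / (2 ** squarings)
    a = [[x * step for x in row] for row in q]
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [row[:] for row in identity]
    term = [row[:] for row in identity]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


# Toy 3-state example (illustrative values, NOT taken from the LG matrix).
toy_exch = [[0.0, 1.0, 2.0],
            [1.0, 0.0, 0.5],
            [2.0, 0.5, 0.0]]
toy_freqs = [0.5, 0.3, 0.2]
toy_q = build_rate_matrix(toy_exch, toy_freqs)
toy_p = transition_probabilities(toy_q, t=0.5)
```

With the real LG model, `exch` would be the 20 x 20 symmetric exchangeability matrix and `freqs` the 20 equilibrium frequencies; each row of the resulting P(t) then gives the probabilities of observing each amino acid at the end of a branch of length t.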

This web page provides: the LG matrix (learned from all 3,912 Pfam alignments); a PhyML (Guindon & Gascuel 2003) implementation of LG; the 500 Pfam test alignments; the 59 TreeBase test alignments; the AIC results that we obtained on these alignments with a number of options and models; and the AIC difference between LG and WAG as a function of the gamma shape parameter (alpha).

Download

PhyML with LG (LG is the default option when analyzing protein datasets).
LG paper (pdf).
LG Model (Equilibrium amino-acid frequencies and exchangeability matrix in PAML format).
3,912 Pfam training alignments (PHYLIP format).
500 Pfam test alignments (PHYLIP format).
59 TreeBase test alignments (PHYLIP format).
AIC gain of LG+G4+I over WAG+G4+I as a function of the gamma shape parameter (pdf).
Results with Pfam and TreeBase test alignments: Excel, Text file.
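
As a rough guide to reading the LG model file, the sketch below parses a PAML-style matrix. It assumes the conventional PAML layout: the strictly lower triangle of the symmetric exchangeability matrix (190 values for 20 amino acids), followed by the 20 equilibrium frequencies, with amino acids in the order A R N D C Q E G H I L K M F P S T W Y V. Any trailing free text is ignored. The function name `parse_paml_matrix` is hypothetical, and the layout should be checked against the downloaded file before relying on it.

```python
def parse_paml_matrix(text, n_states=20):
    """Parse lower-triangular exchangeabilities followed by n_states frequencies."""
    values = []
    for token in text.split():
        try:
            values.append(float(token))
        except ValueError:
            break  # assumed: anything after the numeric block is a comment
    n_exch = n_states * (n_states - 1) // 2
    if len(values) < n_exch + n_states:
        raise ValueError("expected at least %d numbers, found %d"
                         % (n_exch + n_states, len(values)))
    exch = [[0.0] * n_states for _ in range(n_states)]
    pos = 0
    for i in range(1, n_states):
        for j in range(i):
            # Fill both halves so the returned matrix is symmetric.
            exch[i][j] = exch[j][i] = values[pos]
            pos += 1
    freqs = values[n_exch:n_exch + n_states]
    return exch, freqs


# Toy 3-state input in the same layout (illustrative values, not LG).
toy_text = """
0.5
0.2 0.3

0.5 0.3 0.2
trailing comment text is ignored
"""
toy_exch, toy_freqs = parse_paml_matrix(toy_text, n_states=3)
```

For the actual file, one would read its contents into a string and call `parse_paml_matrix(contents)` with the default of 20 states.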

Note

Model columns display log-likelihood values.
WAG' is estimated from our large Pfam alignment database, using a procedure analogous to that of the original WAG.
All models are used with the +G4+I option unless explicitly stated otherwise (-G-I or +G-I).
As PhyML continues to evolve (e.g. with improved tree search algorithms), results may change slightly over time.
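
Since the model columns contain log-likelihoods, an AIC gain per site can be recomputed from them. The sketch below assumes the standard definition AIC = 2k - 2 lnL and takes the per-site gain as the AIC difference divided by the alignment length. When both models are fixed matrices used with the same options on the same taxa, the numbers of free parameters are equal and cancel. The function names and the numeric values are illustrative and are not taken from the result files.

```python
def aic(log_likelihood, n_free_params):
    """Akaike information criterion: 2k - 2 lnL (lower is better)."""
    return 2.0 * n_free_params - 2.0 * log_likelihood


def aic_gain_per_site(lnl_new, lnl_ref, n_sites, k_new=0, k_ref=0):
    """Positive values mean the new model is preferred over the reference."""
    return (aic(lnl_ref, k_ref) - aic(lnl_new, k_new)) / n_sites


# Made-up log-likelihoods for one alignment of 200 sites (illustrative only).
example_gain = aic_gain_per_site(lnl_new=-12345.6, lnl_ref=-12370.6, n_sites=200)
```

Averaging this quantity over a set of test alignments gives a summary comparable in form to the per-site gains quoted in the abstract.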
