Phylogenetic Mixture Models for Proteins
Le S.Q., Lartillot N., Gascuel O.
Philosophical Transactions of the Royal Society B. 2008, 363:3965-3976.
Standard protein substitution models use a single amino-acid replacement rate matrix which summarizes the biological, chemical and physical properties of amino acids. However, site evolution is highly heterogeneous and depends on many factors: genetic code, solvent exposure, secondary and tertiary structure, protein function, etc. These impact the substitution pattern, and in most cases a single replacement matrix is not enough to represent all the complexity of the evolutionary processes. This paper explores, in a maximum-likelihood framework, phylogenetic mixture models, which combine several amino-acid replacement matrices to better fit protein evolution. We learn these mixture models from a large alignment database extracted from HSSP, and test the performance using independent alignments from Treebase. We compare unsupervised learning approaches, where the site categories are unknown, to supervised ones, where estimation uses the known category of each site, based on its solvent exposure or its secondary structure. All our models are combined with gamma-distributed rates across sites. Results show that highly significant likelihood gains are obtained when using mixture models, compared to the best available single replacement matrices. Mixtures of matrices also improve over mixtures of profiles in the vein of the CAT model. The unsupervised approach tends to be better than the supervised one, but it appears difficult to implement and highly sensitive to the starting values of the parameters, meaning that the supervised approach is still of interest for initialization. Using an unsupervised model involving 3 matrices, the average AIC gain per site with Treebase test alignments is 0.28, 0.46 and 0.59, compared to LG, WAG and JTT, respectively. This 3-matrix model is significantly better than LG for 36 alignments (among 57), and significantly worse for only 1 alignment.
Moreover, tree topologies inferred with our mixture models frequently differ from those obtained with single matrices, indicating that using these mixtures affects not only the likelihood value but also the output tree.
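To make the "AIC gain per site" figures concrete, here is a minimal sketch of how such a gain could be computed. It assumes the standard definition AIC = 2k - 2 ln L and that the per-site gain is the AIC difference divided by the alignment length; the paper's exact convention should be checked before relying on it. The function names and all numbers are hypothetical, chosen only for illustration.

```python
def aic(log_likelihood, n_free_params):
    """Standard Akaike information criterion (lower is better)."""
    return 2.0 * n_free_params - 2.0 * log_likelihood


def aic_gain_per_site(lnl_single, k_single, lnl_mixture, k_mixture, n_sites):
    """Positive values mean the mixture model is preferred.

    Assumption: the gain is the AIC difference (single-matrix minus
    mixture) divided by the number of alignment sites.
    """
    return (aic(lnl_single, k_single) - aic(lnl_mixture, k_mixture)) / n_sites


# Hypothetical values, not taken from the paper or its test alignments.
gain = aic_gain_per_site(
    lnl_single=-25000.0, k_single=40,    # e.g. a single-matrix fit
    lnl_mixture=-24900.0, k_mixture=42,  # e.g. a mixture fit with 2 extra parameters
    n_sites=500,
)
print(f"AIC gain per site: {gain:.3f}")
```

With these made-up inputs the mixture gains 100 log-likelihood units at the cost of two extra free parameters, so the penalty term barely reduces the per-site gain.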
Models in detail
EX2: two-matrix model based on accessibility to solvent (buried/exposed).
EX3: three-matrix model based on accessibility to solvent (buried/intermediate/highly-exposed).
EHO: three-matrix model based on secondary structure (extended/helix/other).
UL2: two-matrix unsupervised model.
UL3: three-matrix unsupervised model.
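As an illustration of how a mixture of matrices is commonly evaluated at a single alignment site, the sketch below combines per-matrix site likelihoods using the matrix proportions as weights, and also returns the posterior probability of each matrix for that site. This is a generic mixture-model sketch, not the PhyML-mixtures implementation: the per-matrix likelihoods are assumed to be already computed (in practice by tree pruning, including the gamma rate categories), and the function name and numbers are hypothetical.

```python
import math


def mixture_site_log_likelihood(component_likelihoods, weights):
    """Return (log-likelihood, posteriors) for one site under a mixture.

    component_likelihoods[m]: likelihood of the site under matrix m
                              (assumed precomputed elsewhere).
    weights[m]:               proportion of matrix m; must sum to 1.
    """
    if len(component_likelihoods) != len(weights):
        raise ValueError("one weight is required per component")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    # Site likelihood is the weighted sum over mixture components.
    total = sum(w * l for w, l in zip(weights, component_likelihoods))
    # Posterior probability that the site belongs to each component.
    posteriors = [w * l / total for w, l in zip(weights, component_likelihoods)]
    return math.log(total), posteriors


# Hypothetical three-matrix example (values are illustrative only).
site_lnl, post = mixture_site_log_likelihood(
    component_likelihoods=[2e-5, 8e-5, 1e-5],
    weights=[0.5, 0.3, 0.2],
)
print(site_lnl, post)
```

The alignment log-likelihood would then be the sum of these per-site values; the posteriors indicate which matrix best explains each site.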
Phylogenetic Mixture Models for Proteins paper (pdf).
PhyML-mixtures: PhyML version for mixture of matrix models (EX2, EX3, EHO, UL2, and UL3). This version also implements PhyML-Structure.
PhyML with profile models (C10-C60).
UL3 in PAML format.
All models in one Excel sheet, or in text files: amino-acid frequencies, and rates and proportions.
All results with Treebase test alignments: Excel sheet.
These alignments are available from the LG web page (excluding the two large phylogenomic alignments).
See also :