CAT: Empirical profile mixture models for phylogenetic reconstruction.

Le S.Q., Gascuel O., Lartillot N. Bioinformatics. 2008 Oct 15;24(20):2317-23.

Please cite THESE papers if you use CAT.

Introduction to Empirical Profile Mixture Models.

The standard model: empirical matrices

Standard phylogenetic models used for analysing protein sequences assume that the patterns of amino-acid replacement are identical across all sites of the sequence. They are conveniently summarised by a 20x20 rate matrix specifying the rate of substitution between each pair of amino acids. Biochemical realism of such matrices translates into higher rates of substitution between biochemically similar amino acids (e.g. isoleucine and valine). Concerning the exact values of those rates, there are two main attitudes: either the rates are learnt once and for all on large databases and then reused on any dataset (empirical matrices such as WAG or JTT), or they are treated as free parameters and estimated directly on the dataset under analysis (the GTR approach).
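
As an illustration of how such a rate matrix is commonly assembled, the Python sketch below builds a reversible matrix from symmetric exchangeabilities and stationary frequencies, with off-diagonal entries Q[i][j] = r[i][j] * pi[j]. This is a minimal sketch under stated assumptions, not the code used in any particular program: the alphabet is reduced to three letters for readability, all numerical values are made up, and the function name build_rate_matrix is hypothetical.

```python
# Toy sketch: assemble a reversible rate matrix from exchangeabilities and
# stationary frequencies. A 3-letter alphabet stands in for the 20 amino acids;
# all numerical values are made up for illustration.

ALPHABET = ["I", "V", "K"]  # isoleucine, valine, lysine

# Symmetric exchangeabilities (hypothetical): I and V are biochemically
# similar, so their exchangeability is set higher than either has with K.
EXCHANGEABILITIES = [
    [0.0, 5.0, 0.5],
    [5.0, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]

FREQUENCIES = [0.4, 0.4, 0.2]  # hypothetical stationary frequencies


def build_rate_matrix(exchangeabilities, frequencies):
    """Return Q with Q[i][j] = r[i][j] * pi[j] (i != j), rows summing to zero."""
    n = len(frequencies)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchangeabilities[i][j] * frequencies[j]
        # The diagonal is minus the sum of the off-diagonal rates of the row
        # (q[i][i] is still 0.0 when the sum is taken).
        q[i][i] = -sum(q[i])
    return q


Q = build_rate_matrix(EXCHANGEABILITIES, FREQUENCIES)
for letter, row in zip(ALPHABET, Q):
    print(letter, [round(rate, 3) for rate in row])
```

With the full alphabet of 20 amino acids, the same construction involves 190 exchangeabilities and 20 frequencies.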

An alternative approach: profile mixture models

Over the last few years, we proposed a simple alternative to empirical rate matrices, based on mixtures of stationary probability profiles (Lartillot and Philippe, 2004). Such mixture models explicitly account for the fact that distinct sites are under distinct evolutionary pressures. Through the underlying mixture, the model implicitly clusters sites according to their class of biochemical constraint (hydrophobic, polar, positively charged, etc.). Each class is associated with a probability profile over the 20 amino acids.
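
To give a feel for how a mixture of profiles can cluster sites, the Python sketch below computes, for one alignment column, the posterior probability of each class. It is a deliberately simplified illustration and not the actual likelihood computation: the residues observed at a site are treated as independent draws from the class profile (the phylogenetic tree is ignored), the alphabet is reduced to four letters, and the profiles, weights and the function name class_posteriors are all hypothetical.

```python
import math

# Two hypothetical classes over a toy alphabet (I, V, K, R).
# Each profile is a probability distribution over the alphabet.
PROFILES = [
    {"I": 0.45, "V": 0.45, "K": 0.05, "R": 0.05},  # "hydrophobic" class
    {"I": 0.05, "V": 0.05, "K": 0.45, "R": 0.45},  # "positively charged" class
]
WEIGHTS = [0.5, 0.5]  # hypothetical mixture weights, summing to 1


def class_posteriors(column, profiles, weights):
    """Return (posterior probability of each class, log-likelihood of the site).

    Simplifying assumption: residues in the column are independent draws
    from the class profile, so the tree plays no role here.
    """
    log_terms = []
    for profile, weight in zip(profiles, weights):
        log_lik = sum(math.log(profile[residue]) for residue in column)
        log_terms.append(math.log(weight) + log_lik)
    # Log-sum-exp: the site likelihood is the weighted sum over classes.
    largest = max(log_terms)
    scaled = [math.exp(term - largest) for term in log_terms]
    total = sum(scaled)
    site_log_lik = largest + math.log(total)
    return [value / total for value in scaled], site_log_lik


posteriors, site_log_lik = class_posteriors("IVIV", PROFILES, WEIGHTS)
posteriors_charged, _ = class_posteriors("KRKK", PROFILES, WEIGHTS)
print("IVIV:", [round(p, 4) for p in posteriors])
print("KRKK:", [round(p, 4) for p in posteriors_charged])
```

In this toy setting, a column made of I and V is assigned almost entirely to the first class and a column made of K and R to the second, which is the kind of implicit clustering described above.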

In several instances, we showed that such mixture models provide a better fit than standard models. They perform particularly well on saturated data and, for that reason, are more robust to phylogenetic artefacts caused by the presence of fast-evolving species in the dataset (Lartillot et al., 2007).

However, such profile mixture models were introduced only in a Bayesian context, and were not available in a Maximum Likelihood framework. In addition, thus far, no empirical information was stored a priori in the model concerning the shapes of the profiles. To draw a parallel with standard models, we only implemented the equivalent of the GTR approach: the profiles had to be estimated from the data themselves, which means that the model could be applied only to large datasets.

Empirical profile mixture models

Here, we introduce a series of empirically determined profile mixture models, with the number of components ranging from 20 to 60. In a way, they are to our previous CAT model what WAG or JTT are to the GTR model: a pre-learnt version of the model, which can now be used for analysing small datasets while explicitly accounting for site-specific effects.

We demonstrate that these profile mixtures provide a better statistical fit than currently available empirical matrices (WAG, JTT), in particular on saturated data. They have been implemented in two phylogenetic software packages, PhyML and PhyloBayes. Under PhyML, the C20 model is a good compromise between efficiency and accuracy. Under PhyloBayes, on the other hand, all models from C20 to C60 have about the same computational efficiency. Therefore, in that case, it is probably better to always use C50 or C60.