CAT : Empirical profile mixture models for phylogenetic reconstruction.
Le S.Q., Gascuel O., Lartillot N. Bioinformatics. 2008 Oct 15;24(20):2317-23.
Please cite these papers if you use CAT.
Introduction to Empirical Profile Mixture Models.
The standard model : empirical matrices
Standard phylogenetic models used for analysing protein sequences assume that the patterns of amino-acid replacement are identical across the sequence. They are conveniently summarised in terms of a 20x20 rate matrix, specifying the rate of substitution between each pair of amino acids. Biochemical realism of such matrices translates into higher rates of substitution between biochemically similar amino acids (e.g. Isoleucine and Valine). Concerning the exact values of those rates, there are two main approaches :

The GTR approach : all the parameters of the matrix are learnt directly from the dataset under investigation, along with the other parameters of the model (topology of the tree, branch lengths, etc.). Given the number of additional parameters entailed by a time-reversible 20x20 matrix, this works well only if the dataset is big enough.

The empirical approach : the parameters of the matrix have been learnt on a separate database, made of several dozens or hundreds of single-gene alignments. Such pre-learnt empirical matrices are available from several sources (WAG, JTT, LG).
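To make the parameter burden of the GTR approach concrete, here is a minimal sketch that counts the free parameters of a time-reversible matrix. It assumes the usual parameterisation by symmetric exchangeabilities and stationary frequencies, with one exchangeability fixed to set the overall scale. The function name is hypothetical and is not part of any software mentioned here.

```python
def gtr_free_parameters(n_states: int) -> dict:
    """Count free parameters of a time-reversible rate matrix.

    Assumed parameterisation: Q[i][j] = r[i][j] * pi[j] for i != j,
    with r symmetric and pi summing to 1.
    """
    # One exchangeability per unordered pair of states,
    # minus one that is fixed to set the overall rate scale.
    exchangeabilities = n_states * (n_states - 1) // 2 - 1
    # Stationary frequencies sum to 1, so one of them is determined by the others.
    frequencies = n_states - 1
    return {
        "exchangeabilities": exchangeabilities,
        "frequencies": frequencies,
        "total": exchangeabilities + frequencies,
    }


if __name__ == "__main__":
    print(gtr_free_parameters(20))  # amino-acid alphabet
    print(gtr_free_parameters(4))   # nucleotide alphabet, for comparison
```

Under this parameterisation, the 20-state case adds roughly two hundred parameters on top of the tree and branch lengths, which is why the empirical approach fixes them in advance.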
An alternative approach : profile mixture models
Over the last few years, we proposed a simple alternative to empirical rate matrices, by using mixtures of stationary probability profiles (Lartillot and Philippe, 2004). Such mixture models explicitly account for the fact that distinct sites are under distinct evolutionary pressures. Through the underlying mixture, the model implicitly clusters sites according to their class of biochemical constraint (hydrophobic, polar, positively charged, etc.). Each class is associated with a probability profile over the 20 amino acids.
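The way a profile mixture weights and implicitly clusters sites can be illustrated with a minimal sketch. It rests on several assumptions that are not stated in the text above: each class evolves under an F81-like (Poisson) process whose stationary distribution is the class profile, the tree is reduced to two sequences separated by a branch of length t (in unnormalised rate units), and the alphabet, weights and profiles are toy values invented for the illustration. All function names are hypothetical.

```python
import math


def f81_transition(profile, a, b, t):
    """P(a -> b | t) for an F81-like process with stationary distribution `profile`."""
    decay = math.exp(-t)
    same = 1.0 if a == b else 0.0
    return decay * same + (1.0 - decay) * profile[b]


def class_terms(weights, profiles, a, b, t):
    """Per-class terms w_k * pi_k[a] * P_k(a -> b | t) for one two-sequence site."""
    return [w * p[a] * f81_transition(p, a, b, t)
            for w, p in zip(weights, profiles)]


def site_likelihood(weights, profiles, a, b, t):
    """Mixture likelihood of the site: sum of the per-class terms."""
    return sum(class_terms(weights, profiles, a, b, t))


def class_posteriors(weights, profiles, a, b, t):
    """Posterior probability that the site belongs to each class."""
    terms = class_terms(weights, profiles, a, b, t)
    total = sum(terms)
    return [x / total for x in terms]


# Toy 4-state alphabet with two classes (illustrative values, not real profiles).
WEIGHTS = [0.5, 0.5]
PROFILES = [
    [0.7, 0.1, 0.1, 0.1],  # class favouring state 0
    [0.1, 0.1, 0.1, 0.7],  # class favouring state 3
]

if __name__ == "__main__":
    # A site showing state 0 in both sequences is assigned mostly to the first class.
    print(class_posteriors(WEIGHTS, PROFILES, a=0, b=0, t=0.5))
```

On a real tree, the per-class term would be the full phylogenetic likelihood of the site under that class, but the weighting and the posterior assignment follow the same pattern.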
In several instances, we showed that such mixture models provide a better fit than standard models. They perform particularly well on saturated data and, for that reason, are more robust to phylogenetic artefacts due to the presence of fast-evolving species in the dataset (Lartillot et al., 2007).
However, such profile mixture models were introduced only in a Bayesian context, and were not available in a Maximum Likelihood framework. In addition, thus far, no empirical information was stored a priori in the model concerning the shapes of the profiles. To draw a parallel with standard models, we only implemented the equivalent of the GTR approach, which means that the model could be applied only to large datasets.
Empirical profile mixture models
Here, we introduce a series of empirically determined profile mixture models, with the number of components ranging from 20 to 60. In a way, they are to our previous CAT model what WAG or JTT are to the GTR model : simply, a pre-learnt version of the model, which can now be used for analysing small datasets, while explicitly accounting for site-specific effects.
We demonstrate that these profile mixtures provide a better statistical fit than currently available empirical matrices (WAG, JTT), in particular on saturated data. They have been implemented in the two phylogenetic software packages PhyML and PhyloBayes. Under PhyML, the C20 model is a good compromise between efficiency and accuracy. Under PhyloBayes, on the other hand, all models from C20 to C60 have about the same computational efficiency. Therefore, in that case, it is probably better to always use C50 or C60.