CAT : Empirical profile mixture models for phylogenetic reconstruction.

Le S.Q., Gascuel O., Lartillot N. Bioinformatics. 2008 Oct 15;24(20):2317-23.

Please cite THESE papers if you use CAT.

Profile Mixture Models : Datasets.

Training datasets

We provide three databases : two, derived from HSSP and HOGENOM, were used for learning, and one, derived from TreeBase, was used for testing the models. For each database, we provide the alignments and the phylogenetic trees that were used.

Method to extract the training database from HSSP

Our training alignment database was extracted from HSSP. This database comprises 35,000 alignments of protein families, each usually containing numerous members (550 on average). Each alignment is obtained by aligning a protein with known 3-D structure in the Protein Data Bank (PDB) to all its likely sequence homologues in SWISS-PROT. The protein with known structure is named the "test protein" of the alignment. HSSP is highly redundant : typically, a protein may be the test protein of a given alignment and also belong to all alignments corresponding to its homologues with known structure. Moreover, HSSP alignments often contain a huge number of gaps, mainly due to absent or unsequenced domains in some proteins. We thus performed an intensive cleaning of HSSP to extract independent alignments and, within each alignment, to select sequences and sites corresponding to well-aligned, non-gapped regions. We also selected only globular proteins, discarding membrane proteins, which show clearly different amino-acid replacement patterns. This cleaning process involved three steps :
  1. We first selected, within each HSSP alignment, a subset of sequences satisfying several properties : the subset contains at least 10 sequences ; the number of sites without gaps (within these sequences) is larger than both 100 and twice the number of sequences in the subset ; the fraction of identical residues between any pair of sequences is informative, i.e. neither too low (it must be > 0.40) nor too high (it must be < 0.99). This selection was performed in a greedy way, starting from the test protein and stopping when no further sequence in the alignment satisfied the properties. When no such sequence subset was found, the alignment was discarded. A priority function was calculated for each retained alignment, corresponding to the number of amino-acids in the selected sequence subset.
  2. We extracted a maximal set of independent alignments, based on their priority and using SWISS-PROT identifiers. Selected alignments do not share any common identifier and thus correspond to clearly distinct protein families. This was achieved in a greedy way, starting from the alignment with highest priority, adding at each step the highest priority alignment that is independent from all the already selected alignments, and stopping when no more such alignment was found. This procedure selected 1,771 independent alignments with high priority, i.e. containing a large number of long sequences with few gaps.
  3. Finally, for each of these 1,771 alignments we discarded the sequences that were not selected in the first step, and applied GBLOCKS (Castresana 2000, Mol Biol Evol 17(4):540-52) with default options, to obtain clean sub-alignments with well-conserved blocks, ready to be used in a phylogenetic context. These 1,771 sub-alignments (hereafter simply "alignments") are well suited to estimating amino-acid replacement models for globular proteins, and we used them for several purposes, some outside the scope of this paper. These alignments are also relevant for estimating CAT profiles, mainly because they contain many sequences, but some are too large given the computing time required by our estimation procedure. In this study we thus removed the largest ones, retaining 1,030 sufficiently large alignments with an average of 40 sequences and 253 sites per alignment.
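
The selection criteria of step 1 and the greedy selection of step 2 can be illustrated with a minimal Python sketch. This is not the code used to build the database; it only restates the rules described above under several assumptions. All function names and data structures are hypothetical, gaps are assumed to be written as "-", and sequence identity is assumed to be measured over the columns that are gap-free in the whole subset (the text does not say which columns are compared). Because an alignment that conflicts with the selected set can never become independent later, a single pass in decreasing priority order is equivalent to repeatedly picking the highest-priority independent alignment.

```python
from itertools import combinations


def ungapped_columns(seqs, gap="-"):
    """Indices of columns that contain no gap in any of the given sequences."""
    return [i for i, col in enumerate(zip(*seqs)) if gap not in col]


def identity(a, b, cols):
    """Fraction of identical residues between a and b over the given columns."""
    if not cols:
        return 0.0
    return sum(a[i] == b[i] for i in cols) / len(cols)


def subset_is_valid(seqs, min_seqs=10, min_sites=100, lo=0.40, hi=0.99):
    """Check the step 1 properties for a candidate subset of aligned sequences.

    Assumption: identity is computed over the ungapped columns of the subset.
    """
    n = len(seqs)
    cols = ungapped_columns(seqs)
    # At least 10 sequences; ungapped sites > 100 and > 2 * number of sequences.
    if n < min_seqs or len(cols) <= max(min_sites, 2 * n):
        return False
    # Every pair must be informative: identity strictly between lo and hi.
    return all(lo < identity(a, b, cols) < hi for a, b in combinations(seqs, 2))


def select_independent(alignments):
    """Greedy step 2: keep alignments sharing no identifier, by priority.

    `alignments` maps a name to (priority, set of SWISS-PROT identifiers).
    """
    chosen, used = [], set()
    ranked = sorted(alignments.items(), key=lambda kv: kv[1][0], reverse=True)
    for name, (_priority, ids) in ranked:
        if used.isdisjoint(ids):  # independent from everything selected so far
            chosen.append(name)
            used |= ids
    return chosen


if __name__ == "__main__":
    toy = {
        "a": (300, {"P1", "P2"}),
        "b": (200, {"P2", "P3"}),  # shares P2 with "a", so it is skipped
        "c": (100, {"P4"}),
    }
    print(select_independent(toy))  # ['a', 'c']
```

In this sketch the priority is simply passed in as a number; following step 1, it would be the number of amino-acids in the selected sequence subset.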