ATGC: PhyML-structure

Accounting for Solvent Accessibility and Secondary Structure in Protein Phylogenetics is Highly Beneficial.

Le S.Q., Gascuel O.

Systematic Biology, 59(3): 277-287, 2010

Input data

Alignments are provided in PHYLIP sequential format, followed by DSSP secondary-structure and solvent-exposure annotations in Stockholm format. PhyML-structure simplifies these annotations as follows:

1. Secondary structure: E (extended, E in DSSP), H (helix, H in DSSP), and other structures (S, T, B, G, I, C, ".", "X", or "?" in DSSP). PhyML-structure treats every structure other than E and H as other (O). X and ? denote unknown values and are handled with a mixture (with CONF/MIX) or with LG (with CONF/LG and PART).
2. Surface accessibility: sites are classified into 10 relative surface accessibility categories, [0-9X], where 0=0%-10%; ...; 9=90%-100%. PhyML-structure treats 0 as buried and [1-9] as exposed.
3. Following Stockholm format, secondary structure and surface accessibility annotations are coded by:
```
     #=GR  SS        Secondary Structure    For protein [HGIEBTSCX]
     #=GR  SA        Surface Accessibility  [0-9X]  (0=0%-10%; ...; 9=90%-100%)
```
In the provided alignments, we include secondary structure information, surface accessibility, and original solvent exposure values.
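
The simplification rules above can be sketched in a few lines of Python. This is only an illustration of the mapping described in this section, not PhyML-structure's own code: the function names are hypothetical, and reporting the SA code X as "unknown" is an assumption, since the text only defines 0 as buried and [1-9] as exposed.

```python
# Illustrative sketch; function names are hypothetical, not PhyML-structure internals.

def simplify_secondary_structure(code: str) -> str:
    """Map one DSSP-style code to E, H, O (other), or ? (unknown)."""
    if code in ("X", "?"):
        return "?"   # unknown: handled by MIX or LG, depending on the mode
    if code in ("E", "H"):
        return code  # extended and helix are kept as they are
    return "O"       # S, T, B, G, I, C, "." are all treated as other


def simplify_accessibility(code: str) -> str:
    """Map one SA category [0-9X] to 'buried', 'exposed', or 'unknown'."""
    if code == "0":
        return "buried"
    if len(code) == 1 and code in "123456789":
        return "exposed"
    return "unknown"  # assumption: X (or anything else) carries no exposure information


if __name__ == "__main__":
    ss = "EEHH.TX"   # made-up annotation strings, for illustration only
    sa = "0139X05"
    print("".join(simplify_secondary_structure(c) for c in ss))
    print([simplify_accessibility(c) for c in sa])
```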

See example.

Model

Amino-acid based models : EX2 (default) | EX3 | EHO | EX_EHO | UL2 | UL3 | LG | WAG | JTT

EX2: two-matrix model corresponding to exposed/buried sites.
EX3: three-matrix model corresponding to highly exposed/intermediate/buried sites*.
EHO: three-matrix model corresponding to extended/helix/other sites.
EX_EHO: six-matrix model corresponding to exposed/buried and extended/helix/other sites.

UL2: two-matrix model learned from data in an unsupervised manner*.
UL3: three-matrix model learned from data in an unsupervised manner*.

(*) The three rate categories of EX3 can be extracted by cutting relative solvent exposure values into [0-0.08], [0.08-0.36] and [0.36-1], for S (slow), M (medium) and F (fast) respectively.
See Le S.Q., Lartillot N., Gascuel O. (2008).
Phylogenetic Mixture Models for Proteins,
Philosophical Transactions of the Royal Society B, Vol. 363 (1512), 3965-3976.
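
As a small illustration of the cut-offs given in the note above, the following Python sketch assigns a relative solvent exposure value to S, M or F. The function name is hypothetical, and the handling of values that fall exactly on a boundary (0.08 goes to M, 0.36 goes to F) is an assumption, since the intervals as written share their end points.

```python
# Illustrative sketch; the function name and boundary handling are assumptions.

def ex3_rate_category(rel_exposure: float) -> str:
    """Return S (slow), M (medium) or F (fast) for a relative exposure in [0, 1]."""
    if not 0.0 <= rel_exposure <= 1.0:
        raise ValueError("relative solvent exposure must lie in [0, 1]")
    if rel_exposure < 0.08:
        return "S"
    if rel_exposure < 0.36:
        return "M"
    return "F"


if __name__ == "__main__":
    for value in (0.02, 0.20, 0.75):
        print(value, ex3_rate_category(value))
```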

Mode

CONF/MIX: combination of PART and MIX with confidence level c. We apply c*PART + (1-c)*MIX to sites with structure information and MIX to sites without structure information. The stat output file includes c and information of MIX (see example for more detail).
CONF/LG: combination of PART and LG with confidence level c. We apply c*PART + (1-c)*LG to sites with structure information and LG to sites without structure information. The stat output file includes c (see example for more detail).
PART: We apply the corresponding models for sites with structure information and LG to sites without structure information (see example for more detail).
MIX: We apply MIX to all sites. The stat output file includes information of MIX (see example for more detail).
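
One way to read the four modes is as different ways of combining per-site likelihoods. The Python sketch below assumes that "c*PART + (1-c)*MIX" denotes a weighted sum of the site likelihoods under the partition model and under the mixture; this interpretation, the function name and its arguments are assumptions made for illustration. For CONF/LG, it follows the -M option summary in using LG for sites without annotations.

```python
# Illustrative sketch; names and the weighted-sum interpretation are assumptions.

def site_likelihood(mode, annotated, lik_part, lik_mix, lik_lg, c=1.0):
    """Combine per-site likelihoods according to the chosen mode.

    annotated: True if the site carries structure information.
    lik_part / lik_mix / lik_lg: site likelihood under the partition model,
    the mixture, and LG. c is the confidence level in [0, 1].
    """
    if mode == "MIX":
        return lik_mix
    if mode == "PART":
        return lik_part if annotated else lik_lg
    if mode == "CONF/MIX":
        return c * lik_part + (1.0 - c) * lik_mix if annotated else lik_mix
    if mode == "CONF/LG":
        return c * lik_part + (1.0 - c) * lik_lg if annotated else lik_lg
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    # Made-up likelihood values, for illustration only.
    print(site_likelihood("CONF/MIX", True, 0.25, 0.75, 0.5, c=0.5))
```

With c = 1 the confidence-based modes reduce to PART on annotated sites, and with c = 0 they reduce to MIX or LG.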

Running PhyML-structure

phyml-structure [command args]

Command options:

--help: instructions
-i (or --input) seq_file_name
seq_file_name is the name of the nucleotide or amino-acid sequence file in PHYLIP format.
-b (or --bootstrap) int
- int > 0 : int is the number of bootstrap replicates.
- int = 0 : neither approximate likelihood ratio test nor bootstrap values are computed.
- int = -1 : approximate likelihood ratio test returning aLRT statistics.
- int = -2 : approximate likelihood ratio test returning Chi2-based parametric branch supports.
- int = -3 : minimum of Chi2-based parametric and SH-like branch supports.
- int = -4 : SH-like branch supports alone (default).
-m (or --model) model
model : substitution model name.
Amino-acid based models : EX2 (default) | EX3 | EHO | EX_EHO | UL2 | UL3 | LG | WAG | JTT
- EX2: two-matrix model corresponding to exposed/buried sites.
- EX3: three-matrix model corresponding to highly exposed/intermediate/buried sites.
- EHO: three-matrix model corresponding to extended/helix/other sites.
- EX_EHO: six-matrix model corresponding to exposed/buried and extended/helix/other sites.
- UL2: two-matrix model learned from data in an unsupervised manner.
- UL3: three-matrix model learned from data in an unsupervised manner.
See Le S.Q., Lartillot N., Gascuel O. (2008).
Phylogenetic Mixture Models for Proteins,
Philosophical Transactions of the Royal Society B, Vol. 363 (1512), 3965-3976.
-M: mode [PART] [MIX] [CONF/MIX] [CONF/LG]
- PART: Partitioning models. LG is used for sites with missing structure notations
- MIX : Mixture models
- CONF/MIX: confidence-based model, using MIX for poorly annotated sites
- CONF/LG : confidence-based model, using LG for poorly annotated sites
Default: CONF/MIX when alignments have structure information.
-y:
Alignments are provided in PHYLIP sequential format, followed by DSSP secondary-structure and solvent-exposure annotations in Stockholm format. PhyML-structure simplifies these annotations as follows:
1. Secondary structure: E (extended, E in DSSP), H (helix, H in DSSP), and other structures (S, T, B, G, I, C, ., X, or ? in DSSP). PhyML-structure treats every structure other than E and H as other (O). X and ? denote unknown values and are handled with a mixture (with CONF/MIX) or with LG (with CONF/LG and PART).
2. Surface accessibility: sites are classified into 10 relative surface accessibility categories, [0-9X], where 0=0%-10%; ...; 9=90%-100%. PhyML-structure treats 0 as buried and [1-9] as exposed.
3. Following Stockholm format, secondary structure and surface accessibility annotations are coded by:
```
     #=GR  SS        Secondary Structure    For protein [HGIEBTSCX]
     #=GR  SA        Surface Accessibility  [0-9X]  (0=0%-10%; ...; 9=90%-100%)
```
-v (or --pinv) prop_invar
prop_invar : proportion of invariable sites.
Can be a fixed value in the [0,1] range or e to get the maximum likelihood estimate. Default [v = 0].
-c (or --nclasses) nb_subst_cat
nb_subst_cat : number of relative substitution rate categories. Default : [nb_subst_cat=4].
Must be a positive integer.
-a (or --alpha) gamma
gamma : shape parameter of the gamma distribution.
Can be a fixed positive value or e to get the maximum likelihood estimate. Default [e].
-s (or --search) move
Tree topology search operation option.
Can be either NNI (default) or SPR.
-u (or --inputtree) user_tree_file
user_tree_file : starting tree filename. The tree must be in Newick format.
-o params
This option focuses on specific parameter optimisation.
- params=tlr : tree topology (t), branch length (l) and rate parameters (r) are optimised.
- params=tl : tree topology and branch length are optimised.
- params=lr : branch length and rate parameters are optimised.
- params=l : branch lengths are optimised.
- params=r : rate parameters are optimised.
- params=n : no parameter is optimised.
--rand_start
This option sets the initial tree to random.
It is only valid if SPR searches are to be performed.
--n_rand_starts num
num is the number of initial random trees to be used.
It is only valid if SPR searches are to be performed.
--r_seed num
num is the seed used to initiate the random number generator.
Must be an integer.
--print_site_lnl
Print the likelihood for each site in file *_phyml_lk.txt.
--print_trace
Print each phylogeny explored during the tree search process in file *_phyml_trace.txt.

PHYLIP-Like interface

You can run phyml with no arguments; in this case, change the value of a parameter by typing its corresponding character as shown on screen.

Examples

    ./PhyML-SS -i Ord0300_2hhi.STR -m EX2 -M PART -c 4 -a e
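
For scripted runs, the command line above can be assembled programmatically. The following Python sketch builds the argument list from the options documented on this page and checks the model and mode names against the documented values. The helper name and its defaults are hypothetical, and the binary name is passed in explicitly because this page refers to the program both as phyml-structure and as ./PhyML-SS.

```python
# Illustrative sketch; build_command is a hypothetical helper, not part of PhyML-structure.
import shlex

MODELS = {"EX2", "EX3", "EHO", "EX_EHO", "UL2", "UL3", "LG", "WAG", "JTT"}
MODES = {"PART", "MIX", "CONF/MIX", "CONF/LG"}


def build_command(seq_file, model="EX2", mode="CONF/MIX",
                  nclasses=4, alpha="e", binary="phyml-structure"):
    """Return the argument list for one run, using only flags listed above."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    return [binary, "-i", seq_file, "-m", model, "-M", mode,
            "-c", str(nclasses), "-a", str(alpha)]


if __name__ == "__main__":
    cmd = build_command("Ord0300_2hhi.STR", mode="PART", binary="./PhyML-SS")
    print(shlex.join(cmd))
    # To launch the run: import subprocess; subprocess.run(cmd, check=True)
```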