Accounting for Solvent Accessibility and Secondary Structure in Protein Phylogenetics is Highly Beneficial.
Le S.Q., Gascuel O.
Systematic Biology 2010 (in press)
Input data
Alignments are provided in PHYLIP sequential format, followed by DSSP secondary-structure and solvent-exposure annotations in Stockholm format. PhyML-structure simplifies these annotations as follows:
-
Secondary structure: E (extended, E in DSSP), H (helix, H in DSSP), and other structures (S, T, B, G, I, C, ".", "X", or "?" in DSSP). PhyML-structure treats every structure other than E and H as other (O). X and ? denote unknown values and are handled with a mixture (in CONF/MIX mode) or with LG (in CONF/LG and PART modes).
-
Surface accessibility: sites are classified into 10 relative surface accessibility categories [0-9X], where 0=0%-10%; ...; 9=90%-100%. PhyML-structure treats 0 as buried and [1-9] as exposed.
-
Following Stockholm format, Secondary Structure and Surface Accessibility notations are coded by
#=GR SS Secondary Structure For protein [HGIEBTSCX]
#=GR SA Surface Accessibility [0-9X] (0=0%-10%; ...; 9=90%-100%)
In the provided alignments, we include secondary structure information, surface accessibility, and original solvent exposure values.
See example.
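The simplification described above can be sketched in a few lines of Python. This is only an illustration, not PhyML-structure's own code: the function names are hypothetical, and treating X (or any non-digit) in the surface accessibility track as unknown is an assumption.

```python
# Illustrative sketch of the annotation simplification; names are hypothetical.
UNKNOWN = "?"
EXPOSED_CODES = set("123456789")

def simplify_ss(code: str) -> str:
    """Map a DSSP secondary-structure code to E, H, O, or unknown."""
    if code in ("E", "H"):
        return code
    if code in ("X", "?"):
        return UNKNOWN      # unknown value
    return "O"              # S, T, B, G, I, C and "." all count as "other"

def simplify_sa(code: str) -> str:
    """Map a surface-accessibility code [0-9X] to buried, exposed, or unknown."""
    if code == "0":
        return "buried"
    if code in EXPOSED_CODES:
        return "exposed"
    return UNKNOWN          # assumption: X (or anything else) means unknown

ss_track = "EEHHGT.X"
sa_track = "0123459X"
print([simplify_ss(c) for c in ss_track])
print([simplify_sa(c) for c in sa_track])
```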
Model
Amino-acid based models : EX2 (default) | EX3 | EHO | EX_EHO | UL2 | UL3 | LG | WAG | JTT
-
EX2: two-matrix model corresponding to exposed/buried sites.
-
EX3: three-matrix model corresponding to highly exposed/intermediate/buried sites*.
-
EHO: three-matrix model corresponding to extended/helix/other sites.
-
EX_EHO: six-matrix model corresponding to exposed/buried and extended/helix/other sites.
-
UL2: two-matrix model learned in an unsupervised manner*.
-
UL3: three-matrix model learned in an unsupervised manner*.
(*) We could extract the three rate categories of EX3 by cutting relative solvent exposure values into [0, 0.08], [0.08, 0.36] and [0.36, 1] for S (slow), M (medium) and F (fast).
See Le S.Q., Lartillot N., Gascuel O. (2008).
Phylogenetic Mixture Models for Proteins.
Philosophical Transactions of the Royal Society B, Vol. 363 (1512), 3965-3976.
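As a minimal sketch of the cut-offs given in the footnote above, the following Python function assigns a relative solvent exposure value to S, M or F. The function name is hypothetical, and because the intervals share their end points, sending a boundary value (0.08 or 0.36) to the upper category is an assumption.

```python
def exposure_category(rel_exposure: float) -> str:
    """Return S (slow), M (medium) or F (fast) for a relative exposure in [0, 1]."""
    if not 0.0 <= rel_exposure <= 1.0:
        raise ValueError("relative solvent exposure must lie in [0, 1]")
    if rel_exposure < 0.08:
        return "S"
    if rel_exposure < 0.36:   # assumption: 0.08 itself falls in M
        return "M"
    return "F"                # assumption: 0.36 itself falls in F

for value in (0.02, 0.20, 0.75):
    print(value, exposure_category(value))
```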
Mode
-
CONF/MIX: combination of PART and MIX with confidence level c. We apply c·PART + (1-c)·MIX to sites with structure information and MIX to sites without structure information. The stat output file includes c and information on MIX (see example for more detail).
-
CONF/LG: combination of PART and LG with confidence level c. We apply c·PART + (1-c)·LG to sites with structure information and LG to sites without structure information. The stat output file includes c (see example for more detail).
-
PART: We apply the model matching each site's structural category to sites with structure information and LG to sites without structure information (see example for more detail).
-
MIX: We apply MIX to all sites. The stat output file includes information on MIX (see example for more detail).
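To make the CONF/MIX weighting concrete, here is a small Python sketch. It assumes that the combination c·PART + (1-c)·MIX is applied to per-site likelihoods; the function names and the numerical values are purely illustrative and are not taken from PhyML-structure.

```python
import math

def conf_mix_site_likelihood(c, lk_part, lk_mix, has_structure):
    """Combine PART and MIX likelihoods for one site under confidence level c."""
    if not 0.0 <= c <= 1.0:
        raise ValueError("confidence level c must lie in [0, 1]")
    if has_structure:
        return c * lk_part + (1.0 - c) * lk_mix
    return lk_mix  # sites without structure information use MIX only

def log_likelihood(c, sites):
    """Sum log-likelihoods over sites given as (lk_part, lk_mix, has_structure)."""
    return sum(math.log(conf_mix_site_likelihood(c, p, m, s)) for p, m, s in sites)

# Made-up values: two annotated sites and one site without annotation.
sites = [(0.020, 0.010, True), (0.004, 0.006, True), (0.0, 0.003, False)]
print(log_likelihood(0.8, sites))
```

With c = 1 the annotated sites reduce to PART, and with c = 0 every site reduces to MIX.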
Running PhyML-structure
phyml-structure [command args]
Command options:
-
--help: instructions
-
-i (or --input) seq_file_name
seq_file_name is the name of the nucleotide or amino-acid sequence file in PHYLIP format.
-
-b (or --bootstrap) int
-
int > 0 : int is the number of bootstrap replicates.
-
int = 0 : neither approximate likelihood ratio test nor bootstrap values are computed.
-
int = -1 : approximate likelihood ratio test returning aLRT statistics.
-
int = -2 : approximate likelihood ratio test returning Chi2-based parametric branch supports.
-
int = -3 : minimum of Chi2-based parametric and SH-like branch supports.
-
int = -4 : SH-like branch supports alone (default).
-
-m (or --model) model
model : substitution model name.
Amino-acid based models : EX2 (default) | EX3 | EHO | EX_EHO | UL2 | UL3 | LG | WAG | JTT
-
EX2: two-matrix model corresponding to exposed/buried sites.
-
EX3: three-matrix model corresponding to highly exposed/intermediate/buried sites.
-
EHO: three-matrix model corresponding to extended/helix/other sites.
-
EX_EHO: six-matrix model corresponding to exposed/buried and extended/helix/other sites.
-
UL2: two-matrix model learned in an unsupervised manner.
-
UL3: three-matrix model learned in an unsupervised manner.
See Le S.Q., Lartillot N., Gascuel O. (2008).
Phylogenetic Mixture Models for Proteins.
Philosophical Transactions of the Royal Society B, Vol. 363 (1512), 3965-3976.
-
-M: mode [PART] [MIX] [CONF/MIX] [CONF/LG]
-
PART: Partitioning models. LG is used for sites with missing structure annotations.
-
MIX : Mixture models
-
CONF/MIX: confidence-based model, using MIX for poorly annotated sites
-
CONF/LG : confidence-based model, using LG for poorly annotated sites
Default: CONF/MIX when alignments have structure information.
-
-y:
Alignments are provided in PHYLIP sequential format, followed by DSSP secondary-structure and solvent-exposure annotations in Stockholm format. PhyML-structure simplifies these annotations as follows:
-
Secondary structure: E (extended, E in DSSP), H (helix, H in DSSP), and other structures (S, T, B, G, I, C, ".", "X", or "?" in DSSP). PhyML-structure treats every structure other than E and H as other (O). X and ? denote unknown values and are handled with a mixture (in CONF/MIX mode) or with LG (in CONF/LG and PART modes).
-
Surface accessibility: sites are classified into 10 relative surface accessibility categories [0-9X], where 0=0%-10%; ...; 9=90%-100%. PhyML-structure treats 0 as buried and [1-9] as exposed.
-
Following Stockholm format, Secondary Structure and Surface Accessibility notations are coded by
#=GR SS Secondary Structure For protein [HGIEBTSCX]
#=GR SA Surface Accessibility [0-9X] (0=0%-10%; ...; 9=90%-100%)
-
-v (or --pinv) prop_invar
prop_invar : proportion of invariable sites.
Can be a fixed value in the [0,1] range or e to get the maximum likelihood estimate. Default [v = 0].
-
-c (or --nclasses) nb_subst_cat
nb_subst_cat : number of relative substitution rate categories. Default : [nb_subst_cat=4].
Must be a positive integer.
-
-a (or --alpha) gamma
gamma : shape parameter of the gamma distribution.
Can be a fixed positive value or e to get the maximum likelihood estimate. Default [e].
-
-s (or --search) move
Tree topology search operation option.
Can be either NNI (default) or SPR.
-
-u (or --inputtree) user_tree_file
user_tree_file : starting tree filename. The tree must be in Newick format.
-
-o params
This option focuses on specific parameter optimisation.
-
params=tlr : tree topology (t), branch lengths (l) and rate parameters (r) are optimised.
-
params=tl : tree topology and branch lengths are optimised.
-
params=lr : branch lengths and rate parameters are optimised.
-
params=l : branch lengths are optimised.
-
params=r : rate parameters are optimised.
-
params=n : no parameter is optimised.
-
--rand_start
This option sets the initial tree to random.
It is only valid if SPR searches are to be performed.
-
--n_rand_starts num
num is the number of initial random trees to be used.
It is only valid if SPR searches are to be performed.
-
--r_seed num
num is the seed used to initiate the random number generator.
Must be an integer.
-
--print_site_lnl
Print the likelihood for each site in file *_phyml_lk.txt.
-
--print_trace
Print each phylogeny explored during the tree search process in file *_phyml_trace.txt.
PHYLIP-Like interface
You can run phyml with no arguments; in that case, change the value of a parameter by typing its corresponding character as shown on screen.
Examples
./PhyML-SS -i Ord0300_2hhi.STR -m EX2 -M PART -c 4 -a e
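For scripted runs, the example command can be assembled from Python before handing it to a process launcher. The sketch below is only one possible way to do this: the helper name build_command and its defaults are hypothetical, the executable path is copied from the example above, and only options documented on this page are used. The command is printed rather than executed.

```python
import shlex

# Values accepted by -m and -M, as listed in the option descriptions above.
MODELS = {"EX2", "EX3", "EHO", "EX_EHO", "UL2", "UL3", "LG", "WAG", "JTT"}
MODES = {"PART", "MIX", "CONF/MIX", "CONF/LG"}

def build_command(seq_file, model="EX2", mode="PART", nclasses=4, alpha="e",
                  executable="./PhyML-SS"):
    """Return the argument list for one run (hypothetical helper)."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if nclasses < 1:
        raise ValueError("number of rate categories must be a positive integer")
    return [executable, "-i", seq_file, "-m", model, "-M", mode,
            "-c", str(nclasses), "-a", str(alpha)]

cmd = build_command("Ord0300_2hhi.STR")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the executable and the alignment file are in place.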