Accounting for Solvent Accessibility and Secondary Structure in Protein Phylogenetics is Highly Beneficial.
Le S.Q., Gascuel O.
Systematic Biology, 59(3): 277-287, 2010
Input data
Alignments are provided in PHYLIP sequential format, followed by DSSP secondary-structure and solvent-exposure annotations in Stockholm format. PhyML-structure simplifies these annotations as follows:
-
Secondary structure: E (extended, E in DSSP), H (helix, H in DSSP), and other (O), which covers every remaining DSSP code: S, T, B, G, I, C, ".", "X", or "?". X and ? denote unknown values and are handled with a mixture (under CONF/MIX) or with LG (under CONF/LG and PART).
-
Surface accessibility: sites are classified into 10 relative surface accessibility categories, [0-9X], where 0=0%-10%; ...; 9=90%-100%. PhyML-structure treats 0 as buried and [1-9] as exposed.
-
Following Stockholm format, Secondary Structure and Surface Accessibility notations are coded by
#=GR SS Secondary Structure For protein [HGIEBTSCX]
#=GR SA Surface Accessibility [0-9X] (0=0%-10%; ...; 9=90%-100%)
In the provided alignments, we include secondary structure information, surface accessibility, and original solvent exposure values.
See
example.
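The simplification rules above can be sketched in a few lines of Python. This is only an illustration of the mapping as described in this section, not code taken from PhyML-structure; the function names simplify_ss and simplify_sa are hypothetical, and returning "?" for unknown values is an assumption made for the example.

```python
# Illustrative sketch of the annotation simplification described above.
# simplify_ss / simplify_sa are hypothetical names, not PhyML-structure functions.

def simplify_ss(code: str) -> str:
    """Map one DSSP secondary-structure code to E, H, O, or ? (unknown)."""
    if code in ("E", "H"):
        return code
    if code in ("X", "?"):
        return "?"   # unknown value: handled by MIX or LG, depending on the mode
    return "O"       # S, T, B, G, I, C, "." and any other code count as "other"


def simplify_sa(code: str) -> str:
    """Map one surface-accessibility code [0-9X] to buried, exposed, or ?."""
    if code == "0":
        return "buried"    # 0%-10% relative accessibility
    if code in set("123456789"):
        return "exposed"   # 10%-100% relative accessibility
    return "?"             # X or anything unrecognised: unknown


if __name__ == "__main__":
    ss_line = "HHHEE.TX"   # made-up annotation strings for the demonstration
    sa_line = "0139X052"
    print("".join(simplify_ss(c) for c in ss_line))
    print([simplify_sa(c) for c in sa_line])
```

Each site then carries one of E/H/O and one of buried/exposed, which is the information the EX2, EHO and EX_EHO models described in the next section rely on.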
Model
Amino-acid based models : EX2 (default) | EX3 | EHO | EX_EHO | UL2 | UL3 | LG | WAG | JTT
-
EX2: two-matrix model corresponding to exposed/buried sites.
-
EX3: three-matrix model corresponding to highly exposed/intermediate/buried sites*.
-
EHO: three-matrix model corresponding to extended/helix/other sites.
-
EX_EHO: six-matrix model corresponding to exposed/buried and extended/helix/other sites.
-
UL2: two-matrix model learned from the data in an unsupervised manner*.
-
UL3: three-matrix model learned from the data in an unsupervised manner*.
(*) The three rate categories of EX3 can be obtained by cutting relative solvent exposure values into [0-0.08], [0.08-0.36] and [0.36-1] for S (slow), M (medium) and F (fast).
See Le S.Q., Lartillot N., Gascuel O. (2008).
Phylogenetic Mixture Models for Proteins,
Philosophical Transactions of the Royal Society B, Vol. 363 (1512), 3965-3976.
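As a rough illustration of the EX3 cut points given in the footnote, the following Python sketch assigns a relative solvent exposure value to S, M or F. The function name ex3_category is hypothetical, and because the stated intervals share their end points, assigning 0.08 to M and 0.36 to F is an assumption made for the example.

```python
# Illustrative sketch of the EX3 cut points from the footnote above.
# ex3_category is a hypothetical name; boundary handling is an assumption.

def ex3_category(rel_exposure: float) -> str:
    """Return S (slow), M (medium) or F (fast) for a value in [0, 1]."""
    if not 0.0 <= rel_exposure <= 1.0:
        raise ValueError("relative solvent exposure must lie in [0, 1]")
    if rel_exposure < 0.08:
        return "S"   # [0, 0.08)
    if rel_exposure < 0.36:
        return "M"   # [0.08, 0.36)
    return "F"       # [0.36, 1]


if __name__ == "__main__":
    for value in (0.05, 0.20, 0.50):
        print(value, ex3_category(value))
```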
Mode
-
CONF/MIX: combination of PART and MIX with confidence level c. We apply c PART + (1-c) MIX to sites with structure information and MIX to sites without structure information. The stat output file includes c and information on MIX
(see example for more detail).
-
CONF/LG: combination of PART and LG with confidence level c. We apply c PART + (1-c) LG to sites with structure information and LG to sites without structure information. The stat output file includes c (see example for more detail).
-
PART: We apply the corresponding models for sites with structure information and LG to sites without structure information (see example for more detail).
-
MIX: We apply MIX to all sites. The stat output file includes information of MIX (see example for more detail).
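One way to read the four modes is as different ways of combining per-site likelihoods. The Python sketch below follows that reading; it is an interpretation of the descriptions above, not the PhyML-structure implementation. All names (mix_likelihood, site_likelihood, matrix_lks, weights, lg_lk, category) are hypothetical, and it assumes that CONF/LG falls back to LG for sites without structure information, as the -M option summary states.

```python
# Illustrative sketch: per-site likelihood under each mode.
# Interpretation of the mode descriptions above, not PhyML-structure code.
from typing import Optional, Sequence


def mix_likelihood(matrix_lks: Sequence[float], weights: Sequence[float]) -> float:
    """Mixture likelihood: weighted sum of the site likelihood under each matrix."""
    return sum(w * lk for w, lk in zip(weights, matrix_lks))


def site_likelihood(mode: str,
                    matrix_lks: Sequence[float],
                    weights: Sequence[float],
                    lg_lk: float,
                    category: Optional[int],
                    c: float = 1.0) -> float:
    """Likelihood of one site.

    matrix_lks: likelihood of the site under each matrix of the model
    weights:    mixture weights of these matrices (assumed to sum to 1)
    lg_lk:      likelihood of the site under LG
    category:   index of the matrix matching the site's annotation,
                or None when the site has no structure information
    c:          confidence level in the annotation
    """
    mix = mix_likelihood(matrix_lks, weights)
    if mode == "MIX":
        return mix
    if mode == "PART":
        return lg_lk if category is None else matrix_lks[category]
    if mode == "CONF/MIX":
        if category is None:
            return mix
        return c * matrix_lks[category] + (1 - c) * mix
    if mode == "CONF/LG":
        if category is None:
            return lg_lk
        return c * matrix_lks[category] + (1 - c) * lg_lk
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    lks, w = [0.2, 0.4], [0.5, 0.5]   # made-up numbers for the demonstration
    for mode in ("PART", "MIX", "CONF/MIX", "CONF/LG"):
        print(mode, site_likelihood(mode, lks, w, lg_lk=0.1, category=0, c=0.8))
```

With c = 1 the confidence-based modes reduce to PART on annotated sites, and with c = 0 they reduce to MIX or LG respectively, which matches the idea of c as a confidence level in the structural annotation.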
Running PhyML-structure
phyml-structure [command args]
Command options:
-
--help: instructions
-
-i (or --input) seq_file_name
seq_file_name is the name of the nucleotide or amino-acid sequence file in PHYLIP format.
-
-b (or --bootstrap) int
-
int > 0 : int is the number of bootstrap replicates.
-
int = 0 : neither approximate likelihood ratio test nor bootstrap values are computed.
-
int = -1 : approximate likelihood ratio test returning aLRT statistics.
-
int = -2 : approximate likelihood ratio test returning Chi2-based parametric branch supports.
-
int = -3 : minimum of Chi2-based parametric and SH-like branch supports.
-
int = -4 : SH-like branch supports alone (default).
-
-m (or --model) model
model : substitution model name.
Amino-acid based models : EX2 (default) | EX3 | EHO | EX_EHO | UL2 | UL3 | LG | WAG | JTT
-
EX2: two-matrix model corresponding to exposed/buried sites.
-
EX3: three-matrix model corresponding to highly exposed/intermediate/buried sites.
-
EHO: three-matrix model corresponding to extended/helix/other sites.
-
EX_EHO: six-matrix model corresponding to exposed/buried and extended/helix/other sites.
-
UL2: two-matrix model learned from the data in an unsupervised manner.
-
UL3: three-matrix model learned from the data in an unsupervised manner.
See Le S.Q., Lartillot N., Gascuel O. (2008).
Phylogenetic Mixture Models for Proteins,
Philosophical Transactions of the Royal Society B, Vol. 363 (1512), 3965-3976.
-
-M: mode [PART] [MIX] [CONF/MIX] [CONF/LG]
-
PART: Partitioning models. LG is used for sites with missing structure notations
-
MIX : Mixture models
-
CONF/MIX: confidence-based model, using MIX for poorly annotated sites
-
CONF/LG : confidence-based model, using LG for poorly annotated sites
Default: CONF/MIX when alignments have structure information.
-
-y:
Alignments are provided in PHYLIP sequential format, followed by DSSP secondary-structure and solvent-exposure annotations in Stockholm format. PhyML-structure simplifies these annotations as follows:
-
Secondary structure: E (extended, E in DSSP), H (helix, H in DSSP), and other (O), which covers every remaining DSSP code: S, T, B, G, I, C, ".", "X", or "?". X and ? denote unknown values and are handled with a mixture (under CONF/MIX) or with LG (under CONF/LG and PART).
-
Surface accessibility: sites are classified into 10 relative surface accessibility categories, [0-9X], where 0=0%-10%; ...; 9=90%-100%. PhyML-structure treats 0 as buried and [1-9] as exposed.
-
Following Stockholm format, Secondary Structure and Surface Accessibility notations are coded by
#=GR SS Secondary Structure For protein [HGIEBTSCX]
#=GR SA Surface Accessibility [0-9X] (0=0%-10%; ...; 9=90%-100%)
-
-v (or --pinv) prop_invar
prop_invar : proportion of invariable sites.
Can be a fixed value in the [0,1] range or e to get the maximum likelihood estimate. Default [v = 0].
-
-c (or --nclasses) nb_subst_cat
nb_subst_cat : number of relative substitution rate categories. Default : [nb_subst_cat=4].
Must be a positive integer.
-
-a (or --alpha) gamma
gamma : shape parameter of the gamma distribution.
Can be a fixed positive value or e to get the maximum likelihood estimate. Default [e].
-
-s (or --search) move
Tree topology search operation option.
Can be either NNI (default) or SPR.
-
-u (or --inputtree) user_tree_file
user_tree_file : starting tree filename. The tree must be in Newick format.
-
-o params
This option focuses on specific parameter optimisation.
-
params=tlr : tree topology (t), branch length (l) and rate parameters (r) are optimised.
-
params=tl : tree topology and branch length are optimised.
-
params=lr : branch length and rate parameters are optimised.
-
params=l : branch lengths are optimised.
-
params=r : rate parameters are optimised.
-
params=n : no parameter is optimised.
-
--rand_start
This option sets the initial tree to random.
It is only valid if SPR searches are to be performed.
-
--n_rand_starts num
num is the number of initial random trees to be used.
It is only valid if SPR searches are to be performed.
-
--r_seed num
num is the seed used to initiate the random number generator.
Must be an integer.
-
--print_site_lnl
Print the likelihood for each site in file *_phyml_lk.txt.
-
--print_trace
Print each phylogeny explored during the tree search process in file *_phyml_trace.txt.
PHYLIP-Like interface
You can run phyml with no arguments. In this case, change the value of a parameter by typing its corresponding character as shown on screen.
Examples
./PhyML-SS -i Ord0300_2hhi.STR -m EX2 -M PART -c 4 -a e
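For scripted runs, a command line like the one above can be assembled programmatically. The Python sketch below only builds and prints the argument list from options documented on this page; build_command is a hypothetical helper, and the executable name is whatever the binary is called on your system (this page shows both phyml-structure and ./PhyML-SS).

```python
# Illustrative sketch: assemble a PhyML-structure command line from the
# options documented above. build_command is a hypothetical helper.
import shlex

MODELS = {"EX2", "EX3", "EHO", "EX_EHO", "UL2", "UL3", "LG", "WAG", "JTT"}
MODES = {"PART", "MIX", "CONF/MIX", "CONF/LG"}


def build_command(executable, alignment, model="EX2", mode="CONF/MIX",
                  nclasses=4, alpha="e", bootstrap=None):
    """Return the argument list for one run, using only documented options."""
    if model not in MODELS:
        raise ValueError(f"unknown model: {model}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    args = [executable, "-i", alignment, "-m", model, "-M", mode,
            "-c", str(nclasses), "-a", str(alpha)]
    if bootstrap is not None:
        args += ["-b", str(bootstrap)]   # e.g. -4 for SH-like supports
    return args


if __name__ == "__main__":
    cmd = build_command("./PhyML-SS", "Ord0300_2hhi.STR", model="EX2", mode="PART")
    print(shlex.join(cmd))
    # The list can be passed to subprocess.run(cmd, check=True) to launch the run.
```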