PhyD*: Fast NJ-like algorithms to deal with incomplete distance matrices.
Criscuolo A., Gascuel O.
BMC Bioinformatics. 2008, Mar 26;9:166.
Please cite this paper if you use PhyD*.
Running PhyD*
PhyD* uses a PHYLIP-like interface (Felsenstein, 1993). The user is asked for the name of the input file. This file contains one or several distance matrices in PHYLIP format (lower-triangular or square). A missing entry must be written -99.0. Binary values (Subreplicate option) may be placed just after each matrix entry; the value 0 indicates a missing entry and the value 1 a non-missing entry. Comments can be written inside the input file on any line that begins with the '%' character.
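As an illustration of the input format described above, the following Python sketch writes a square matrix with missing entries. It is not part of PhyD*: the function name `format_square_matrix`, the use of `None` for a missing pair, the 10-character name padding (borrowed from classic PHYLIP) and the four-decimal formatting are all assumptions made for this example.

```python
def format_square_matrix(names, dist):
    """Return a PHYLIP-style square matrix as text.

    Hypothetical helper: entries equal to None are written as -99.0,
    the missing-entry value required by the text above.
    """
    lines = [str(len(names))]  # first line: number of taxa
    for i, name in enumerate(names):
        cells = ["-99.0" if d is None else f"{d:.4f}" for d in dist[i]]
        # Assumption: names are padded to 10 characters, as in classic PHYLIP.
        lines.append(f"{name:<10} " + " ".join(cells))
    return "\n".join(lines) + "\n"


names = ["A", "B", "C"]
dist = [
    [0.0, 0.1, None],   # the distance between A and C is missing
    [0.1, 0.0, 0.3],
    [None, 0.3, 0.0],
]
# A comment line starts with '%'.
text = "% toy matrix with one missing pair\n" + format_square_matrix(names, dist)
print(text)
```

The resulting text can be saved to a file and given to PhyD* when it asks for the input file name.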
Here is an example of an input.d file containing square distance matrices:
The same three distance matrices with both lower triangular and subreplicate formats:
PhyD* outputs the phylogenetic tree(s) in NEWICK format to a file called output.t.
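The output file can then be post-processed with a few lines of script. The sketch below splits a text into individual NEWICK strings; it assumes each tree ends with ';' and that no quoted label contains a ';' or a line break. The helper name `split_newick` and the sample trees are made up for this example.

```python
def split_newick(text):
    """Split a text into NEWICK strings, one per tree (each ends with ';')."""
    trees = []
    for chunk in text.split(";"):
        # A tree may span several lines; join them before storing.
        chunk = chunk.replace("\r", "").replace("\n", "").strip()
        if chunk:
            trees.append(chunk + ";")
    return trees


# Invented sample content, standing in for what an output file might hold.
sample = "(A:0.1,B:0.2,C:0.3);\n(A:0.1,(B:0.2,C:0.1):0.05,D:0.3);\n"
trees = split_newick(sample)

# With a real run, one would read the file instead:
# with open("output.t") as fh:
#     trees = split_newick(fh.read())
```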
To run PhyD* with LINUX, use the command:
java -jar PhyDstar.jar
To run PhyD* with WINDOWS, double-click on win_PhyDstar.bat.
PHYLIP-like interface
A PHYLIP-like menu displays the various options:
Options
- D Method (NJ*, UNJ*, BioNJ*, MVR*)?
The four available algorithms NJ*, UNJ*, BioNJ* and MVR* are adaptations of NJ, UNJ, BioNJ and MVR, respectively. Each one reduces to its original algorithm when the input distance matrix is complete and the P option is set to 1 (see below).
When MVR* is selected, the program uses the input distance matrix to compute the variances of the pairwise evolutionary distance estimates. Alternatively, the V option allows the user to select a file containing the variance matrix, in the same format as the distance matrix file. The default algorithm is BioNJ*, which is both simple and fairly accurate with standard evolutionary distances. In a supertree context (i.e. when dealing with multiple-gene datasets), we recommend (1) using SDM to compute the distances and their variances, and (2) analysing the resulting distance and variance matrices with MVR*. With unusual distance matrices, e.g. based on DNA-DNA hybridization or on morphological characters, UNJ* should be preferred.
The difference between these three algorithms (BioNJ*, MVR* and UNJ*) lies in the variance model they use for the distance estimates. BioNJ* uses a model corresponding to single-gene analysis. MVR* uses the SDM variances, which are estimated by accounting for (among other things) the length and the number of sequences. UNJ* is based on the ordinary least-squares model, which can be seen as the null average model. NJ* is provided as well, but it performed worst in our simulation studies.
- P Taxon pairs selected by NJ-like filtering?
These four algorithms rely on several criteria to select the best taxon pair to agglomerate at each step. Most of these criteria are time-consuming to compute. Thus, we first select a few pairs using an NJ-like criterion, which is fast but only moderately accurate, and then apply the other criteria to the selected pairs to find the best one. Our experiments showed that selecting 10-20 taxon pairs is usually enough to obtain very good performance; the default is therefore 15. Increasing this value should not change the output much, but will increase the running time. Decreasing it speeds up the computation at the expense of accuracy.
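The pre-selection step can be pictured with the classic NJ criterion on a complete matrix. The Python sketch below is only an illustration under that assumption: it uses the textbook formula Q(i, j) = (n - 2) d(i, j) - r(i) - r(j), where r(i) is the sum of row i, and keeps the k lowest-scoring pairs. The criterion PhyD* actually uses for incomplete matrices is not given here, and the function name `nj_preselect` is hypothetical.

```python
import heapq
from itertools import combinations


def nj_preselect(dist, k=15):
    """Return the k taxon pairs with the lowest classic NJ Q-value.

    Assumes a complete, symmetric distance matrix (list of lists).
    This is the textbook NJ criterion, not the PhyD* source code.
    """
    n = len(dist)
    row_sums = [sum(row) for row in dist]
    scored = (
        ((n - 2) * dist[i][j] - row_sums[i] - row_sums[j], (i, j))
        for i, j in combinations(range(n), 2)
    )
    # nsmallest keeps only the k best candidates, lowest Q first.
    return [pair for _, pair in heapq.nsmallest(k, scored)]


# Additive toy distances for the tree ((A,B),(C,D)), invented for this example.
d = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 3],
    [6, 7, 3, 0],
]
candidates = nj_preselect(d, k=2)  # the two cherries (A,B) and (C,D)
```

In this picture, the slower criteria would only be evaluated on `candidates`, and k = 1 amounts to plain NJ pair selection.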
- N Negative branch lengths allowed?
When negative branch lengths are not allowed, this option sets all negative branch lengths to 0.
- B Binary tree?
If this option is set to No, all zero-length branches are collapsed into multifurcations in the output tree. It should be combined with the previous option to collapse all negative and zero-length branches.
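The combined effect of the N and B options can be sketched on a small tree structure. The Python code below is a minimal illustration, not the PhyD* implementation: the `Node` class and the three helper functions are hypothetical. It first sets negative lengths to 0, then merges every internal node whose branch has zero length into its parent.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str = ""
    length: float = 0.0  # length of the branch above this node
    children: list = field(default_factory=list)


def clamp_negative(node):
    """Set every negative branch length to 0."""
    node.length = max(node.length, 0.0)
    for child in node.children:
        clamp_negative(child)


def collapse_zero(node):
    """Merge internal children with zero-length branches into their parent."""
    merged = []
    for child in node.children:
        collapse_zero(child)
        if child.children and child.length == 0.0:
            merged.extend(child.children)  # this creates a multifurcation
        else:
            merged.append(child)
    node.children = merged


def to_newick(node, is_root=True):
    """Serialize the tree; the root carries no branch length."""
    if not node.children:
        return f"{node.name}:{node.length}"
    inner = ",".join(to_newick(c, False) for c in node.children)
    return f"({inner});" if is_root else f"({inner}):{node.length}"


# Toy tree with one negative internal branch, invented for this example.
root = Node(children=[
    Node("A", 0.1),
    Node("B", 0.2),
    Node(length=-0.01, children=[Node("C", 0.3), Node("D", 0.4)]),
])
clamp_negative(root)
collapse_zero(root)
result = to_newick(root)
```

Running the two passes in this order is what removes the negative branch as well: it becomes a zero-length branch first and is then collapsed.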
- O Outgroup root?
By default, PhyD* outputs unrooted trees. When the O option is turned on, PhyD* prompts for the name of the species to be used to root the tree. It then returns a tree with a trifurcation at its base, the root species being one of the elements of this trifurcation. In fact, when the O option is turned off, PhyD* uses the first taxon in the distance matrix to root the tree, just as all PHYLIP programs do.
- L Lower-triangular data matrix?
This option indicates whether the distance matrix is input in square or lower-triangular form (the lower-left half of the distance matrix only, without the zero diagonal elements). The default is lower-triangular.
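The relation between the two layouts can be shown with a short conversion script. The Python sketch below turns one square matrix into its lower-triangular form. It assumes one matrix row per line, whitespace-separated fields and taxon names without spaces; the function name `square_to_lower` and the toy matrix are made up for this example.

```python
def square_to_lower(text):
    """Convert one square PHYLIP-style matrix to lower-triangular form."""
    rows = [
        line.split()
        for line in text.splitlines()
        if line.strip() and not line.startswith("%")  # skip blanks and comments
    ]
    n = int(rows[0][0])  # first line: number of taxa
    out = [str(n)]
    for i, row in enumerate(rows[1:n + 1]):
        name, values = row[0], row[1:]
        # Row i keeps only the i entries to the left of the diagonal.
        out.append(" ".join([name] + values[:i]))
    return "\n".join(out) + "\n"


square = (
    "3\n"
    "A 0.0 0.1 -99.0\n"
    "B 0.1 0.0 0.3\n"
    "C -99.0 0.3 0.0\n"
)
lower = square_to_lower(square)
print(lower)
```

The first taxon row carries a name only, since the diagonal and everything to its right are dropped.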
- S Subreplicate?
This option is to be used if the input distance matrix is in subreplicate format (see above).
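To illustrate the subreplicate layout, the Python sketch below appends the binary flag after each entry of one matrix row, writing 0 for a missing entry and 1 otherwise. The helper name `with_subreplicates` is hypothetical, and keeping the -99.0 value in front of a 0 flag is an assumption of this example; the text does not say which value should accompany a 0 flag.

```python
def with_subreplicates(values, missing="-99.0"):
    """Follow each entry with 1 (present) or 0 (missing)."""
    out = []
    for value in values:
        out.append(value)
        # Compare numerically so that "-99" and "-99.0" are treated alike.
        out.append("0" if float(value) == float(missing) else "1")
    return out


row = ["0.1", "-99.0", "0.3"]  # entries of one matrix row, as text
flagged = with_subreplicates(row)
line = " ".join(flagged)
```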
- M Analyse multiple matrices?
This option allows several matrices, given one after the other in the input file, to be processed in a single run. The output file then provides the corresponding trees in the same order.