ATGC: SDM

SDM: a Fast Distance-based Approach for (Super)Tree Building in Phylogenomics.

Criscuolo A., Berry V., Douzery E.J.P., Gascuel O. Systematic Biology. 2006 55(5):740-755.

Please cite THIS paper if you use SDM.

Running SDM

SDM uses a PHYLIP-like interface (Felsenstein, 1993). The user is asked for the name of the input file.
This input file contains either a collection of k distance matrices in PHYLIP format (lower-triangular or square), or a collection of k trees with branch lengths in NEWICK format (rooted or unrooted, binary or non-binary, but non-bootstrapped). The value of k must be given before the collection. Comments can be written inside the input file on any line that begins with the '%' character.

Here is an example of an input file containing square distance matrices:
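The following Python sketch generates such a file under stated assumptions: the value of k on its own line, followed by each matrix in PHYLIP square format (taxon count, then one row per taxon). The taxon names, the distances, the 10-character name padding, the comment text and the helper names format_square_matrix and format_collection are illustrative only and are not taken from the SDM distribution.

```python
def format_square_matrix(names, dist):
    """Return one matrix in PHYLIP square format as a list of lines."""
    n = len(names)
    lines = [str(n)]  # first line of a PHYLIP matrix: number of taxa
    for i, name in enumerate(names):
        row = " ".join(f"{dist[i][j]:.4f}" for j in range(n))
        # Assumption: names padded to 10 characters, as PHYLIP traditionally does.
        lines.append(f"{name:<10} {row}")
    return lines


def format_collection(matrices, comment=None):
    """Return the text of an input file holding k square matrices."""
    lines = []
    if comment:
        lines.append(f"% {comment}")   # comment lines begin with '%'
    lines.append(str(len(matrices)))   # k is given before the collection
    for names, dist in matrices:
        lines.extend(format_square_matrix(names, dist))
    return "\n".join(lines) + "\n"


# Two hypothetical genes with overlapping taxon sets.
gene1 = (["A", "B", "C"],
         [[0.0, 0.1, 0.2], [0.1, 0.0, 0.3], [0.2, 0.3, 0.0]])
gene2 = (["B", "C", "D"],
         [[0.0, 0.25, 0.4], [0.25, 0.0, 0.35], [0.4, 0.35, 0.0]])

text = format_collection([gene1, gene2], comment="two hypothetical genes")
print(text)
```

The two matrices deliberately cover different taxa, which is the situation SDM is designed for.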

Here is another input file containing trees:
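A comparable Python sketch for tree input, assuming one NEWICK tree per line, each terminated by a semicolon, with the value of k on the line before the collection. The trees, branch lengths, comment text and the helper name format_tree_collection are hypothetical.

```python
def format_tree_collection(newick_trees, comment=None):
    """Return the text of an input file holding k NEWICK trees."""
    lines = []
    if comment:
        lines.append(f"% {comment}")      # comment lines begin with '%'
    lines.append(str(len(newick_trees)))  # k is given before the collection
    for tree in newick_trees:
        tree = tree.strip()
        if not tree.endswith(";"):
            tree += ";"                   # NEWICK trees end with a semicolon
        lines.append(tree)
    return "\n".join(lines) + "\n"


# Two hypothetical unrooted gene trees with branch lengths; the first is
# non-binary (a four-way split at the top level would also be accepted).
trees = [
    "((A:0.10,B:0.20):0.05,C:0.30,D:0.25)",
    "(B:0.15,(C:0.20,E:0.10):0.05,F:0.30);",
]

tree_text = format_tree_collection(trees, comment="two hypothetical gene trees")
print(tree_text)
```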

SDM outputs several files:

The distance supermatrix, called sdm_output, where missing entries (if any) are noted -99.0
The list of gene rates (the (1/α_p) values) as estimated by SDM, called sdm_rates
A table indicating the taxa covered by each gene, called sdm_tab; this file also indicates whether there is at least one distance measurement per taxon pair, in which case the distance supermatrix is complete and has no missing entries. In that case, any tree-building algorithm can be used to infer the supertree (e.g. FastME, recommended). Otherwise, we recommend using MVR* (or BioNJ*) from our PhyD* package.
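Whether the supermatrix is complete can also be checked directly from the -99.0 markers. The Python sketch below assumes a square PHYLIP layout (taxon count on the first line, then one row per taxon with the name followed by the distances); the embedded matrix and the helper names parse_square_matrix and missing_pairs are hypothetical, and a real sdm_output file may be laid out differently depending on the output options chosen.

```python
MISSING = -99.0  # value SDM writes for missing entries


def parse_square_matrix(text):
    """Parse a square PHYLIP-style matrix into (names, rows)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])
    names, rows = [], []
    for line in lines[1:n + 1]:
        fields = line.split()
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:n + 1]])
    return names, rows


def missing_pairs(names, rows):
    """List the taxon pairs whose distance is marked as missing."""
    pairs = []
    for i in range(len(names)):
        for j in range(i):  # lower triangle is enough for a symmetric matrix
            if rows[i][j] == MISSING:
                pairs.append((names[j], names[i]))
    return pairs


# Hypothetical supermatrix in which the A-C distance was never measured.
example = """3
A 0.0 0.1 -99.0
B 0.1 0.0 0.3
C -99.0 0.3 0.0
"""

names, rows = parse_square_matrix(example)
gaps = missing_pairs(names, rows)
print("complete" if not gaps else f"incomplete, missing: {gaps}")
```

An empty result corresponds to the complete case, where any tree-building algorithm applies; a non-empty result corresponds to the case where MVR* or BioNJ* is recommended.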

SDM also provides the deformed source matrices, when option (4) is checked (see below); the corresponding file is called sdm_deformed_matrices.

To run SDM on LINUX, use the command:

java -jar SDM.jar

To run SDM on WINDOWS, double-click on win_SDM.bat

PHYLIP-like interface

A PHYLIP-like menu displays the various options:

[Figure: SDM PHYLIP-like interface]

Options

D Method (SDM, SDM*, ACS97)?
The default option is full SDM. SDM* is a restricted version, which is faster than SDM but does not use all the flexibility of SDM (the a_ip variables are forced to be zero). ACS97 implements the Average Consensus Supertree method, as described by Lapointe and Cucumel (1997).
T Input (Matrices, Trees)?
Option T indicates whether the source data are distance matrices or trees with branch lengths.
L Lower-triangular data matrix?
When the source data are distance matrices, the L option indicates whether they are in lower-triangular or square format.
W Matrix weight? or Tree weight?
SDM (and SDM*) allows a confidence value (weight) to be associated with each source matrix (tree). This value must be written inside the input file, just after and on the same line as the taxon number (with matrices), or on a separate line before each tree. For example:

The length of the sequences from which the data have been inferred is a relevant and statistically well-founded weight. The default gives the same weight to every matrix (or tree).
S Weight matrices (or trees) using their size?
Option S weights matrices (or trees) by the inverse of the taxon number, or by the inverse of the square of the taxon number. This weight is multiplied by the previous confidence value. This option can be used to compensate for the (too) low influence of matrices (or trees) with few taxa.
M Analyse multiple collections?
Option M makes it possible to process multiple collections of matrices (or trees) given one after the other in the input file.
0 Output format (Phylip, T-rex)?
Only a few programs are able to build trees from incomplete distance matrices: FITCH (Felsenstein, 1997) from the PHYLIP package, T-REX (Makarenkov, 2001), and all PhyD* algorithms. FITCH requires the subreplicate (Phylip) format. PhyD* also uses a Phylip format, but subreplicates are not mandatory because missing entries are written as -99.0. The T-REX format is special: missing entries are indicated by -99.0, and the taxa are implicitly numbered and their names are removed. SDM then outputs an extra file called taxa
With complete matrices, a number of other programs can be used, e.g. FastME (Desper and Gascuel, 2002), which uses the Phylip square format (without subreplicates).
This option selects either the Phylip (standard) format or the T-REX format.
1 Output supermatrix in subreplicate format?
Option 1 provides the output file in PHYLIP subreplicate format. This format associates a weight of 0 with missing entries and a weight of 1 with existing entries. This is the format required by FITCH to deal with incomplete distance matrices.
2 Output supermatrix (Lower-triangular, Square)?
Option 2 defines the output format: lower-triangular or square.
3 Write out rates onto file?
Option 3 writes the list of gene rates (the (1/α_p) values) as estimated by SDM (or SDM*) to the file sdm_rates.
4 Write out deformed matrices onto file?
Option 4 writes the deformed source matrices to the file sdm_deformed_matrices.
5 Write out variances onto file?
Option 5 computes the variance of each entry of the distance supermatrix and writes it to the file sdm_output_variance. This variance matrix is useful when running MVR*.
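
As a worked illustration of how options W and S combine, the Python sketch below multiplies a confidence value by the size-based factor described under option S. It only illustrates the arithmetic stated in this document, not SDM's internal code; the function name effective_weight, the mode names and the numbers are hypothetical.

```python
def effective_weight(confidence, n_taxa, size_mode="none"):
    """Combine a confidence value (option W) with a size factor (option S).

    size_mode is a hypothetical label for the three cases in the text:
      "none"           - option S not used
      "inverse"        - weight by 1 / (taxon number)
      "inverse_square" - weight by 1 / (taxon number)^2
    """
    if n_taxa <= 0:
        raise ValueError("n_taxa must be positive")
    if size_mode == "none":
        return confidence
    if size_mode == "inverse":
        return confidence / n_taxa
    if size_mode == "inverse_square":
        return confidence / n_taxa ** 2
    raise ValueError(f"unknown size_mode: {size_mode}")


# Hypothetical gene: distances inferred from 1200 aligned sites, 4 taxa.
for mode in ("none", "inverse", "inverse_square"):
    print(mode, effective_weight(1200, 4, mode))
```

With the sequence length used as the confidence value, a 4-taxon matrix keeps a weight of 1200 without option S, and receives 300 or 75 under the two size-based choices.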