PhyML 3.0 Benchmarks
Medium-size data sets
Comparison of PhyML 3.0 tree search options and RAxML, using 100 DNA and protein alignments extracted from Treebase.
Distribution of relative computing times: for each of the two sets of alignments (50 DNA and 50 protein medium-size alignments), we measured the base-2 logarithm of the ratio between the computing time of the given method and that of the fastest approach on the same alignment. Thus, a log-ratio equal to X corresponds to a method being 2^X times slower than the fastest approach; e.g., with DNA alignments PhyML 2.4.5 NNI is roughly twice as fast as PhyML 3.0 NNI, whereas the two run at about the same speed with protein alignments.
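The log-ratio described above can be sketched in a few lines of Python. This is only an illustration: the function name `log2_time_ratios`, the method labels and the timings are hypothetical and are not taken from the benchmark.

```python
import math


def log2_time_ratios(times):
    """Return log2(time / fastest time) for each method on one alignment.

    `times` maps a method name to its computing time (any unit, > 0).
    """
    fastest = min(times.values())
    return {method: math.log2(t / fastest) for method, t in times.items()}


# Hypothetical timings in seconds for a single alignment (not benchmark data).
example_times = {"method_A": 30.0, "method_B": 60.0, "method_C": 240.0}
ratios = log2_time_ratios(example_times)
for method, value in ratios.items():
    print(f"{method}: log-ratio = {value:.2f} (2^{value:.2f} times the fastest)")
```

The fastest method always gets a log-ratio of 0, and a value of 1 means the method took twice as long as the fastest one.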
DNA             Av. LogLk rank  Delta>5  P-value<0.05  Av. RF distance
PhyML 2.4.5     5.48            34       4             0.3
PhyML 3.0 NNI   5.18            33       5             0.28
PhyML 3.0 SPR   2.78            2        0             0.15
PhyML 3.0 BEST  2.7             2        0             0.15
PhyML 3.0 RAND  1.64            0        0             0.03
RAxML           3.22            3        2             0.2
PROTEIN         Av. LogLk rank  Delta>5  P-value<0.05  Av. RF distance
PhyML 2.4.5     5.05            21       1             0.26
PhyML 3.0 NNI   4.33            20       1             0.24
PhyML 3.0 SPR   3.24            5        0             0.14
PhyML 3.0 BEST  3.16            4        0             0.14
PhyML 3.0 RAND  2.35            0        0             0.03
RAxML           2.86            0        0             0.08
Comparison of log-likelihoods on 50 DNA and 50 protein medium-size data sets. The column ‘Av. LogLk rank’ gives the average log-likelihood rank of each method. These ranks are corrected by taking into account information on tree topologies. ‘Delta>5’ gives the number of cases (among 50) for which the drop in log-likelihood between the highest log-likelihood for the corresponding data set and the method of interest is greater than 5. The column ‘P-value<0.05’ displays the number of cases for which the difference in log-likelihood between the method of interest and the corresponding highest log-likelihood is statistically significant (SH test). The ‘Av. RF distance’ values are the average Robinson and Foulds topological distances between the trees estimated by the method of interest and the corresponding most likely trees (0 corresponds to identical trees, while 1 means that the two trees do not have any clade in common).
Data sets
The benchmark contains 50 protein alignments and 50 DNA alignments.

DNA alignments
We selected the 50 most recent alignments from Treebase with at least 50 sequences, fewer than 200 sequences and fewer than 2000 sites.
 Protein alignments
We selected the 50 most recent alignments from Treebase with at least 5 sequences, fewer than 200 sequences and fewer than 2000 sites.
Hardware
All programs were run on a cluster of 24 computing nodes with Intel(R) Xeon(R) CPU 5140 @ 2.33GHz processors and 8GB of RAM per bi-dual-core unit. Times can be compared because we only considered the effective CPU computing times.
Programs
Six programs and options were compared. All programs were configured with the GTR model for DNA sequences, with WAG for proteins, and with 4 discrete gamma rate categories (alpha estimated from the data).
 PhyML 2.4.5
Previous version of PhyML, optimizing the topology with simultaneous NNIs (original PhyML algorithm), and using a BioNJ starting tree.
 PhyML 3.0 NNI
PhyML, optimizing the topology with both simultaneous NNIs (as in the original PhyML algorithm) and refined NNIs with 5-edge-length optimization, and using a BioNJ starting tree.
 PhyML 3.0 SPR
PhyML, optimizing the topology with SPR (and NNI 3.0) operations, and using a BioNJ starting tree.
 PhyML 3.0 BEST
PhyML, best tree obtained by PhyML 3.0 NNI and PhyML 3.0 SPR.
 PhyML 3.0 BEST RANDOM
PhyML, adding to the BEST option 5 SPR tree searches using random starting trees; the output is the best of the 7 inferred trees.
 RAxML
RAxML version 7.0.
To obtain comparable results, the tree likelihood has been re-optimized by PhyML, keeping the topology fixed but fitting all numerical parameters.
Results
Resulting trees are compared regarding topology, log-likelihood and computing time.
 Computing time ranks
The six methods are ranked for each of the alignments, based on the computing time. The first rank contains the methods whose computing time ranges from the best computing time (B) to 1.25 × B (i.e. nearly the best computing time). The remaining methods are ranked in the same way, until all methods are ranked. Ties are accounted for; e.g. if the first and second groups contain 2 methods each, the ranks will be 1.5 ( (1+2)/2 ) and 3.5 ( (3+4)/2 ). To summarize these results, we provide the median and average ranks for all DNA and protein alignments.
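The grouping and tie-averaging rule above can be sketched as follows. This is one possible reading of the procedure, not the authors' script: the function name `grouped_time_ranks`, the `tolerance` parameter and the timings are hypothetical.

```python
def grouped_time_ranks(times, tolerance=1.25):
    """Rank methods by computing time, grouping near-ties.

    Methods within `tolerance` times the best remaining time share a group,
    and every member of a group receives the average of the positions
    that the group occupies.
    """
    remaining = sorted(times.items(), key=lambda item: item[1])
    ranks = {}
    next_position = 1
    while remaining:
        best = remaining[0][1]
        # Because the list is sorted, the group is always a prefix of it.
        group = [m for m, t in remaining if t <= tolerance * best]
        average_rank = next_position + (len(group) - 1) / 2
        for method in group:
            ranks[method] = average_rank
        next_position += len(group)
        remaining = remaining[len(group):]
    return ranks


# Hypothetical timings in seconds for one alignment (not benchmark data).
example_times = {"m1": 100.0, "m2": 120.0, "m3": 200.0, "m4": 240.0}
time_ranks = grouped_time_ranks(example_times)
print(time_ranks)
```

With these made-up timings, the first two methods fall within 1.25 × 100 and the last two within 1.25 × 200, which reproduces the 1.5 and 3.5 ranks of the example in the text.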
 Topology ranks
The six methods are ranked for each of the alignments using a similar principle, based on the tree likelihood. The first rank contains all methods that find the same best topology, and so on for the remaining methods. We also provide the median and average ranks for all DNA and protein alignments.
 Robinson and Foulds distances
RF is the Robinson and Foulds (bipartition) distance between the best topology and the given topology.
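As an illustration, a normalized bipartition distance can be computed from the sets of internal splits of two trees. The sketch below assumes that the distance is normalized by the total number of internal splits in both trees, which is consistent with the 0-to-1 scale used in the tables but is not stated explicitly here. The helper names and the two 5-taxon trees are hypothetical.

```python
def canonical_split(side, taxa):
    """Represent a bipartition by the side that excludes the smallest taxon."""
    side = frozenset(side)
    reference = min(taxa)
    return frozenset(taxa) - side if reference in side else side


def normalized_rf(splits_a, splits_b):
    """Share of internal bipartitions found in only one of the two trees."""
    total = len(splits_a) + len(splits_b)
    if total == 0:
        return 0.0
    return len(splits_a ^ splits_b) / total


taxa = {"A", "B", "C", "D", "E"}
# Hypothetical unrooted trees ((A,B),C,(D,E)) and ((A,C),B,(D,E)),
# each described by the taxa on one side of its two internal edges.
tree_1 = {canonical_split(s, taxa) for s in ({"A", "B"}, {"D", "E"})}
tree_2 = {canonical_split(s, taxa) for s in ({"A", "C"}, {"D", "E"})}
print(normalized_rf(tree_1, tree_2))
```

The two example trees share the split separating D and E from the rest and differ on the other one, so half of their internal splits are unshared.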
 Delta>5
Another variable of interest is the number of times a method fails to find a phylogeny whose log-likelihood is close to the highest log-likelihood found by any of the methods being compared. We thus counted the number of data sets for which the log-likelihood returned by a given method was smaller than the highest log-likelihood found on the corresponding alignment minus 5.0. While this boundary of 5.0 points of log-likelihood is arbitrary, we believe that it provides a simple and practical way to tell the methods apart at first sight.
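The counting rule can be written down as a short sketch. The function name `count_delta_exceeding`, the method labels and the log-likelihood values below are hypothetical and serve only to show the comparison against the best value minus 5.0.

```python
def count_delta_exceeding(loglks_per_dataset, method, threshold=5.0):
    """Count data sets where `method` is more than `threshold` below the best.

    `loglks_per_dataset` is a list with one dict per data set, mapping each
    method name to the log-likelihood it reached on that data set.
    """
    count = 0
    for loglks in loglks_per_dataset:
        best = max(loglks.values())
        if loglks[method] < best - threshold:
            count += 1
    return count


# Hypothetical log-likelihoods for three data sets (not benchmark data).
example_loglks = [
    {"m1": -1000.0, "m2": -1007.5},
    {"m1": -2003.0, "m2": -2000.0},
    {"m1": -500.0, "m2": -500.0},
]
print(count_delta_exceeding(example_loglks, "m1"))
print(count_delta_exceeding(example_loglks, "m2"))
```

Note that the comparison is strict, so a method exactly 5.0 log-likelihood points below the best one is not counted.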
 SH tests
We used the Shimodaira-Hasegawa (SH) test to assess the statistical significance of the likelihood differences. Every result displays the P-value for the difference between its logLk and the logLk of the best result for the same data. As a summary, we provide the number of times each method is significantly worse than the best one.