Fast dating using least-squares criteria and algorithms

To T.-H., Jung M., Lycett S. and Gascuel O.

ABSTRACT

Phylogenies provide a useful way to understand the evolutionary history of genetic samples, and data sets with more than thousand taxa are becoming increasingly common, notably with viruses (e.g. HIV). Dating ancestral events is one of the first, essential goals with such data. However, current sophisticated probabilistic approaches struggle to handle data sets of this size. Here we present very fast dating algorithms, based on a normal approximation of the Langley-Fitch molecular-clock model. These algorithms apply to serial data, where the tips of the tree are dated, and could be extended to trees with time calibration points. These algorithms estimate the substitution rate and the dates of all ancestral nodes. When the input tree is unrooted, they provide an estimate for the root position. The algorithms exploit the tree (recursive) structure of the problem at hand, and the close relationships between least-squares and linear algebra. We distinguish between the unconstrained setting and the case where the temporal precedence constraints are accounted for. With rooted trees, the former is solved using linear algebra in linear computing time (i.e. proportional to the number of taxa), while the resolution of the latter, constrained setting, is based on an active set method and runs in nearly linear time. With unrooted trees the computing time becomes (nearly) quadratic (i.e. proportional to the square of the number of taxa). In all cases very large input trees (>10,000 taxa) can easily be processed and transformed into time-scaled trees. We compare these two algorithms to standard methods (root-to-tip, r8s version of Langley-Fitch method and BEAST) and show on simulated data that their estimation accuracy is similar to that of the most sophisticated methods, while their computing time is much faster. Then, we apply these algorithms on two data sets: a small data set comprising 17 strains of Dengue virus and a large data set comprising 1,195 strains of Influenza virus from pdm09 H1N1 Human pandemic. Again the results show that these algorithms provide a very fast alternative with results similar to those of other computer programs. These algorithms are implemented in the LSD software (Least-Squares Dating), which can be downloaded from this web page, along with all our data sets and detailed results.


Paper

Software

LSD (least-square dating) program containing the LD and QPD algorithms:

Data sets

Simulated data sets:

Dengue virus data set:

H1N1pdm virus data set:

Contacts

Funding